Python Standard Libraries
Pipenv, originally started as a weekend project by the awesome Kenneth Reitz, aims to bring ideas from other package managers (such as npm or yarn) into the Python world. Forget about installing virtualenv and virtualenvwrapper, managing requirements.txt files, and ensuring reproducibility with regard to the versions of the dependencies of your dependencies. With Pipenv, you specify all your dependencies in a Pipfile, which is normally built by using commands for adding, removing, or updating dependencies. The tool can generate a Pipfile.lock file, enabling your builds to be deterministic and helping you avoid those difficult-to-catch bugs caused by some obscure dependency that you didn’t even think you needed.
Of course, Pipenv comes with many other perks and has great documentation, so make sure to check it out and start using it for all your Python projects, as we do at Tryolabs :)
If there is a library whose popularity has boomed this year, especially in the Deep Learning (DL) community, it’s PyTorch, the DL framework introduced by Facebook.
PyTorch builds on and improves the (once?) popular Torch framework, especially since it’s Python based — in contrast with Lua. Given how people have been switching to Python for doing data science in the last couple of years, this is an important step forward to make DL more accessible.
Most notably, PyTorch has become one of the go-to frameworks for many researchers because of its implementation of the novel Dynamic Computational Graph paradigm. When writing code using other frameworks like TensorFlow, CNTK or MXNet, one must first define something called a computational graph. This graph specifies all the operations that will be run by our code, which are later compiled and potentially optimized by the framework so that it can run even faster, and in parallel on a GPU. This paradigm is called Static Computational Graph, and it is great since you can leverage all sorts of optimizations, and the graph, once built, can potentially run on different devices (since execution is separate from building). However, in many tasks such as Natural Language Processing, the amount of “work” to do is often variable: you can resize images to a fixed resolution before feeding them to an algorithm, but you cannot do the same with sentences, which come in variable length. This is where PyTorch and dynamic graphs shine. By letting you use standard Python control instructions in your code, the graph is defined when it is executed, giving you a lot of freedom, which is essential for several tasks.
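To make the "defined when it is executed" idea concrete, here is a minimal, framework-free sketch. It is not PyTorch code: the `trace` list is a hypothetical stand-in for the graph a dynamic framework would record, and `run_and_trace` is an illustrative name. The point is only that an ordinary Python loop decides how many operations exist for each input.

```python
def run_and_trace(tokens):
    """Accumulate token lengths with a plain loop, recording each operation as it runs."""
    trace = []   # stand-in for the graph a define-by-run framework would build
    state = 0
    for token in tokens:              # the number of steps depends on the input
        state = state + len(token)
        trace.append(("add", len(token)))
    return state, trace


short_state, short_trace = run_and_trace("hello world".split())
long_state, long_trace = run_and_trace("dynamic graphs follow the input length".split())

# A 2-word sentence yields 2 recorded operations, a 6-word sentence yields 6.
print(len(short_trace), len(long_trace))
```

With a static graph, the number of steps would have to be fixed (or padded) before any data is seen; here it simply follows the sentence.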
It might sound crazy, but Facebook also released another great DL framework this year.
The original Caffe framework has been widely used for years and is known for its unparalleled performance and battle-tested codebase. However, recent trends in DL made the framework stagnate in some directions. Caffe2 is the attempt to bring Caffe to the modern world.
It supports distributed training, deployment (even on mobile platforms), the newest CPUs and CUDA-capable hardware. While PyTorch may be better for research, Caffe2 is suitable for large-scale deployments, as seen at Facebook.
Also, check out the recent ONNX effort. You can build and train your models in PyTorch, while using Caffe2 for deployment! Isn’t that great?
Previously, Arrow, a library that aims to make your life easier while working with datetimes in Python, made the list. This year, it is the turn of Pendulum.
One of Pendulum’s strengths is that it is a drop-in replacement for Python’s standard datetime class, so you can easily integrate it with your existing code and leverage its functionality only when you actually need it. The authors have taken special care to ensure timezones are handled correctly, making every instance timezone-aware and UTC by default. You also get an extended timedelta to make datetime arithmetic easier.
Unlike other existing libraries, it strives to have an API with predictable behavior, so you know what to expect. If you are doing any nontrivial work involving datetimes, this will make you happier! Check out the docs for more.
You are doing data science, for which you use the excellent available tools in the Python ecosystem like Pandas and scikit-learn. You use Jupyter Notebooks for your workflow, which is great for you and your colleagues. But how do you share the work with people who do not know how to use those tools? How do you build an interface so people can easily play around with the data, visualizing it in the process? It used to be the case that you needed a dedicated frontend team, knowledgeable in JavaScript, for building these GUIs. Not anymore.
Dash, announced this year, is an open source library for building web applications in pure Python, especially those that make good use of data visualization. It is built on top of Flask, Plotly.js and React, and provides abstractions that free you from having to learn those frameworks and let you become productive quickly. The apps are rendered in the browser and are responsive, so they are usable on mobile devices.
If you would like to know more about what is possible with Dash, the Gallery is a great place for some eye candy.
There are many libraries in Python for doing data science and ML, but when your data points are metrics that evolve over time (such as stock prices, measurements obtained from instruments, etc.), the options are far more limited.
PyFlux is an open source library in Python built specifically for working with time series. The study of time series is a subfield of statistics and econometrics, and the goals can be describing how time series behave (in terms of latent components or features of interest), and also predicting how they will behave in the future.
PyFlux allows for a probabilistic approach to time series modeling, and has implementations of several modern time series models like GARCH. Neat stuff.
It is often the case that you need to make a Command Line Interface (CLI) for your project. Beyond the traditional argparse, Python has some great tools like click or docopt. Fire, announced by Google this year, has a different take on solving this same problem.
Fire is an open source library that can automatically generate a CLI for any Python project. The key here is automatically: you almost don’t need to write any code or docstrings to build your CLI! To do the job, you only need to call a Fire method and pass it whatever you want turned into a CLI: a function, an object, a class, a dictionary, or even no arguments at all (which will turn your entire code into a CLI).
Make sure to read the guide so you understand how it works with examples. Keep it on your radar, because this library can definitely save you a lot of time in the future.
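To give a feel for what "automatically" means, here is a rough sketch of the underlying idea using only the standard library: derive the command line from a function's signature instead of declaring it by hand. This is not Fire's implementation or API. `cli_from_function` and `greet` are hypothetical names, and the sketch only handles parameters whose defaults are str, int or float.

```python
import argparse
import inspect


def cli_from_function(func, argv):
    """Build an argparse parser from func's signature, parse argv, and call func."""
    parser = argparse.ArgumentParser(prog=func.__name__, description=func.__doc__)
    for name, param in inspect.signature(func).parameters.items():
        if param.default is inspect.Parameter.empty:
            # Parameters without a default become required positional arguments.
            parser.add_argument(name)
        else:
            # Parameters with a default become --options converted to the default's type.
            parser.add_argument(f"--{name}", type=type(param.default), default=param.default)
    args = parser.parse_args(argv)
    return func(**vars(args))


def greet(name, greeting="Hello", times=1):
    """Return a greeting repeated a number of times."""
    return " ".join(f"{greeting}, {name}!" for _ in range(times))


# Equivalent to running: greet World --times 2
print(cli_from_function(greet, ["World", "--times", "2"]))
```

Note that `greet` itself contains no CLI code at all; everything the parser needs is read from its signature and docstring.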
In an ideal world, we would have perfectly balanced datasets and we would all train models and be happy. Unfortunately, the real world is not like that, and certain tasks favor very imbalanced data. For example, when predicting fraud in credit card transactions, you would expect that the vast majority of the transactions (+99.9%?) are actually legit. Training ML algorithms naively will lead to dismal performance, so extra care is needed when working with these types of datasets.
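A quick back-of-the-envelope check shows why naive training is misleading here. The numbers below are made up for illustration (10,000 transactions, 10 of them fraudulent), and the "model" is simply one that always answers "legitimate".

```python
# Hypothetical dataset: 10,000 transactions, 10 of them fraudulent (0.1%).
labels = [1] * 10 + [0] * 9990           # 1 = fraud, 0 = legitimate
predictions = [0] * len(labels)          # naive model: always predict "legitimate"

# Accuracy: fraction of all predictions that are correct.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall: fraction of the actual fraud cases that the model caught.
caught = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = caught / sum(labels)

print(f"accuracy={accuracy:.3f} recall={recall:.1f}")
```

The model scores 99.9% accuracy while catching none of the fraud, which is why accuracy alone is a poor guide on imbalanced data.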
Fortunately, this is a well-studied research problem and a variety of techniques exist. Imbalanced-learn is a Python package which offers implementations of some of those techniques, to make your life much easier. It is compatible with scikit-learn and is part of the scikit-learn-contrib projects. Useful!
When you need to search for some text and replace it with something else, as is standard in most data-cleaning work, you usually turn to regular expressions. They will get the job done, but sometimes the number of terms you need to search for is in the thousands, and then regular expressions can become painfully slow.
FlashText is a better alternative just for this purpose. In the author’s initial benchmark, it improved the runtime of the entire operation by a huge margin: from 5 days to 15 minutes. The beauty of FlashText is that the runtime is the same no matter how many search terms you have, in contrast with regular expressions, for which the runtime increases almost linearly with the number of terms.
FlashText is a testimony to the importance of the design of algorithms and data structures, showing that, even for simple problems, better algorithms can easily outdo even the fastest CPUs running naive implementations.
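One data structure that gives this kind of behavior is a trie (prefix tree): all keywords are stored together, and the text is scanned once, so the cost is driven by the length of the text rather than by the number of keywords. The sketch below is a simplified illustration of that general technique, not FlashText's actual code or API; `build_trie`, `replace_keywords` and the sample keywords are hypothetical.

```python
def build_trie(replacements):
    """Store each keyword character by character; the '_end' key marks a complete keyword."""
    trie = {}
    for keyword, clean_name in replacements.items():
        node = trie
        for char in keyword:
            node = node.setdefault(char, {})
        node["_end"] = clean_name
    return trie


def is_boundary(text, index):
    """True if index is outside the text or points at a non-word character."""
    if index < 0 or index >= len(text):
        return True
    return not (text[index].isalnum() or text[index] == "_")


def replace_keywords(text, trie):
    """Scan the text once, replacing the longest whole-word keyword at each position."""
    result = []
    i = 0
    while i < len(text):
        match_end = match_value = None
        if is_boundary(text, i - 1):          # only start matching at the start of a word
            node, j = trie, i
            while j < len(text) and text[j] in node:
                node = node[text[j]]
                j += 1
                if "_end" in node and is_boundary(text, j):
                    match_end, match_value = j, node["_end"]
        if match_end is None:
            result.append(text[i])            # no keyword here: copy one character
            i += 1
        else:
            result.append(match_value)        # keyword found: emit its replacement
            i = match_end
    return "".join(result)


trie = build_trie({"py": "Python", "js": "JavaScript", "big data": "Big Data"})
print(replace_keywords("py and js power big data, not pyramids", trie))
```

Adding a thousand more keywords makes the trie larger, but the scan still visits each position of the text once and walks at most one keyword's length from there.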
Disclaimer: Luminoth, the library described next, was built by Tryolabs’ R&D area.
Images are everywhere nowadays, and understanding their content can be critical for several applications. Thankfully, image processing techniques have advanced a lot, fueled by the advancements in DL.
Luminoth is an open source Python toolkit for computer vision, built using TensorFlow and Sonnet. Currently, it supports object detection out of the box, in the form of a model called Faster R-CNN.
But Luminoth is not only an implementation of a particular model. It is built to be modular and extensible, so customizing the existing pieces or extending it with new models to tackle different problems should be straightforward, with as much code reuse as possible. It provides tools for easily doing the engineering work that is needed when building DL models: converting your data (in this case, images) to an adequate format for feeding your data pipeline (TensorFlow’s tfrecords), doing data augmentation, running the training on one or multiple GPUs (distributed training will be a must when working with large datasets), running evaluation metrics, easily visualizing stuff in TensorBoard and deploying your trained model with a simple API or browser interface, so people can play around with it.
Moreover, Luminoth has straightforward integration with Google Cloud’s ML Engine, so even if you don’t own a powerful GPU, you can train in the cloud with a single command, just as you do on your own local machine.
Bonus: watch out for these
You may have never heard of the libvips library. In that case, you should know that it’s an image processing library, like Pillow or ImageMagick, and supports a wide range of formats. However, compared to other libraries, libvips is faster and uses less memory. For example, some benchmarks show it to be about 3x faster than ImageMagick while using about 15x less memory.
PyVips is a recently released Python binding for libvips, which is compatible with Python 2.7-3.6 (and even PyPy), easy to install with pip and drop-in compatible with the old binding, so if you are using that, you don’t have to modify your code.
If you are doing some sort of image processing in your app, this is definitely something to keep an eye on.
Disclaimer: Requestium, the library described next, was built by Tryolabs.
Sometimes, you need to automate some actions on the web. Be it when scraping sites, doing application testing, or filling out web forms to perform actions on sites that do not expose an API, automation is always necessary. Python has the excellent Requests library, which allows you to perform some of this work, but unfortunately (or not?) many sites make heavy client-side use of JavaScript. This means that the HTML code that Requests fetches, in which you could be trying to find a form to fill for your automation task, may not even have the form itself! Instead, it will be something like an empty div of some sort that will be generated in the browser with a modern frontend library such as React or Vue.
One way to solve this is to reverse-engineer the requests that the JavaScript code makes, which will mean many hours of debugging and fiddling around with (probably) uglified JS code. No thanks. Another option is to turn to libraries like Selenium, which allow you to programmatically interact with a web browser and run the JavaScript code. With this, the problems are no more, but it is still slower than using plain Requests, which adds very little overhead.
Wouldn’t it be cool if there was a library that let you start out with Requests and seamlessly switch to Selenium, only adding the overhead of a web browser when actually needing it? Meet Requestium, which acts as a drop-in replacement for Requests and does just that. It also integrates Parsel, so writing all those selectors for finding the elements in the page is much cleaner than it would otherwise be, and has helpers around common operations like clicking elements and making sure stuff is actually rendered in the DOM. Another time saver for your web automation projects!
You like the awesome API of scikit-learn, but need to do work using PyTorch? Worry not, skorch is a wrapper which will give PyTorch an interface like sklearn. If you are familiar with those libraries, the syntax should be straightforward and easy to understand. With skorch, you will get some code abstracted away, so you can focus more on the things that really matter, like doing your data science.
Since the release of AWS Lambda (and others that have followed), all the rage has been about serverless architectures. These allow microservices to be deployed in the cloud, in a fully managed environment where one doesn’t have to care about managing any server, but is instead assigned stateless, ephemeral computing containers that are fully managed by a provider. With this paradigm, events (such as a traffic spike) can trigger the execution of more of these containers and therefore make it possible to handle “infinite” horizontal scaling.
Zappa is the serverless framework for Python, although (at least for the moment) it only has support for AWS Lambda and AWS API Gateway. It makes building apps with this architecture very simple, freeing you from most of the tedious setup you would have to do through the AWS Console or API, and has all sorts of commands to ease deployment and the management of different environments.
Who said Python couldn’t be fast? Apart from competing for the best name of a software library ever, Sanic also competes for the fastest Python web framework ever, and appears to be the winner by a clear margin. It is a Flask-like Python 3.5+ web server that is designed for speed. Another library, uvloop, is an ultra fast drop-in replacement for asyncio’s event loop that uses libuv under the hood. Together, these two things make a great combination!
According to the Sanic author’s benchmark, uvloop could power this beast to handle more than 33k requests/s, which is just insane (and faster than node.js). Your code can benefit from the new async/await syntax, so it will look neat too; besides, we love the Flask-style API. Make sure to give Sanic a try, and if you are using asyncio, you can surely benefit from uvloop with very little change in your code!
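For readers who have not used the async/await syntax yet, here is a small sketch using only the standard library's asyncio (no Sanic or uvloop involved). The `handle` coroutine is a hypothetical stand-in for a request handler that waits on I/O, and `asyncio.run` assumes Python 3.7 or newer.

```python
import asyncio


async def handle(request_id, delay):
    """Pretend to wait on I/O (for example, a database call) without blocking the loop."""
    await asyncio.sleep(delay)
    return f"response {request_id}"


async def main():
    # gather() runs the three coroutines concurrently on a single event loop
    # and returns their results in the order they were passed in.
    return await asyncio.gather(handle(1, 0.1), handle(2, 0.1), handle(3, 0.1))


responses = asyncio.run(main())
print(responses)
```

The three waits overlap instead of adding up, which is the property that lets a single-threaded event loop serve many requests at once.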
In line with recent developments for the asyncio framework, the folks from MagicStack bring us asyncpg, an efficient asynchronous (currently CPython 3.5 only) database interface library designed specifically for PostgreSQL. It has zero dependencies, meaning there is no need to have libpq installed. In contrast with psycopg2 (the most popular PostgreSQL adapter for Python), which exchanges data with the database server in text format, asyncpg implements the PostgreSQL binary I/O protocol, which not only allows support for generic types but also comes with numerous performance benefits.
The benchmarks are clear: asyncpg is, on average, at least 3x faster than psycopg2 (or aiopg), and faster than the node.js and Go implementations.
If you have your infrastructure on AWS or otherwise make use of their services (such as S3), you should be very happy that boto, the Python interface for the AWS API, got a complete rewrite from the ground up. The great thing is that you don’t need to migrate your app all at once: you can use boto3 and boto (2) at the same time, for example using boto3 only for new parts of your application.
The new implementation is much more consistent between different services, and since it uses a data-driven approach to generate classes at runtime from JSON description files, it will always get fast updates. No more lagging behind new Amazon API features: move to boto3!
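The "data-driven" idea can be sketched in a few lines of plain Python: read a description of a service and build a class with one method per operation at runtime. This is only an illustration of the general technique, not boto3's code. The JSON layout, `build_client_class` and the operation names are all hypothetical, and no request is actually sent.

```python
import json

# Hypothetical, heavily simplified service description.
SERVICE_JSON = """
{
  "name": "Storage",
  "operations": ["list_buckets", "create_bucket"]
}
"""


def make_operation(op_name):
    """Create a method for one described operation."""
    def operation(self, **params):
        # A real client would build and send an HTTP request here; this just echoes the call.
        return {"operation": op_name, "params": params}
    operation.__name__ = op_name
    return operation


def build_client_class(description):
    """Create a class at runtime with one method per operation in the description."""
    methods = {op: make_operation(op) for op in description["operations"]}
    return type(description["name"] + "Client", (object,), methods)


StorageClient = build_client_class(json.loads(SERVICE_JSON))
client = StorageClient()
print(client.create_bucket(Bucket="demo"))
```

Supporting a new operation then only requires a new entry in the description file, with no change to the Python code, which is why such a design can keep up with a fast-moving API.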
Do we even need an introduction here? Since it was released by Google in November 2015, this library has gained huge momentum and has become the #1 trendiest GitHub Python repository. In case you have been living under a rock for the past year, TensorFlow is a library for numerical computation using data flow graphs, which can run on a GPU or CPU.
We have quickly witnessed it become a trend in the Machine Learning community (especially Deep Learning, see our post on 10 main takeaways from MLconf), not only growing its uses in research but also being widely used in production applications. If you are doing Deep Learning and want to use it through a higher level interface, you can try using it as a backend for Keras (which made it to last year’s post) or the newer TensorFlow-Slim.
If you are into AI, you have surely heard about OpenAI, the non-profit artificial intelligence research company (backed by Elon Musk et al.). The researchers have open sourced some Python code this year! Gym is a toolkit for developing and comparing reinforcement learning algorithms. It consists of an open-source library with a collection of test problems (environments) that can be used to test reinforcement learning algorithms, and a site and API that allow you to compare the performance of trained algorithms (agents). Since it doesn’t care about the implementation of the agent, you can build them with the computation library of your choice: bare numpy, TensorFlow, Theano, etc.
We also have the recently released universe, a software platform for research into general intelligence across games, websites and other applications. This fits perfectly with gym, since it allows any real-world application to be turned into a gym environment. Researchers hope that this limitless possibility will accelerate research into smarter agents that can solve general purpose tasks.
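The separation between environment and agent can be illustrated without installing anything. The toy below does not use Gym itself: `CountdownEnv`, its rewards and both policies are hypothetical, and the reset/step method names with a four-value return are modeled on how Gym's interface is usually described, so treat them as assumptions.

```python
class CountdownEnv:
    """Hypothetical toy environment: drive a counter from 5 down to 0."""

    def reset(self):
        self.value = 5
        return self.value                      # initial observation

    def step(self, action):
        self.value = max(0, self.value - action)
        done = self.value == 0
        reward = 10 if done else -1            # small cost per step, bonus at the goal
        return self.value, reward, done, {}    # observation, reward, done, info


def run_episode(env, policy):
    """Run one episode; the environment never inspects how the policy is implemented."""
    observation, total_reward, done = env.reset(), 0, False
    while not done:
        observation, reward, done, _info = env.step(policy(observation))
        total_reward += reward
    return total_reward


def always_two(observation):
    """Agent that ignores the observation and always subtracts 2."""
    return 2


def exact(observation):
    """Agent that subtracts exactly what is left."""
    return observation


print(run_episode(CountdownEnv(), always_two))
print(run_episode(CountdownEnv(), exact))
```

Because the loop only needs a callable that maps observations to actions, the same environment can compare agents built with any library.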
You may be familiar with some of the libraries Python has to offer for data visualization, the most popular of which are matplotlib and seaborn. Bokeh, however, is created for interactive visualization, and targets modern web browsers for the presentation. This means Bokeh can create a plot which lets you explore the data from a web browser. The great thing is that it integrates tightly with Jupyter Notebooks, so you can use it with what is probably your go-to tool for research. There is also an optional server component, bokeh-server, with many powerful capabilities like server-side downsampling of large datasets (no more slow network transfers/browser!), streaming data, transformations, etc.
Make sure to check the gallery for examples of what you can create. They look awesome!
Sometimes, you want to run analytics over a dataset too big to fit your computer’s RAM. If you cannot rely on numpy or Pandas, you usually turn to other tools like PostgreSQL, MongoDB, Hadoop, Spark, or many others. Depending on the use case, one or more of these tools can make sense, each with their own strengths and weaknesses. The problem? There is a big overhead here because you need to learn how each of these systems works and how to insert data in the proper form.
Blaze provides a uniform interface that abstracts you away from several database technologies. At the core, the library provides a way to express computations. Blaze itself doesn’t actually do any computation: it just knows how to instruct a specific backend, which will be in charge of performing it. There is much more to Blaze (hence the ecosystem), such as the libraries that have come out of its development. For example, Dask implements a drop-in replacement for the NumPy array that can handle content larger than memory and leverage multiple cores, and also comes with dynamic task scheduling. Interesting stuff.
There is a famous saying that there are only two hard problems in Computer Science: cache invalidation and naming things. I think the saying is clearly missing one thing: managing datetimes. If you have ever tried to do that in Python, you will know that you have to juggle a gazillion modules and types from the standard library and third-party packages: datetime, date, calendar, tzinfo, timedelta, relativedelta, pytz, etc. Worse, datetimes are timezone naive by default.
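The "naive by default" problem is easy to see with the standard library alone. The sketch below uses only the datetime module; the date and the UTC-3 offset are arbitrary values chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

naive = datetime(2017, 12, 19, 12, 0)                             # no tzinfo attached
aware_utc = datetime(2017, 12, 19, 12, 0, tzinfo=timezone.utc)    # explicitly UTC

print(naive.tzinfo)       # None: the object does not know which timezone it is in
print(aware_utc.tzinfo)   # UTC

# Convert the aware value to a fixed UTC-3 offset (a hypothetical local zone).
local = aware_utc.astimezone(timezone(timedelta(hours=-3)))
print(local.isoformat())  # 2017-12-19T09:00:00-03:00

# Ordering a naive value against an aware one is an error rather than a silent guess.
comparison_error = None
try:
    naive < aware_utc
except TypeError as exc:
    comparison_error = str(exc)
print(comparison_error)
```

Libraries that make every instance timezone-aware remove the first case entirely, so the mixed comparison can never come up by accident.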
Arrow is “datetime for humans”, offering a sensible approach to creating, manipulating, formatting and converting dates, times, and timestamps. It is a replacement for the datetime type that supports Python 2 or 3, and provides a much nicer interface as well as filling the gaps with new functionality (such as humanize). Even if you don’t really need arrow, using it can greatly reduce the boilerplate in your code.
Expose your internal API externally, drastically simplifying Python API development. Hug is a next-generation Python 3 (only) library that will provide you with the cleanest way to create HTTP REST APIs in Python. It is not a web framework per se (although that is a function it performs exceptionally well), but only focuses on exposing idiomatically correct and standard internal Python APIs externally. The idea is simple: you define logic and structure once, and you can expose your API through multiple means. Currently, it supports exposing a REST API or a command line interface.
How hard would it be for a painter to paint without immediately seeing the results of what they are doing? Jupyter Notebooks make it easy to interact with code, plots and results, and are becoming one of the preferred tools for data scientists. These Notebooks are documents which combine live code and documentation. For this reason, Jupyter is our go-to tool for creating fast prototypes or tutorials.
Although we use Jupyter for writing Python code only, nowadays it has added support for other programming languages such as Julia or Haskell.
The retrying library helps you avoid reinventing the wheel: it implements retrying behavior for you. It provides a generic decorator which makes giving retrying abilities to any method effortless, and it also has a bunch of properties you can set in order to get the desired retrying behavior, such as maximum number of attempts, delay, backoff sleeping, error conditions, etc. Small and simple.
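To show what such a decorator does under the hood, here is a hand-rolled sketch using only the standard library. It is not the retrying library's API: the `retry` decorator, its parameter names and the `flaky` function are hypothetical, written just to illustrate maximum attempts, delay and exponential backoff.

```python
import functools
import time


def retry(max_attempts=3, delay=0.01, backoff=2, exceptions=(Exception,)):
    """Retry the wrapped function, multiplying the wait by backoff after each failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise              # out of attempts: surface the last error
                    time.sleep(wait)
                    wait *= backoff
        return wrapper
    return decorator


calls = {"count": 0}


@retry(max_attempts=4, exceptions=(ConnectionError,))
def flaky():
    """Hypothetical operation that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = flaky()
print(result, calls["count"])
```

The caller sees a single successful call; the two failures and the waits between them are handled entirely inside the decorator.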
As of 2015, the most important libraries have all been ported to Python 3, so we started embracing it. We really liked asyncio for writing concurrent code using coroutines, so we needed an HTTP client (such as requests) and server using the same concurrency paradigm. The aiohttp library is exactly that, providing a clean and easy-to-use HTTP client/server for asyncio.
We have tried several solutions for subprocess wrappers in order to call other scripts or executables from Python programs, but the model of plumbum blows them all away. With an easy to use syntax you can execute local or remote commands, get the output or error codes in a cross-platform way, and if that were not enough, you get composability (a la shell pipes) and an interface for building command line applications. Give it a try!
Working with and validating phone numbers can be a real pain, as there are international prefixes and area codes to take into account, and possibly other things depending on the country. The phonenumbers Python library is a port of Google’s libphonenumber, which thankfully simplifies this. It can be used to parse, format and validate phone numbers with very little code involved. Most importantly, phonenumbers can tell whether a phone number is unique or not (following the E.164 format). It also works on both Python 2 and Python 3.
We have used this library extensively in many projects, mostly through its adaptation django-phonenumber-field, as a way to solve this tedious problem that pretty much always pops up.
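For context, E.164 numbers have a simple overall shape: a plus sign followed by at most 15 digits, starting with a non-zero country code digit. The sketch below checks only that shape with a regular expression. It does not use phonenumbers and cannot tell whether a number is actually valid or assigned, which is exactly the hard part the library handles; `looks_like_e164` is a hypothetical helper.

```python
import re

# E.164 shape: "+", then 2 to 15 digits, the first of which is not 0.
E164_SHAPE = re.compile(r"\+[1-9]\d{1,14}", re.ASCII)


def looks_like_e164(number):
    """Return True if the string has the E.164 shape; this says nothing about validity."""
    return E164_SHAPE.fullmatch(number) is not None


print(looks_like_e164("+14155552671"))   # well-formed shape
print(looks_like_e164("4155552671"))     # missing the leading "+"
```

Country-specific rules (number lengths, area codes, reserved ranges) are not captured by a pattern like this, which is why a dedicated library is worth using.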
Graphs and networks are tools often used for many different tasks, such as organizing data, showing its flow, or representing relations between entities. NetworkX allows the creation and manipulation of graphs and networks. The algorithms used in NetworkX make it highly scalable, which makes it ideal for working with large graphs. Moreover, there are tons of options for rendering graphs, making it an awesome visualization tool too.
If you are thinking about storing loads of data on a time-series basis, then you have to consider using InfluxDB. InfluxDB is a time-series database we have been using to store measurements over time. Through a RESTful API, it’s super easy to use and very efficient, which is a must when talking about a lot of data. Additionally, retrieving and grouping data is painless due to its built-in clustering functionality. The official client abstracts away most of the work of invoking the API, although we would really like to see it improved by implementing a Pythonic way to create queries instead of writing raw JSON.
If you have ever used Elasticsearch, you have surely suffered going over those long queries in JSON format, wasting time trying to find out where the parsing error is. The Elasticsearch DSL client is built upon the official Elasticsearch client and frees you from having to worry about JSON again: you simply write everything using Python-defined classes or queryset-like expressions. It also provides wrappers for working with documents as Python objects, mappings, etc.
Deep learning is the new trend, and here is where keras shines. It can run on top of Theano and allows fast experimentation with a variety of neural network architectures. Highly modular and minimalistic, it can run seamlessly on CPU and GPU. Having something like keras was key for some of the R&D projects we tackled in 2015.
If you are into NLP (Natural Language Processing) and haven’t heard about Gensim, you are living under a rock. It provides fast and scalable (memory independent) implementations of some of the most used algorithms such as tf-idf, word2vec, doc2vec, LSA, etc, as well as an easy to use and well documented interface.
Python Bokeh
Installing Jupyter Notebook
- While Jupyter runs code in many programming languages, Python is a requirement (Python 3.6 or greater, or Python 2.7) for installing the Jupyter Notebook.
- We recommend using the Anaconda distribution to install Python and Jupyter.
If you are an experienced Python user, you can install Jupyter with pip instead:
- Open a command prompt (cmd).
- Type the following command:
pip install jupyter
- Then start the notebook by typing:
jupyter notebook
- The browser will open automatically. Click New -> Python 3.
Now we can type our code in the Python notebook and run it.
Installing Anaconda
Anaconda conveniently installs Python, the Jupyter Notebook, and other commonly used packages for scientific computing and data science.
Use the following installation steps:
- Download Anaconda. We recommend downloading Anaconda’s latest Python 3 version (currently Python 3.6).
The open source Anaconda Distribution is the easiest way to do Python data science and machine learning. It includes hundreds of popular data science packages and the conda package and virtual environment manager for Windows, Linux, and MacOS. Conda makes it quick and easy to install, run, and upgrade complex data science and machine learning environments like scikit-learn, TensorFlow, and SciPy. Anaconda Distribution is the foundation of millions of data science projects as well as Amazon Web Services' Machine Learning AMIs and Anaconda for Microsoft on Azure and Windows.
Reproducible Data Science and Machine Learning
The Python and R conda packages in the Anaconda Repository are curated and
compiled in our secure environment so you get optimized binaries that
"just work" on your system. Combined with conda's virtual
environments and deep dependency management, you can easily reproduce exactly
the same data science results across Windows, Linux, and MacOS systems. conda
package builders on anaconda.org.