Virtual Environments

One thing I’ve found in my transition into data science is that no matter how many classes, bootcamps, and projects you do, few resources (that I’ve found so far) really dive into virtual environments and how to incorporate them into your data science workflow. Either they aren’t mentioned at all – implicitly deemed unnecessary in favour of diving into the “meat” of the project or task at hand – or they’re treated as assumed prior knowledge. In this post, I hope to elucidate the importance of a virtual environment, how to go about setting one up for Python data science projects, and the process of managing and maintaining an environment. Please note that there are several ways of setting up virtual environments, and different setups may be more or less optimal for different users, but I will cover venv and conda, which I believe are the most accessible and should be more than enough to get you started. If you’re on Windows, a) I’m sorry, and b) perhaps this resource would be more useful for you.

What is a virtual environment, and why should I care?

Much like it sounds, a virtual environment is an isolated environment, segregated from your global machine, that allows for version control – particularly in relation to individual projects. Since Python can ship ten or more releases a year once you count bugfix versions across maintained branches, maintaining some kind of consistency is crucial when you want assurance that your work is unaffected when an update has taken place. Nothing is worse than realising that your system has updated, going back to a project that needs a small tweak, and finding more breaks than you know what to do with. Whether you use your work or home computer for your Python projects, your system can also become a little crowded with packages and libraries – especially if you’re like me and can’t wait to get your teeth into a new project – and managing the dependencies of interrelated libraries and packages can be exhausting.

From a backend perspective, there is also the consideration of where these modules are stored. Modules imported in Python (using the import statement) are stored in a different location to packages installed at the command line using pip – the difference lies in whether the modules are part of the Python standard library or not. I won’t go into this any further here, but I will refer you to this article if you’re interested in learning more.

Knowing what each project actually needs is important when you’re building out projects for another user, team, or client. A virtual environment allows you to monitor your imported libraries, control which version of Python you are using, manage the dependencies between different packages, and ultimately pass on the requirements of a project alongside the project code itself. It also allows you to set up different dependencies for different projects (have a client that only works in Python 2? No problem!).

Which virtual environment is right for me?

In this tutorial, I’m going to cover two different kinds of virtual environment. The first, venv, is available by default in Python 3.3 or later. conda environments, on the other hand, can be powerful when used alongside Jupyter notebooks, since conda acts as both a package manager and an environment manager, and is language-agnostic.

Honestly, there is no “best” environment, and your choice will likely come down to how much setup and maintenance you’re comfortable with, as well as whether you need the additional features that conda (or third-party tools like virtualenv) provide.

Using and managing virtual environments – the workflow with venv

Ok, so I’m starting a new project, which I’m going to call my-awesome-ds-project. If you’re new to data science, I highly recommend taking a look at Cookiecutter Data Science as a structure for your data science projects. I modify mine slightly to look like this:

my-awesome-ds-project
|––README.md
|––data				#data dump, raw, interim, and processed data
|––models			#trained models, model predictions
|––notebooks		#Jupyter notebooks
|––references 		#explanatory material
|––reports			#presentations, reports, figures
|––src				#source code, scripts for use in the project
|––tests			#making sure my code does what I think it does
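
If you prefer to scaffold this structure by hand rather than via the cookiecutter template, a quick sketch from the terminal will do it:

% mkdir my-awesome-ds-project && cd my-awesome-ds-project
% mkdir data models notebooks references reports src tests
% touch README.md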

So far, so good! To set up a virtual environment, all you need to do is navigate to your project directory and execute the venv module (I’m going to call mine awesome_venv, but feel free to name it whatever you please):

% cd my-awesome-ds-project
% python3 -m venv awesome_venv/

Now your new directory awesome_venv is inside your project folder:

my-awesome-ds-project
|––README.md
|––awesome_venv		#oh, hello there
|––data				
|––models			
|––notebooks		
|––references 		
|––reports			
|––src				
|––tests			

Now, if you were to take a peek inside your new venv folder, it would look something like this:

awesome_venv
├── bin
│   ├── activate
│   ├── activate.csh
│   ├── activate.fish
│   ├── easy_install
│   ├── easy_install-3.5
│   ├── pip
│   ├── pip3
│   ├── pip3.5
│   ├── python -> python3.5
│   ├── python3 -> python3.5
│   └── python3.5 -> /Library/Frameworks/Python.framework/Versions/3.5/bin/python3.5
├── include
├── lib
│   └── python3.5
│       └── site-packages
└── pyvenv.cfg

Not wanting to overcomplicate things here, but a quick overview of the three main subfolders:

  • bin contains the scripts that interact with the virtual environment, including the activate script we’ll use in a moment
  • include contains the C headers needed to compile Python packages
  • lib contains a folder for your Python version, holding the site-packages directory where each installed module or package lives

To activate the new virtual environment, we need to use the bin/activate path:

% source awesome_venv/bin/activate
(awesome_venv) %

Look at that! The command-line prompt now shows that you’re working inside your virtual environment. Now we can really get our teeth into our project! By default, only pip and setuptools are installed inside a virtual environment, so any additional library you need will have to be installed separately, using the pip install commands below. If you’re used to writing all your code at the command line or in .py files, then you can work on your project as usual at this point. But if you’re using Jupyter notebooks, then a little more setup is required. While in your virtual environment, run the following, which installs Jupyter into the virtual environment and registers it as a kernel (you will only need to do this once per virtual environment you set up):

(awesome_venv) % pip install jupyter notebook
(awesome_venv) % pip install ipykernel==4.8.2 #syntax for installing a specific version
(awesome_venv) % python -m ipykernel install --user --name=awesome_venv
(awesome_venv) % jupyter notebook notebooks/my-file.ipynb #omit the last argument to open the notebook dashboard instead

This gets you up and going straight away, without needing to go through the Anaconda system. The ipykernel module is what allows you to use your virtual environment from within Jupyter notebooks. When you first open a notebook, the kernel name in the top-right corner will show that you’re in the global Python 3 kernel; to work inside your virtual environment instead, just switch to the awesome_venv kernel from the Kernel menu. Now that your environment is set up, you can almost work on your projects as normal – just remember to install the modules you use before you import them, otherwise you will be greeted with a horrible error message. To install packages from inside Jupyter notebooks, prefix the command with ! (the ! allows you to run command-line arguments from within your notebook), as in the example below.
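
For example, to install pandas from inside a notebook cell (pandas here is just an illustrative choice – swap in whatever library the error complained about):

!pip install pandas

You can then continue with your project (I’m still mourning the end of Schitt’s Creek). And that’s all there is to it! Before I move on to the conda method, I just want to go over how to manage the venv environment.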

Managing environments

So it might have occurred to you that installing packages one-by-one might get a little tedious. Luckily, there’s a solution for that. But before I get on to how to install those dependencies en masse, let’s take a little detour (there’s a point, I promise).

So you’re working on a project, and your colleague, Jenny, decides that the awesome new model you built could be useful for something that she’s working on. The question is, how does she set up her environment so that she can be sure that the model will run on her system without a hitch? Enter the requirements.txt file, which allows us to make our environment reproducible for others and, thankfully, is super simple to create.

(awesome_venv) % pip list #not strictly necessary
(awesome_venv) % pip freeze > requirements.txt

After completing the second command, your project folder will look like this:

my-awesome-ds-project
|––README.md
|––awesome_venv		
|––data				
|––models			
|––notebooks		
|––references 		
|––reports	
|––requirements.txt			#hello!
|––src				
|––tests
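
If you peek inside requirements.txt, it simply lists every installed package pinned to an exact version – something like this (illustrative; your versions will differ):

ipykernel==4.8.2
jupyter==1.0.0
notebook==6.4.0
numpy==1.20.1
pandas==1.2.4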

This is super important when distributing your project: if Jenny were to pull down your project from your GitHub repository and try to use your virtual environment directly, it could cause major problems – for instance, if she’s running a different operating system. To minimise the guesswork, it’s best to stick with requirements.txt files (for now at least), and keep your venv folder out of the repository altogether.
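
One simple way to do that – assuming you’re using git – is to add the environment folder to your .gitignore:

% echo "awesome_venv/" >> .gitignore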

Coming back to the issue of installing multiple dependencies at once, all Jenny would need to do, once she’s cloned your repository, is to create her own virtual environment and install the requirements.txt file:

Jenny% cd test-project
Jenny% python3 -m venv test-venv/
Jenny% source test-venv/bin/activate
(test-venv) Jenny% pip install -r requirements.txt

Now Jenny is up and ready to go!

One final thing. When you’re done in your environment and want to go back to running commands on your global machine, use the deactivate command to return to the “status quo”:

(awesome_venv) % deactivate
%

Troubleshooting

Of course, life isn’t easy. There might still be some issues that crop up with dependencies between your global and virtual environments. If that happens, the easiest fix is to remove the old virtual environment and recreate it from your own requirements.txt file:

% rm -r awesome_venv/
% python3 -m venv wonder-venv/
% source wonder-venv/bin/activate
(wonder-venv) % pip install -r requirements.txt

You should be operational at this point, but if not, I refer you to the forums of many frustrated venv users.

Using and managing virtual environments – the workflow with conda

Now, the very cool thing about conda environments is that a) they are language-agnostic, which means you can use other languages if you’re multilingual (I won’t be doing this here, but for a very good post, read https://towardsdatascience.com/a-guide-to-conda-environments-bc6180fc533), and b) conda does the job of both pip and venv, managing both the packages you install and the environment itself. Neat, huh?

To create an environment with conda for Python, you can use the following command (replacing awesome-conda-env with the name of your choice):

% conda create --name awesome-conda-env python
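
If you need a particular version of Python, you can pin it when creating the environment – for example (3.9 here is just an illustrative choice):

% conda create --name awesome-conda-env python=3.9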

Conda is smart. If there is a specific library you need to install, you can simply name the package during the creation step, and conda will work out which dependencies and language you are using. For this reason, conda recommends installing all packages at once: dependencies are resolved at installation time, so installing packages one at a time can cause problems later on, with the last package you install typically determining the outcome of any dependency conflicts. See the documentation for more details.

% conda create --name awesome-conda-env numpy
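
Following that recommendation, you could ask for everything your project needs in a single command – a sketch, with illustrative package choices:

% conda create --name awesome-conda-env python=3.9 numpy pandas jupyter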

Finally, you can activate your environment and install packages:

% conda activate awesome-conda-env
(awesome-conda-env) % conda install numpy=1.20.1 pandas=1.2.4
(awesome-conda-env) % conda install jupyter notebook

By explicitly telling your environment the specific packages you wish to use, you essentially “pin” the environment to using these versions unless you update them manually.
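
If you later decide to update one of those pinned packages manually, conda can handle that too – for example:

(awesome-conda-env) % conda update numpy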

Some packages are not available through the Anaconda Repository, but can instead be found through Anaconda Cloud, which provides access to various channels, including conda-forge and bioconda. For these, a small modification is needed:

(awesome-conda-env) % conda install --channel conda-forge ipykernel

Conda has approximately 25,000 libraries available through the Anaconda Repository, conda-forge, and bioconda, but this pales in comparison to the ~150,000 available through PyPI. As a result, you may still need to use pip to install some libraries, as follows – note that you do not have to install pip separately, because it is already included in your conda environment:

(awesome-conda-env) % pip install datetime

Note that the recommendation from Anaconda is that you install all conda packages first, and add any pip-installed packages at the end of your modifications to the environment. The reason for this is that conda is less able to manage dependencies for pip-installed libraries, and modifications made after installing libraries via pip might break your environment. Thankfully, conda keeps track of where your packages were installed from (note: there are many automatically-installed dependency packages as well, but for ease of reading I have trimmed the view below):

(awesome-conda-env) % conda list
# packages in environment at /Users/user_name/opt/anaconda3/envs/awesome-conda-env:
#
# Name                    Version                   Build  Channel
datetime                  4.3                      pypi_0    pypi
ipykernel                 5.5.5            py39h71a6800_0    conda-forge
ipython                   7.23.1           py39h71a6800_0    conda-forge
jupyter_client            6.1.12             pyhd8ed1ab_0    conda-forge
jupyter_core              4.7.1            py39h6e9494a_0    conda-forge
numpy                     1.20.1           py39hd6e1bb9_0
numpy-base                1.20.1           py39h585ceec_0
pandas                    1.2.4            py39h23ab428_0
pip                       21.0.1           py39hecd8cb5_0
python                    3.9.4                h88f2d9e_0

One small point to note: unlike venv, where your environment is a directory inside your project folder, conda environments created as in my example above are the easiest to set up, but they all live in the same envs directory, so each environment needs its own unique name. To list them, use the following command (the * marks the currently active environment – since I ran the command from inside my active environment, awesome-conda-env is starred, but outside the environment, base would be starred instead):

(awesome-conda-env) % conda env list
# conda environments:
#
base                     /Users/user_name/opt/anaconda3
awesome-conda-env     *  /Users/user_name/opt/anaconda3/envs/awesome-conda-env

Finally, to get up and running in Jupyter notebooks as before, you will need to run the following:

(awesome-conda-env) % python -m ipykernel install --user --name=awesome-conda-env

There are some tricky things about conda, such as where your environments live, and the different installation methods depending on where a package can be found – be that the Anaconda Repository, other conda channels such as those on Anaconda Cloud, or PyPI. It’s worth checking whether the package you want is available (and updated frequently) through conda before installing via pip, to avoid dependency hell (the conda search sketch at the end of this section makes this check easy).

However, the advantages of using conda might outweigh these difficulties. Because conda is a cross-platform package manager, packages installed through conda are pre-built binaries, which means you don’t need a compiler to install them – and, helpfully, this is what makes packages written in C, R, or other languages available. As a result, a conda environment can contain software written in any language, and conda’s satisfiability solver verifies that the requirements of all packages installed in an environment are met (and will notify you if there is a conflict), so long as you use conda’s package manager rather than pip. In this sense, conda environments can be a little more reliable. Additionally, you can easily work with different versions of Python across your environments (if that is something you need), since conda treats Python itself as just another package. This article is a great resource for understanding more of the advantages of using conda over pip.
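
As an aside, here is that quick sketch: to check whether a package is available through a particular channel before falling back on pip, you can use conda search:

% conda search --channel conda-forge ipykernel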

Managing your conda environment

As before, we need to make our environment reproducible for others by including a file in the project’s root directory listing packages installed along with their version numbers. With conda, instead of creating a requirements.txt file, we create an environment.yml file as follows:

(awesome-conda-env) % conda env export --file environment.yml
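
The exported file records the environment name, channels, and pinned dependencies (including any pip-installed packages). Trimmed down, and purely illustrative, it looks something like this:

name: awesome-conda-env
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.9.4
  - numpy=1.20.1
  - pandas=1.2.4
  - pip:
    - datetime==4.3
prefix: /Users/user_name/opt/anaconda3/envs/awesome-conda-env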

Given an environment.yml file, you can then easily re-create an environment:

% conda env create --name test-env --file /path/to/environment.yml

Note that, unfortunately, .yml and .txt files cannot be used interchangeably between environments, so this might be something to consider when working on a team with people who mostly use venv, for instance.

One really cool thing about conda is that if you are someone who uses a core set of packages, say numpy and pandas, you can set up your anaconda configuration so that these packages are installed as standard in any new environment:

% conda config --add create_default_packages numpy=1.20.1 pandas=1.2.4

This can ultimately save you so much time if you consistently need to install multiple packages each time you create an environment.
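
If you ever want to check what defaults you’ve configured, or skip them for a one-off environment, both are straightforward (bare-env below is just an example name):

% conda config --show create_default_packages
% conda create --no-default-packages --name bare-env python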

Finally, to deactivate your environment, simply run

(awesome-conda-env) % conda deactivate
%

Summary

There’s a lot of information packed into this post, so here are the key points to take away.

Virtual environments help to:

  • Resolve dependency issues by allowing you to use different versions of a package for different projects if needed, or even install packages regardless of whether you have admin privileges on that machine.
  • Make your project self-contained and reproducible.
  • Keep your folders tidy!

Whether you use conda or venv largely comes down to personal preference, but:

  • venv environments tend to be a little quicker to set up, and live inside your project folder, keeping all your project’s needs in one place. venv itself only manages the environment – pip handles the packages, without conda-style whole-environment dependency solving – but there are tools out there, such as poetry, that act more like conda (I just haven’t quite managed to get around to trying one out yet!).
  • conda acts as both an environment and a package manager, and ensures that dependencies are met for all conda-installed packages (note that while you can install PyPI packages via pip, conda is less able to manage those dependencies). As a result, packages can take slightly longer to download, but this might be worth it in the long run! conda is also language-agnostic, which helps if you frequently work in, say, R as well as Python – you won’t have to use different workflows when working in different languages!
  • Both environments can be easily integrated with Jupyter notebooks to create Data Science projects!

I’ve included a table of the key commands you’ll need in your venv and conda environments:

| | venv | conda |
| --- | --- | --- |
| creating a new environment | python3 -m venv [PROJECT_NAME] | conda create --name [PROJECT_NAME] python |
| activating an environment | source [PROJECT_NAME]/bin/activate | conda activate [PROJECT_NAME] |
| installing libraries | pip install [PACKAGE_NAME] | conda install [PACKAGE_NAME]<br>conda install --channel conda-forge [PACKAGE_NAME]<br>pip install [PACKAGE_NAME] |
| setting up a kernel in Jupyter notebook | python -m ipykernel install --user --name=[PROJECT_NAME] | python -m ipykernel install --user --name=[PROJECT_NAME] |
| view packages in environment | pip list | conda list |
| saving environment settings | pip freeze > requirements.txt | conda env export --file environment.yml |
| create environment from saved file | pip install -r requirements.txt | conda env create --name [PROJECT_NAME] --file /path/to/environment.yml |
| deactivate environment | deactivate | conda deactivate |

I hope reading and following along with this post has been as helpful for you as writing it has been for me! If you’re interested in other stuff I have to say, please feel free to follow me on LinkedIn or GitHub.

Until next time!

Post-script

Once you have finished with your virtual environment, you might find you want to declutter the list of available kernels in your Jupyter Notebook. To do this, you can list the available kernels from the terminal and remove your choice using the commands below:

% jupyter kernelspec list
# Available kernels:
  awesome-conda-env                                                    /Users/user_name/Library/Jupyter/kernels/awesome-conda-env
  python3                                                              /Users/user_name/Library/Jupyter/kernels/python3
  awesome_venv                                                         /usr/local/share/jupyter/kernels/awesome_venv

% jupyter kernelspec uninstall awesome_venv awesome-conda-env