Pip science experiments

Friday 30 October 2015

My day job is working on Open edX. It's large, and our requirements files are getting unruly. In particular, our requirements file for installing our other GitHub repos has grown very long in the tooth.

First, we have a mix of -e installs and non-e installs. -e (editable) means: check out the git working tree, and then use that checkout as the installed code. This makes it easy to use the code: you can skip the discipline of writing and properly maintaining a setup.py. Just changing the SHA in the GitHub URL should bring in new code.

We also have inconsistent use of "#egg" syntax in the URLs, and we don't always include the version number, and when we do, we use one of three different syntaxes for it.

Worse, we'd developed a cargo-cult mentality about the mysteries of what pip might do. No one had confidence about the different behavior to expect from the different syntaxes. Sometimes updated code was being installed, and sometimes not.

I did an experiment where I made a simple package with just a version number in it (version_dummy), and I tried installing it in various ways. I found that I had to include a version number in the hash fragment at the end of the URL to get it to update properly. Then another engineer did a similar experiment and came to the opposite conclusion, that just changing the SHA would be enough.

As bad as cargo-culting is, this was even worse: two experiments designed to answer the same question, with different results! It was time to get serious.

An important property of science is reproducibility: another investigator should be able to run your experiment to see if they get the same results. On top of that, I knew I'd want to re-run my own experiment many times as I thought of new twists to try.

So I wrote a shell script that automated the installation and verification of versions. You can run it yourself: create a new virtualenv, then run the script.

I asked in the #pypa IRC channel for help with my mystery, and they had the clue I needed to get to the bottom of why we got two different answers. I had a git URL that looked like this:

git+https://github.com/nedbat/version_dummy@123abc456#egg=version_dummy

He had a URL like this:

git+https://github.com/otherguy/example@789xyz456#egg=example

These look similar enough that they should behave the same, right? The difference is that mine has an underscore in the name, and his does not. My suffix ('#egg=version_dummy') is being parsed inside pip as if the package name was "version" and the version was "dummy"! This meant that updating the SHA wouldn't install new code, because pip thought it knew what version it would get ("dummy"), and that's the version it already had, so why install it?
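To make the ambiguity concrete, here is a toy Python sketch. It is not pip's actual parsing code: the function names `egg_fragment` and `naive_split`, and the rule of treating the last "-" or "_" as a name/version separator, are assumptions made purely for illustration.

```python
import re


def egg_fragment(url):
    """Return the text after 'egg=' in the URL fragment, or None if absent."""
    match = re.search(r"[#&]egg=([^&]*)", url)
    return match.group(1) if match else None


def naive_split(egg):
    """Hypothetical rule: treat the last '-' or '_' as separating name from version.

    This only illustrates how punctuation in a name can be misread;
    it does not reproduce pip's real behavior.
    """
    match = re.match(r"^(.*)[-_]([^-_]+)$", egg)
    if match:
        return match.group(1), match.group(2)
    return egg, None


mine = "git+https://github.com/nedbat/version_dummy@123abc456#egg=version_dummy"
his = "git+https://github.com/otherguy/example@789xyz456#egg=example"

for url in (mine, his):
    name, version = naive_split(egg_fragment(url))
    print(f"name={name!r} version={version!r}")
```

Under this toy rule, the first URL yields a name of "version" with a version of "dummy", while the second has nothing to split on and so carries no version at all.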

Writing my experiment.sh script gave me a good place to try out different scenarios of updating my version_dummy from version 1.0 to 2.0.
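The real script is a shell script. Purely as an illustration, one scenario loop could be sketched in Python along these lines. The function names (`pip_install_args`, `installed_version`, `run_scenario`) are hypothetical and are not taken from experiment.sh, and the sketch assumes it is run with the Python interpreter of a fresh virtualenv.

```python
import subprocess
import sys


def pip_install_args(requirement, editable=False):
    """Build the command line that installs one requirement with this interpreter's pip."""
    args = [sys.executable, "-m", "pip", "install"]
    if editable:
        args.append("-e")
    args.append(requirement)
    return args


def installed_version(dist_name):
    """Ask a fresh interpreter which version of dist_name is installed, or None."""
    code = (
        "import sys; from importlib.metadata import version; "
        "print(version(sys.argv[1]))"
    )
    result = subprocess.run(
        [sys.executable, "-c", code, dist_name],
        capture_output=True,
        text=True,
    )
    return result.stdout.strip() if result.returncode == 0 else None


def run_scenario(steps, dist_name):
    """Install each (requirement, editable) step in order.

    Returns the installed version observed after every step, so a
    scenario that fails to update shows up as a repeated version.
    """
    seen = []
    for requirement, editable in steps:
        subprocess.run(pip_install_args(requirement, editable), check=True)
        seen.append(installed_version(dist_name))
    return seen
```

A scenario would then be a list of two steps, each pairing a git URL like the ones above (with the SHA for 1.0, then the SHA for 2.0) with a flag saying whether to use -e. If `run_scenario` returns the same version twice, the update did not take.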

Things I learned:

  • -e installs work even if you only change the SHA, although there remains superstition around the office that this is not true. That might just be superstition, or there might be scenarios where it fails that I haven't tried yet.
  • If you use a non-e install, you have to supply an explicit version number on the URL, because punctuation in the package name can confuse pip.
  • If you install a package non-e, and then update it with a -e install, you will have both installed, and you'll need to uninstall it twice to really get rid of it.
  • There are probably more scenarios that I haven't tried yet that will come back to bite me later. :(

If anyone has more information, I'm really interested.

Comments

Dane Hillard 2:01 PM on 30 Oct 2015

My experience says that installing a package from a remote URL non-e and then installing a local copy with -e doesn't produce two copies of the package. Rather, you end up with a file

/path/to/virtualenv/lib/pythonX.x/site-packages/my-package.egg-link

that points to the local copy. Perhaps installing with -e using a non-local URI behaves differently.

Ryne Everett 2:03 PM on 1 Nov 2015

> -e installs work even if you only change the SHA, although there remains superstition around the office that this is not true. That might just be superstition, or there might be scenarios where it fails that I haven't tried yet.

This is correct; -e VCS installations are always fetched on update because pip doesn't store any state and therefore has no idea if the source has changed.

> If you use a non-e install, you have to supply an explicit version number on the URL, because punctuation in the package name can confuse pip.

I'm not aware of the punctuation issue, but the problem with non-e installs is that they are only updated if the package version number (i.e., in setup.py) has changed. For instance, if you change a non-e requirement to a commit with the same version number in setup.py and `pip install -U`, it will not be updated. It therefore seems that non-e installs are worthless in all but ephemeral environments.

Ned Batchelder 4:17 PM on 7 Nov 2015

@Ryne: I'm not sure why you say "non-e installs are worthless in all but ephemeral environments"? They are good for packages that update their version numbers, which is a good practice.

Ryne Everett 4:35 PM on 7 Nov 2015

@Ned: If the version you want has a unique version number then why use a git installation? The only reason I know of to use a git installation is to select a specific commit or branch that wasn't published to the cheese shop and therefore probably doesn't have a unique version number. Are there many packages that update the version number on every commit?

Ned Batchelder 5:40 PM on 7 Nov 2015

@Ryne, OK, I see your point: if you are updating the version number, then you can publish to PyPI, and don't need a git install.
