Bug #915: please help!

Sunday 12 January 2020

Updated: this was solved on Hacker News. Details in Bug #915: solved!

I just released coverage.py 5.0.3, with two bug fixes. There was another bug I really wanted to fix, but it has stumped me. I’m hoping someone can figure it out.

Bug #915 describes a disk I/O failure. Thanks to some help from Travis support, Chris Caron has provided instructions for reproducing it in Docker, and they work: I can generate disk I/O errors at will. What I can’t figure out is what coverage.py is doing wrong that causes the errors.

To reproduce it, start a Travis-based docker image:

cid=$(docker run -dti --privileged=true --entrypoint=/sbin/init \
    -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
    travisci/ci-sardonyx:packer-1542104228-d128723)
docker exec -it $cid /bin/bash

Then in the container, run these commands:

su - travis
git clone --branch=nedbat/debug-915 https://github.com/nedbat/apprise-api.git
cd apprise-api
source ~/virtualenv/python3.6/bin/activate
pip install tox
tox -e bad,good

This will run two tox environments, called good and bad. Bad will fail with a disk I/O error, good will succeed. The difference is that bad uses the pytest-cov plugin, good does not. Two detailed debug logs will be created: debug-good.txt and debug-bad.txt. They show what operations were executed in the SqliteDb class in coverage.py.

The Big Questions: Why does bad fail? What is it doing at the SQLite level that causes the failure? And most importantly, what can I change in coverage.py to prevent the failure?

Some observations and questions:

  • If I change the last line of the steps to “tox -e good,bad” (that is, run the environments in the other order) then the error doesn’t happen. I don’t understand why that would make a difference.
  • I’ve tried adding time.sleep’s to try to slow the pace of database access, but maybe in not enough places? And if this fixes it, what’s the right way to productize that change?
  • I’ve tried using the detailed debug log to create a small Python program that in theory accesses the SQLite database in exactly the same way, but I haven’t managed to create the error that way. What aspect of access am I overlooking?

If you come up with answers to any of these questions, I will reward you somehow. I am also eager to chat if that would help you solve the mysteries. I can be reached on email, Twitter, as nedbat on IRC, or in Slack. Please get in touch if you have any ideas. Thanks.

Comments

[gravatar]
Do you by any chance kill a VM at the end of the test?

If you do, this error sounds like a file/disk buffering problem. I've run into a few of them in my time. If this is the case, the only robust-ish solution is to wait ~20 seconds at the end of the test.
[gravatar]
I've seen in the past that if you have a file the container references on the underlying filesystem and for some reason the inode of that file changes. Docker will lose it's mind. Don't know if that's your situation, but might be something to investigate.
[gravatar]
Just in case you're not aware, there is a diagnosis in the Hacker News thread for this post: https://news.ycombinator.com/item?id=22030974
[gravatar]
Thanks everyone. The Hacker News thread came up with the answer, which I summarized in Bug #915: solved!

Add a comment:

Ignore this:
Leave this empty:
Name is required. Either email or web are required. Email won't be displayed and I won't spam you. Your web site won't be indexed by search engines.
Don't put anything here:
Leave this empty:
Comment text is Markdown.