Updated: this was solved on Hacker News. Details in Bug #915: solved!
I just released coverage.py 5.0.3, with two bug fixes. There was another bug I really wanted to fix, but it has stumped me. I’m hoping someone can figure it out.
Bug #915 describes a disk I/O failure. Thanks to some help from Travis support, Chris Caron has provided instructions for reproducing it in Docker, and they work: I can generate disk I/O errors at will. What I can’t figure out is what coverage.py is doing wrong that causes the errors.
To reproduce it, start a Travis-based docker image:
cid=$(docker run -dti --privileged=true --entrypoint=/sbin/init \
-v /sys/fs/cgroup:/sys/fs/cgroup:ro \
travisci/ci-sardonyx:packer-1542104228-d128723)
docker exec -it $cid /bin/bash
Then in the container, run these commands:
su - travis
git clone --branch=nedbat/debug-915 https://github.com/nedbat/apprise-api.git
cd apprise-api
source ~/virtualenv/python3.6/bin/activate
pip install tox
tox -e bad,good
This will run two tox environments, called good and bad. Bad will fail with a disk I/O error, good will succeed. The difference is that bad uses the pytest-cov plugin, good does not. Two detailed debug logs will be created: debug-good.txt and debug-bad.txt. They show what operations were executed in the SqliteDb class in coverage.py.
The Big Questions: Why does bad fail? What is it doing at the SQLite level that causes the failure? And most importantly, what can I change in coverage.py to prevent the failure?
Some observations and questions:
- If I change the last line of the steps to “tox -e good,bad” (that is, run the environments in the other order) then the error doesn’t happen. I don’t understand why that would make a difference.
- I’ve tried adding time.sleep’s to try to slow the pace of database access, but maybe in not enough places? And if this fixes it, what’s the right way to productize that change?
- I’ve tried using the detailed debug log to create a small Python program that in theory accesses the SQLite database in exactly the same way, but I haven’t managed to create the error that way. What aspect of access am I overlooking?
If you come up with answers to any of these questions, I will reward you somehow. I am also eager to chat if that would help you solve the mysteries. I can be reached on email, Twitter, as nedbat on IRC, or in Slack. Please get in touch if you have any ideas. Thanks.
Comments
If you do, this error sounds like a file/disk buffering problem. I've run into a few of them in my time. If this is the case, the only robust-ish solution is to wait ~20 seconds at the end of the test.
Add a comment: