What files should coverage measure?

Wednesday 28 March 2012

Maybe this is crazy, but I’m looking for advice.

Conceptually, coverage.py is pretty simple. First, using the sys.settrace facility in Python, record every line that is executed. Then, after the program is done, report on those lines, and especially on lines that could have been executed but were not.
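To make that concrete, here is a toy sketch of the recording half. It is not coverage.py's actual code: the names `executed`, `tracer`, `sample`, and `run_traced` are made up for illustration, and the real thing has far more to worry about.

```python
import sys
from collections import defaultdict

# Hypothetical store: file name -> set of executed line numbers.
executed = defaultdict(set)


def tracer(frame, event, arg):
    # A "line" event fires for each line about to run in a traced frame.
    if event == "line":
        executed[frame.f_code.co_filename].add(frame.f_lineno)
    # Returning the tracer keeps line-level tracing on for this frame.
    return tracer


def sample(n):
    if n > 0:
        return "positive"
    return "non-positive"


def run_traced(func, *args):
    """Run func under the tracer, then restore the previous trace function."""
    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        return func(*args)
    finally:
        sys.settrace(previous)


result = run_traced(sample, 1)
lines_hit = executed[sample.__code__.co_filename]
```

After this runs, `lines_hit` holds the two lines of `sample` that executed. The final `return` is missing from the set, and that is exactly the kind of line the report should call out.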

Of course, the reality is more difficult. During execution, to record the line, we have to find the file name, which we get from the stack frame. Later, we look for that file by name to create the report. Sometimes, the file isn’t a Python file!

One reason this can happen is if the file was actually created by a tool, and the tool provides the original source file as the reported name. For example, Jinja compiles .html files to Python code, and when the code is running, it claims to be “mytemplate.html”. When coverage.py tries to report on the file, it can’t parse it as Python, and things go wrong.
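The effect is easy to reproduce with plain `compile()`. This is only an illustration of the mechanism, not a claim about how Jinja does it internally: `generated_source` and `template_text` are invented stand-ins for a template engine's output and the template on disk.

```python
import ast
import sys

# Stand-in for the Python code a template engine generates.
generated_source = (
    "def render():\n"
    "    return sys._getframe().f_code.co_filename\n"
)

# The second argument to compile() is the file name stored in the code
# object. Nothing checks that it names a real Python file.
code = compile(generated_source, "mytemplate.html", "exec")
namespace = {"sys": sys}
exec(code, namespace)
reported_name = namespace["render"]()

# Stand-in for the text actually found in mytemplate.html at report time.
template_text = "<h1>{{ title }}</h1>\n"
try:
    ast.parse(template_text, filename=reported_name)
    parse_ok = True
except SyntaxError:
    parse_ok = False
```

The running code reports `mytemplate.html` as its file name, and the text found under that name is not Python, so `parse_ok` ends up `False`.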

Originally, this error would be reported to the user. There’s a -i switch that shuts off all errors like this, but it seemed dumb for coverage.py to get confused by something like this. So I changed it to not trace files named “*.html”.

Of course, the world is more varied than that, and I got a report from someone with Jinja2 files named “*.jinja2”, which now trip the error. So I need a more general solution.

I figure there are a couple of possibilities:

  1. Don’t measure files at all if they have an extension that isn’t “.py”. This will let us measure extension-less files, and .py files, and will ignore all the rest, on the theory that any other extension implies that we won’t be able to parse it later anyway.
  2. Measure all files, but during reporting, if a file can’t be parsed, ignore the error if it has an extension that isn’t “*.py”.
  3. (Shudder) Make a configuration option about what extensions to measure, or which to ignore.
  4. Some people want “ignore errors” to be the default, but if a file is missing for some reason, it’s important to know, because it will throw off the reporting, and that shouldn’t happen silently.
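Here is a rough sketch of what options 1 and 2 might look like. The function names `should_measure` and `parse_for_report` are hypothetical, not anything in coverage.py, and I'm assuming extension-less files get the same treatment as ".py" files in both cases.

```python
import ast
import os.path

# Assumption: no extension counts as "probably Python", same as ".py".
PYTHON_EXTENSIONS = ("", ".py")


def should_measure(filename):
    """Option 1: decide at trace time, by extension alone."""
    ext = os.path.splitext(filename)[1]
    return ext in PYTHON_EXTENSIONS


def parse_for_report(filename, source):
    """Option 2: measure everything, decide at report time.

    Returns the parsed tree, or None if the file fails to parse and has
    an extension that suggests it was never Python in the first place.
    Parse failures in ".py" and extension-less files still raise.
    """
    try:
        return ast.parse(source, filename=filename)
    except SyntaxError:
        ext = os.path.splitext(filename)[1]
        if ext not in PYTHON_EXTENSIONS:
            return None
        raise
```

The difference is where the guess gets made. Option 1 throws the data away up front, so a Python file with an odd extension is never measured at all. Option 2 keeps the data and only goes quiet when parsing actually fails.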

Do people ever name their Python source files something other than “*.py”? Are there weird ecosystems like this that I’ll only hear about if I make one of these changes?

Comments

[gravatar]
Option (2) seems like much the best: it makes a best effort to produce useful output, doesn't bother the user with pointless error messages, and doesn't require a configuration option.

There are a couple of good reasons to give your Python file a non-standard extension: (a) because of an extension-based policy, for example on a web server where only files with the .cgi extension get executed as CGI scripts; (b) for command-line tools where the user prefers to type "foo" rather than "foo.py".

How about option (5) — measure all files; try to parse them as Python; if that fails, report naïve (line-based rather than code-based) coverage metrics for them. This might give useful results even for Jinja's .html templates.
[gravatar]
>> Do people ever name their Python source files something other than "*.py"?

On Windows, the ".pyw" extension is used to run Python programs without creating a console window.
[gravatar]
I'd cope with option 4 by emitting a single message if the number of files that couldn't be found is non-zero. But this should be independent of most of the other stuff you are considering.
[gravatar]
Couldn't you provide hooks for people who create their own translators to Python, so they can provide some code that your tool would use to understand their source files?
[gravatar]
Tornado's templates generate code similarly to Jinja's, but we set the fake filename to "mytemplate.generated.py" (after several iterations) because this gives the best stack traces on errors. However, since these files never exist on disk, we have to turn on ignore_errors in our coverage reports (this is different from the Jinja issue, where the file exists but is not Python). A narrower version of ignore_errors might be nice: either a filename filter (as in #3), or the option to ignore files that don't exist without ignoring other errors.

Another more ambitious option would be to grab the generated source at runtime: Tornado templates support the PEP 302 loader protocol so linecache works on them.
[gravatar]
I use/distribute executable Python files without the *.py extension by using the "#!/usr/bin/env python" idiom. So please have any solution take into account that a file may be Python code without any extension.
[gravatar]
Hooks like artem suggested would be fantastic if they could be used to provide a coverage report of template files. It'd be very nice to see what the coverage of a template was.
[gravatar]
What about projects that use config files that are just Python source files rather than yaml or xml or whatever? I know that's fairly common. Usually they just use the .py extension though rather than another extension but not always.
[gravatar]
I vote for option 2.
[gravatar]
Thanks for all the opinions!

All: When I said, "if it has an extension that isn't *.py", I wasn't including "no extension" in that. Extension-less files are safe!

@Artem: "hooks" are an interesting idea, but I don't know if tool makers would be able to perform the back-mapping.

Anyone else have specific cases of files with unusual extensions?
[gravatar]
BuildBot has master.cfg which is Python, not sure it's useful to measure coverage in though. There are classes of "configuration" files which are actually Python in disguise.
[gravatar]
Lennart Regebro 2:42 AM on 29 Mar 2012
To me it seems that if it can't parse a file, it should output an error message saying "File could not be parsed, it seems that it does not contain valid Python". If you have a lot of these, you could shut them up with -i.

Possibly you could treat files without extension or an extension starting in ".py" differently, but I'm not sure there is a need.
[gravatar]
File extensions fall into three classes:
1) Must be Python code, exactly like ".py" files, and it should be reported as an error if it cannot be parsed satisfactorily (unless silenced by general error suppression). For example: ".py2" & ".py3" or similar conventions, fancy extensions for application scripts or CGI-like setups.
2) Could be Python code or not; there's no way to tell in advance, and no error should be reported if it isn't. For example: files with no extension, which on POSIX systems might be executable Python scripts or executable scripts for some other interpreter or something completely different.
3) It isn't expected to be Python code, never try to parse it as it would be a waste of time. For example, the mentioned extended-HTML templates.

I suggest a safe default (".py" in class 1, extension-less in class 2, anything else in class 2 or class 3) and two or three optional command-line options to override the default (maybe "-pythonextension", "-maybepythonextension", "-nonpythonextension").

This policy about the "deluxe" treatment of Python sources could be combined with option 4 (reporting missing files), as checking that a file exists and contains the lines referenced in Python bytecode doesn't require parsing it. Another command-line option would be needed to reverse the default.
