
REST API gotcha, and webhookdb

Saturday 20 December 2014

For the Open edX project, we like to collect statistics about our pull requests. GitHub provides a very capable API that gives me all sorts of information.

Across more than 30 repos, we have more than 9500 pull requests. Getting detailed information about all of them would require at least 9500 requests to the GitHub API. But GitHub rate-limits API use: I can only make 5000 requests in an hour, so I can't collect details across all of our pull requests.

Most of those pull requests are old, and closed. They haven't changed in a long time. GitHub supports ETags, and any request that responds with 304 Not Modified isn't counted against your rate limit. So I should be able to use ETags to mostly get cached information, and still be able to get details for all of my pull requests.
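
Done by hand, that revalidation could look roughly like the sketch below. The function name, the dict-based cache, and the session argument are all my own illustrative choices: the session is anything with a requests-style get() method (a requests.Session would do), and the cache is a plain dict rather than anything persistent.

```python
def fetch_with_etag(session, url, cache):
    """Fetch url, revalidating a cached copy with its ETag if we have one.

    `session` is anything with a requests-style .get(url, headers=...).
    `cache` is a plain dict mapping url -> (etag, body). Both are
    illustrative choices, not part of any library.
    """
    headers = {}
    cached = cache.get(url)
    if cached is not None:
        # Ask the server to send a body only if the ETag no longer matches.
        headers["If-None-Match"] = cached[0]

    response = session.get(url, headers=headers)
    if response.status_code == 304 and cached is not None:
        # Not Modified: there is no body in the response, reuse the cached one.
        return cached[1]

    response.raise_for_status()
    body = response.json()
    etag = response.headers.get("ETag")
    if etag is not None:
        cache[url] = (etag, body)
    return body
```

The first call for a URL sends no If-None-Match header and stores the ETag that comes back. Later calls send it, and a 304 reply is answered from the cache.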

I'm using requests to access the API. The CacheControl package offers really easy integration of HTTP caching:

import requests
from cachecontrol import CacheControlAdapter
from cachecontrol.caches import FileCache

# ...

session = requests.Session()
adapter = CacheControlAdapter(cache=FileCache(".webcache"))
session.mount("http://", adapter)
session.mount("https://", adapter)

I ran my program with this, and it didn't seem to help: I was still running out of requests against the API. After a lot of debugging, I figured out why. The reason is instructive for API design.

When you ask the GitHub API for details of a pull request, you get a JSON response that looks like this (many details omitted, see the GitHub API docs for the complete response):

{
  "id": 1,
  "url": "https://api.github.com/repos/octocat/Hello-World/pulls/1347",
  "number": 1347,
  "state": "open",
  "title": "new-feature",
  "body": "Please pull these awesome changes",
  "created_at": "2011-01-26T19:01:12Z",
  "updated_at": "2011-01-26T19:01:12Z",
  "closed_at": "2011-01-26T19:01:12Z",
  "merged_at": "2011-01-26T19:01:12Z",
  "head": {
    "label": "new-topic",
    "ref": "new-topic",
    "sha": "6dcb09b5b57875f334f61aebed695e2e4193db5e",
    "user": {
      "login": "octocat",
      ...
    },
    "repo": {
      "id": 1296269,
      "owner": {
        "login": "octocat",
        ...
      },
      "name": "Hello-World",
      "full_name": "octocat/Hello-World",
      "description": "This your first repo!",
      "private": false,
      "fork": false,
      "url": "https://api.github.com/repos/octocat/Hello-World",
      "homepage": "https://github.com",
      "language": null,
      "forks_count": 9,
      "stargazers_count": 80,
      "watchers_count": 80,
      "size": 108,
      "default_branch": "master",
      "open_issues_count": 0,
      "has_issues": true,
      "has_wiki": true,
      "has_pages": false,
      "has_downloads": true,
      "pushed_at": "2011-01-26T19:06:43Z",
      "created_at": "2011-01-26T19:01:12Z",
      "updated_at": "2011-01-26T19:14:43Z",
      "permissions": {
        "admin": false,
        "push": false,
        "pull": true
      }
    }
  },
  "base": {
    "label": "master",
    "ref": "master",
    "sha": "6dcb09b5b57875f334f61aebed695e2e4193db5e",
    "user": {
      "login": "octocat",
      ...
    },
    "repo": {
      "id": 1296269,
      "owner": {
        "login": "octocat",
        ...
      },
      "name": "Hello-World",
      "full_name": "octocat/Hello-World",
      "description": "This your first repo!",
      "private": false,
      "fork": false,
      "url": "https://api.github.com/repos/octocat/Hello-World",
      "homepage": "https://github.com",
      "language": null,
      "forks_count": 9,
      "stargazers_count": 80,
      "watchers_count": 80,
      "size": 108,
      "default_branch": "master",
      "open_issues_count": 0,
      "has_issues": true,
      "has_wiki": true,
      "has_pages": false,
      "has_downloads": true,
      "pushed_at": "2011-01-26T19:06:43Z",
      "created_at": "2011-01-26T19:01:12Z",
      "updated_at": "2011-01-26T19:14:43Z",
      "permissions": {
        "admin": false,
        "push": false,
        "pull": true
      }
    }
  },
  "user": {
    "login": "octocat",
    ...
  },
  "merge_commit_sha": "e5bd3914e2e596debea16f433f57875b5b90bcd6",
  "merged": false,
  "mergeable": true,
  "merged_by": {
    "login": "octocat",
    ...
  },
  "comments": 10,
  "commits": 3,
  "additions": 100,
  "deletions": 3,
  "changed_files": 5
}

GitHub has done a common thing with their REST API: they include details of related objects. So this pull request response also includes details of the users involved, and the repos involved, and the repos include details of their users, and so on.

The ETag for a response fingerprints the entire response. That means that if any data in the response changes, the ETag will change, which means that the cached copy will be ignored and the full response will be returned.

Look again at the repo information included: open_issues_count changes every time an issue is opened or closed. A pull request is a kind of issue, so that happens a lot. There's also pushed_at and updated_at, which will change frequently.

So when I'm getting details about a pull request that has been closed and dormant for (let's say) a year, the ETag will still change many times a day, because of other irrelevant activity in the repo. I didn't need those repo details on the pull request in the first place, but I always thought it was just harmless bulk. Nope, it's actively hurting my ability to use the API effectively.
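
Here is a small illustration of the effect. I don't know how GitHub actually computes its ETags, so this sketch assumes a SHA-256 hash of the serialized JSON as a stand-in, and the payload is a made-up, trimmed-down pull request. The helper names fingerprint and own_fields are mine.

```python
import copy
import hashlib
import json


def fingerprint(payload):
    """Hash the whole payload, standing in for a server-side ETag."""
    canonical = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def own_fields(payload):
    """Keep only the pull request's own fields, dropping embedded repos."""
    return {k: v for k, v in payload.items() if k not in ("head", "base")}


pull_request = {
    "number": 1347,
    "state": "closed",
    "title": "new-feature",
    "base": {
        "repo": {"full_name": "octocat/Hello-World", "open_issues_count": 0},
    },
}

# Someone opens an unrelated issue in the same repo.
later = copy.deepcopy(pull_request)
later["base"]["repo"]["open_issues_count"] += 1

whole_unchanged = fingerprint(pull_request) == fingerprint(later)
own_unchanged = (
    fingerprint(own_fields(pull_request)) == fingerprint(own_fields(later))
)

print(whole_unchanged)  # False
print(own_unchanged)    # True
```

Nothing about the pull request itself changed, but the fingerprint of the whole response did, so a cache keyed on it gets invalidated anyway.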

Some REST APIs give you control over the fields returned, or related objects included in responses, but GitHub's does not. I don't know how to use the GitHub API the way I wanted to.

So the pull request response has lots of details I don't need (the repo's owner's avatar URL?), and omits plenty of details I'm likely to need, like commits, comments, and so on. I understand that they aren't including one-to-many information at all, but I'd rather see the one-to-many than the almost certainly useless one-to-one information that is included and that makes automatic caching impossible.

Luckily, my co-worker David Baumgold had a good idea and the energy to implement it: webhookdb replicates GitHub data to a relational database, using webhooks to keep the two in sync. It works great: now I can make queries against Postgres to get details of pull requests! No rate limiting, and I can use SQL if it's a better way to express my questions.
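
To give a flavor of the kind of question SQL answers well, here is a sketch. It is not webhookdb's real schema: the pull_request table, its columns, and the rows are all invented for illustration, and an in-memory SQLite database stands in for Postgres so the example runs on its own.

```python
import sqlite3

# In-memory SQLite as a stand-in for the Postgres database.
conn = sqlite3.connect(":memory:")

# Hypothetical table and columns, invented for this example.
conn.execute(
    "CREATE TABLE pull_request (repo TEXT, number INTEGER, state TEXT)"
)
conn.executemany(
    "INSERT INTO pull_request VALUES (?, ?, ?)",
    [
        ("example/repo-a", 1, "closed"),
        ("example/repo-a", 2, "closed"),
        ("example/repo-a", 3, "open"),
        ("example/repo-b", 1, "closed"),
    ],
)

# How many pull requests are in each state, per repo?
rows = conn.execute(
    """
    SELECT repo, state, COUNT(*)
    FROM pull_request
    GROUP BY repo, state
    ORDER BY repo, state
    """
).fetchall()

for row in rows:
    print(row)
```

One query covers every repo at once, with no per-pull-request API calls.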

Seeing what the computer sees

Saturday 13 December 2014

One of the challenging things about programming is being able to really see code the way the computer is going to see it. Sometimes the human-only signals are so strong, we can't ignore them. This is one of the reasons I like indentation-significant languages like Python: people attend to the indentation whether the computer does or not, so you might as well have the people and computers looking at the same thing.

I was reminded of this problem yesterday while trying to debug a sample application I was toying with. It has a config file with some strings and dicts in it. It reads in part like this:

SECRET_KEY = 'you-will-never-guess'
""" secret key for authentication
"""

PYLTI_URL_FIX = {
""" Remap URL to fix edX's misrepresentation of https protocol.
    You can add another dict entry if you have trouble with the
    PyLti URL.
"""

    "https://localhost:8000/": {
        "https://localhost:8000/": "http://localhost:8000/"
    },
    "https://localhost/": {
        "https://localhost/":"http://192.168.33.10/"
    }
}

When I saw this file, I thought, "That's a weird way to comment things," but didn't worry more about it. Then later when the response was failing, I debugged into it, and realized what was wrong with this file. Before reading on, do you see what it is?

•    •    •

•    •    •

•    •    •

Python concatenates adjacent string literals. This is handy for making long strings without having to worry about backslashes. In real code, this feature is little-used, and it happens in a surprising place here. The "docstring" for the dictionary is implicitly concatenated to the first key. PYLTI_URL_FIX has a key that's 163 characters long: " Remap URL to ... URL.\nhttps://localhost:8000/", including three newlines.

But SECRET_KEY isn't affected. Why? Because the SECRET_KEY assignment line is a complete statement all by itself, so it doesn't continue onto the next line. Its "docstring" is a statement all by itself. The PYLTI_URL_FIX docstring is inside the braces of the dictionary, so it's all part of one 13-line statement. All the tokens are considered together, and the adjacent strings are concatenated.
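
A cut-down version shows both behaviors side by side. URL_FIX is a shortened stand-in I made up for the real dictionary, with one entry and a shorter "docstring."

```python
# A string after a complete statement is a separate expression statement.
# It is evaluated and thrown away, so SECRET_KEY is untouched.
SECRET_KEY = 'you-will-never-guess'
""" secret key for authentication
"""

# Inside the braces, everything is one statement, so the two adjacent
# string literals are concatenated into a single key.
URL_FIX = {
""" Meant as a comment, but it is a string literal.
"""
    "https://localhost/": "http://192.168.33.10/",
}

print(repr(SECRET_KEY))
print(list(URL_FIX))
```

The dictionary ends up with exactly one key, and it starts with " Meant as a comment" rather than "https://".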

As odd as this code was, it was still hard to see what was going to happen, because the first string was clearly meant as a comment, both in its token form (a multiline string, starting in the first column) and in its content (English text explaining the dictionary). The second string is clearly intended as a key in the dict (short, containing data, indented). But all of those signals are human signals, not computer signals. So I as a human attended to them and misunderstood what would happen when the computer saw the same text and ignored those signals.

The fix of course is to use conventional comments. Programming is hard, yo. Stick to the conventions.
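
For this file, that could look something like the following: the same settings and the same explanatory text, moved into # comments so the computer skips exactly what the human skips.

```python
# Secret key for authentication.
SECRET_KEY = 'you-will-never-guess'

# Remap URL to fix edX's misrepresentation of https protocol.
# You can add another dict entry if you have trouble with the
# PyLti URL.
PYLTI_URL_FIX = {
    "https://localhost:8000/": {
        "https://localhost:8000/": "http://localhost:8000/"
    },
    "https://localhost/": {
        "https://localhost/": "http://192.168.33.10/"
    }
}
```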

What is the Lotus Notes of today?

Sunday 7 December 2014

I have a document challenge. It's a perfect job for Lotus Notes. What do I use in its place today?

I want to keep track of a bunch of web sites, say 100-200 of them. For each, I want a free-form document that lets me keep notes about it. But I also have structured information I want to track for each, like an email contact, a GitHub repo, some statistics, and so on. I want to be able to display these documents in summarized lists, so that some of the structured information is displayed in a table, and I can sort and filter the documents based on that information.

This is exactly what Lotus Notes did well. Is there something that can do it now? Ideally, it would be part of a Confluence wiki, but other options would be good too. (Please don't say SharePoint...)

CouchDB is the perfect backend for a system like this (no wonder, it was written by Damien Katz, and inspired by his time at Lotus), but is there a GUI client that makes it a complete application?

Say what you will about Lotus Notes, it was really good at this kind of job.
