REST API gotcha, and webhookdb

Saturday 20 December 2014

For the Open edX project, we like to collect statistics about our pull requests. GitHub provides a very capable API that gives me all sorts of information.

Across more than 30 repos, we have more than 9500 pull requests. Getting detailed information about all of them would require at least 9500 requests to the GitHub API. But GitHub rate-limits API use to 5000 requests per hour, so I can't collect details for all of our pull requests in one pass.
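
You can watch your quota without spending it: the rate_limit endpoint doesn't count against the limit. A quick sketch (unauthenticated here, so it reports the anonymous limit rather than the authenticated 5000 per hour):

import requests

# /rate_limit is free: asking about your quota doesn't use any of it.
resp = requests.get("https://api.github.com/rate_limit")
core = resp.json()["resources"]["core"]
print("%s of %s requests remaining" % (core["remaining"], core["limit"]))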

Most of those pull requests are old, and closed. They haven’t changed in a long time. GitHub supports ETags, and any request that responds with 304 Not Modified isn’t counted against your rate limit. So I should be able to use ETags to mostly get cached information, and still be able to get details for all of my pull requests.
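
The mechanics are easy to see with requests directly. A minimal sketch, using the pull request from GitHub's own docs as the example URL:

import requests

url = "https://api.github.com/repos/octocat/Hello-World/pulls/1347"

# First request costs one unit of quota; remember the ETag it returns.
resp = requests.get(url)
etag = resp.headers["ETag"]

# Replay the request with If-None-Match: a 304 means nothing changed,
# and the request is free.
resp = requests.get(url, headers={"If-None-Match": etag})
if resp.status_code == 304:
    print("cached copy is still good")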

I’m using requests to access the API. The CacheControl package offers really easy integration of HTTP caching:

import requests
from cachecontrol import CacheControlAdapter
from cachecontrol.caches import FileCache

# ...

session = requests.Session()
# Mount the caching adapter so every request goes through an on-disk cache.
adapter = CacheControlAdapter(cache=FileCache(".webcache"))
session.mount("http://", adapter)
session.mount("https://", adapter)

I ran my program with this, and it didn't seem to help: I was still running out of requests against the API. After a lot of debugging, I figured out why. The reason is instructive for API design.

When you ask the GitHub API for details of a pull request, you get a JSON response that looks like this (many details omitted, see the GitHub API docs for the complete response):

{
  "id": 1,
  "url": "https://api.github.com/repos/octocat/Hello-World/pulls/1347",
  "number": 1347,
  "state": "open",
  "title": "new-feature",
  "body": "Please pull these awesome changes",
  "created_at": "2011-01-26T19:01:12Z",
  "updated_at": "2011-01-26T19:01:12Z",
  "closed_at": "2011-01-26T19:01:12Z",
  "merged_at": "2011-01-26T19:01:12Z",
  "head": {
    "label": "new-topic",
    "ref": "new-topic",
    "sha": "6dcb09b5b57875f334f61aebed695e2e4193db5e",
    "user": {
      "login": "octocat",
      ...
    },
    "repo": {
      "id": 1296269,
      "owner": {
        "login": "octocat",
        ...
      },
      "name": "Hello-World",
      "full_name": "octocat/Hello-World",
      "description": "This your first repo!",
      "private": false,
      "fork": false,
      "url": "https://api.github.com/repos/octocat/Hello-World",
      "homepage": "https://github.com",
      "language": null,
      "forks_count": 9,
      "stargazers_count": 80,
      "watchers_count": 80,
      "size": 108,
      "default_branch": "master",
      "open_issues_count": 0,
      "has_issues": true,
      "has_wiki": true,
      "has_pages": false,
      "has_downloads": true,
      "pushed_at": "2011-01-26T19:06:43Z",
      "created_at": "2011-01-26T19:01:12Z",
      "updated_at": "2011-01-26T19:14:43Z",
      "permissions": {
        "admin": false,
        "push": false,
        "pull": true
      }
    }
  },
  "base": {
    "label": "master",
    "ref": "master",
    "sha": "6dcb09b5b57875f334f61aebed695e2e4193db5e",
    "user": {
      "login": "octocat",
      ...
    },
    "repo": {
      "id": 1296269,
      "owner": {
        "login": "octocat",
        ...
      },
      "name": "Hello-World",
      "full_name": "octocat/Hello-World",
      "description": "This your first repo!",
      "private": false,
      "fork": false,
      "url": "https://api.github.com/repos/octocat/Hello-World",
      "homepage": "https://github.com",
      "language": null,
      "forks_count": 9,
      "stargazers_count": 80,
      "watchers_count": 80,
      "size": 108,
      "default_branch": "master",
      "open_issues_count": 0,
      "has_issues": true,
      "has_wiki": true,
      "has_pages": false,
      "has_downloads": true,
      "pushed_at": "2011-01-26T19:06:43Z",
      "created_at": "2011-01-26T19:01:12Z",
      "updated_at": "2011-01-26T19:14:43Z",
      "permissions": {
        "admin": false,
        "push": false,
        "pull": true
      }
    }
  },
  "user": {
    "login": "octocat",
    ...
  },
  "merge_commit_sha": "e5bd3914e2e596debea16f433f57875b5b90bcd6",
  "merged": false,
  "mergeable": true,
  "merged_by": {
    "login": "octocat",
    ...
  },
  "comments": 10,
  "commits": 3,
  "additions": 100,
  "deletions": 3,
  "changed_files": 5
}

GitHub has done a common thing with their REST API: they include details of related objects. So this pull request response also includes details of the users involved, and the repos involved, and the repos include details of their users, and so on.

The ETag for a response fingerprints the entire response. If any data anywhere in the response changes, the ETag changes, the cached copy is ignored, and a full, rate-limited response comes back.

Look again at the repo information included: open_issues_count changes every time an issue is opened or closed. A pull request is a kind of issue, so that happens a lot. There’s also pushed_at and updated_at, which will change frequently.

So when I’m getting details about a pull request that has been closed and dormant for (let’s say) a year, the ETag will still change many times a day, because of other irrelevant activity in the repo. I didn’t need those repo details on the pull request in the first place, but I always thought it was just harmless bulk. Nope, it’s actively hurting my ability to use the API effectively.
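
A toy illustration, with a stand-in fingerprint rather than GitHub's actual ETag scheme: hash a canonical serialization of the response, and one volatile counter buried in the repo details changes the whole digest.

import hashlib
import json

def fingerprint(payload):
    # Stand-in for an ETag: hash a canonical serialization of the response.
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

pr = {"number": 1347, "state": "closed",
      "base": {"repo": {"open_issues_count": 41}}}
before = fingerprint(pr)

pr["base"]["repo"]["open_issues_count"] = 42  # unrelated activity elsewhere in the repo
assert fingerprint(pr) != before  # the dormant pull request no longer matches its ETag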

Some REST APIs give you control over the fields returned, or over the related objects included in responses, but GitHub's does not. I don't know of a way to use the GitHub API the way I wanted to.
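
Purely hypothetical syntax, just to show the kind of thing I mean (GitHub's API accepts no such parameter):

url = "https://api.github.com/repos/octocat/Hello-World/pulls/1347"
# Hypothetical: name only the fields we need, so the ETag would
# fingerprint only data we actually care about.
resp = session.get(url, params={"fields": "number,state,title,created_at,merged_at"})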

So the pull request response has lots of details I don't need (the repo's owner's avatar URL?), and omits plenty of details I'm likely to need, like commits, comments, and so on. I understand: they aren't including one-to-many information at all. But I'd rather have the one-to-many than the almost certainly useless one-to-one information that is included, and that is making automatic caching impossible.

Luckily, my co-worker David Baumgold had a good idea and the energy to implement it: webhookdb replicates GitHub data to a relational database, using webhooks to keep the two in sync. It works great: now I can make queries against Postgres to get details of pull requests! No rate limiting, and I can use SQL if it’s a better way to express my questions.
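
A sketch of what that looks like; the table and column names are my guesses at a mirrored schema, not necessarily what webhookdb actually creates:

import psycopg2

conn = psycopg2.connect("dbname=webhookdb")
cur = conn.cursor()
# Hypothetical mirrored table: one row per pull request, fields named like the API's.
cur.execute("""
    SELECT number, title, created_at, merged_at
    FROM pull_request
    WHERE state = 'closed' AND merged_at IS NOT NULL
    ORDER BY created_at
""")
for number, title, created_at, merged_at in cur.fetchall():
    print(number, title, merged_at)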


Comments

Roger Lipscombe 9:10 PM on 20 Dec 2014
Related: Last year, Rob Ashton did a series of posts on the github event API, and using EventStore to aggregate them: http://codeofrob.com/entries/less-abstract,-pumping-data-from-github-into-the-eventstore.html

From the service's point of view it also makes a lot of sense to only return data that the client requests. Presumably there is a non-zero cost to generating that data but, more importantly, it allows you to see what parts of your API your clients are actually using. If you don't know what your clients are doing, it limits your ability to scale your system and evolve your API.

This smells like an internal API -- designed under the assumption that GitHub controlled both the client and the server -- that was later exposed to external clients.

My experience is that some changes to the GitHub data don't trigger the hook. For example, support at GitHub assured me "if the body of an issue is updated -- you won't see an issue event for that." Absent a full list of the fields that don't generate an event, which you could intersect with the fields you care about... well, you end up having to rescan. So I have a mirror of the set of issues for a large project, and a daily sweep to refresh it; that takes two hours. It's frustrating because the reports I generate often have errors when people work on the issues during the day. So I have more frequent resyncs for a subset of the issues that I heuristically guess are likely to be updated.

I think the problem is that the GitHub API has a design flaw. The ETag should be based on the state of the resource you requested -- not on the representation of that resource. I suppose they can include additional data if they want (although I agree they should make this optional), but including that extra data in the response should not affect the ETag.

On the other hand... in my reading of the HTTP spec, this isn't clear. Do equivalent ETags indicate that the underlying resources are equivalent, or that the responses to GETs on that resource are equivalent? I've seen other APIs that are explicit about the fact that the response is used to create the ETag. For example, the Facebook Graph API makes it clear that ETags are created from the entire response, including formatting, and that since a change in the User-Agent header might change the formatting, requests from different user agents on the same resource could result in different ETags. Many other RESTful APIs I've seen create a hash of the on-disk content (if not too expensive), which I think is the right way to go.
