Cherry-picking a pull request

Sunday 16 May 2021

At work, we work in GitHub pull requests that get merged to the main branch. We also have twice-yearly community release branches, and a small fraction of the main-branch changes need to be copied onto the current release branch. Trying to automate choosing the commits to cherry-pick led me into some Git and GitHub complexities.

GitHub has three different ways to finish up a pull request, which complicates the process of figuring out what to cherry-pick. Before getting into cherry-picking, let’s look at the three finishes to pull requests. Suppose we have five commits on the main branch (A-B-C-D-E), and a pull request for a feature branch started from B with two commits (F-G) on it:

[Figure: main branch A-B-C-D-E, with feature branch F-G starting from B]

The F-G pull request can be brought into the main branch in three ways. First, the F-G commits can be merged to main with a merge commit:

[Figure: feature branch F-G merged into main with merge commit M]

Second, the two commits can be rebased onto main as two new commits Fr-Gr (for F-rebased and G-rebased):

[Figure: F-G rebased onto main as new commits Fr-Gr; the original F-G have dashed outlines]

Lastly, the two commits can be squashed down to one new commit FGs (for F and G squashed):

[Figure: F-G squashed onto main as one new commit FGs; the original F-G have dashed outlines]

Note that for rebased and squashed pull requests, the original commits F-G will not be reachable from the main branch, and will eventually disappear from the repo, indicated by their dashed outlines.

Now let’s consider the release branch. This is a branch made twice a year to mark community releases of the platform. Once the branch is made, some fixes need to be cherry-picked onto it from the main branch. We can’t just merge the fixes, because that would bring the entire history of the main branch into the release. Cherry-picking lets us take just the commits we want.

As an example, here E has been cherry-picked as Ec:

[Figure: main branch A-B-C-D-E and release branch R-S, with E cherry-picked onto the release branch as Ec]

The question now is:

To get the changes from a finished pull request onto the release branch, what commits should we cherry-pick?

The two rules are:

  1. The commits should make the same change to the release branch that was made to the main branch, and
  2. The commits should be reachable from the main branch, in case we need to later investigate how the changes came to be.

GitHub doesn’t record what approach was used to finish a pull request (unless I’ve missed something). It records what it calls the “merge commit”. For a merged pull request, this is the actual merge commit. For rebased and squashed pull requests, it’s the final commit that ended up on the main branch.

In the case of a merged pull request, the answer is easy: cherry-pick the two original commits in the pull request. We can tell the pull request was merged because the merge commit (with a thicker outline) has two parents (it’s actually a merge):

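[Figure: merged pull request with merge commit M (thicker outline) on main; F and G cherry-picked onto the release branch R-S as Fc-Gc]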
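The two-parent test can be sketched in Python. This is only a sketch, assuming the parents were obtained with `git rev-list --parents -n 1 <sha>`, which prints the commit hash followed by its parent hashes on one line. The function name `is_true_merge` is made up for illustration.

```python
def is_true_merge(rev_list_line: str) -> bool:
    """Return True if the recorded "merge commit" really is a merge.

    `rev_list_line` is assumed to be the single line printed by
    `git rev-list --parents -n 1 <sha>`: the commit hash, then the
    hashes of its parents, separated by spaces.
    """
    fields = rev_list_line.split()
    if not fields:
        raise ValueError("expected at least a commit hash")
    parents = fields[1:]  # everything after the commit's own hash
    return len(parents) >= 2
```

A line with three hashes (the commit plus two parents) means the pull request was merged; a line with two hashes means it was rebased or squashed.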

But for rebased and squashed pull requests, the answer is not so simple. We can tell the pull request wasn’t merged, because the recorded “merge commit” isn’t a merge. Somehow we have to figure out how many commits starting with the merge commit are the right ones to take. For a rebased pull request we’d like to cherry-pick as many commits as the pull request had:

[Figure: rebased pull request with Fr-Gr on main; Fr and Gr cherry-picked onto the release branch R-S as Frc-Grc]
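If we already know the pull request was rebased, and we know how many commits it had, picking the commits is simple arithmetic on the recorded “merge commit”. Here is a rough sketch, assuming the commit count is available (the GitHub pull request API reports a `commits` count) and that the rebased commits sit in an unbroken line ending at the recorded commit. The function name `revisions_to_pick` is hypothetical.

```python
from __future__ import annotations


def revisions_to_pick(merge_commit: str, num_commits: int) -> list[str]:
    """List the revisions to cherry-pick for a rebased pull request.

    Walks back `num_commits - 1` first-parent steps from the recorded
    "merge commit" and returns the revisions oldest first, which is the
    order `git cherry-pick` should apply them in.
    """
    if num_commits < 1:
        raise ValueError("a pull request has at least one commit")
    revisions = []
    for steps_back in range(num_commits - 1, -1, -1):
        if steps_back == 0:
            revisions.append(merge_commit)
        else:
            revisions.append(f"{merge_commit}~{steps_back}")
    return revisions
```

For a two-commit pull request this gives the recorded commit's parent and then the recorded commit itself, which correspond to Fr and Gr in the figure.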

And for a squashed pull request, we want to cherry-pick just the one squashed commit:

[Figure: squashed pull request with FGs on main; FGs cherry-picked onto the release branch R-S as FGsc]

But how to tell the difference between these two situations? I don’t know the best approach. Maybe comparing the commit messages? My first way was to look at the count of added and deleted lines. If the merge commit changes as many lines as the pull request as a whole, then just take that one commit. But that could be wrong if a rebased pull request had overlapping commits, and the last commit changed all the lines.
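The line-count comparison looks roughly like this. It is a sketch of the heuristic, not a reliable detector: the names `LineCounts` and `guess_finish` are invented here, and the counts are assumed to come from wherever you get diff statistics (for example the `additions` and `deletions` fields of the GitHub API, or `git show --numstat`).

```python
from typing import NamedTuple


class LineCounts(NamedTuple):
    """Added and deleted line totals for a commit or a pull request."""
    added: int
    deleted: int


def guess_finish(merge_commit: LineCounts, pull_request: LineCounts) -> str:
    """Guess whether a non-merge pull request was squashed or rebased.

    If the recorded "merge commit" changes exactly as many lines as the
    pull request as a whole, assume it is a squash and take one commit.
    Otherwise assume a rebase. This is a heuristic and can be fooled.
    """
    if merge_commit == pull_request:
        return "squash"
    return "rebase"
```

The failure case described above shows up as a false "squash": a rebased pull request whose last commit happens to have the same totals as the whole pull request.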

Is there some bit of information I’ve overlooked? Does Git or GitHub have a way to unambiguously distinguish these cases?

Comments

[gravatar]

To get the changes from a finished pull request onto the release branch, what commits should we cherry-pick?

I think this is the wrong question. Rather, we should be asking "how do we programmatically relate rebased versions of the same commits in git?" (cherry-picking is just another form of rebasing).

The challenge with this situation is that rebased versions of the same commit may have subtle differences due to different conflict resolution with their parent commit. They could potentially have different commit messages.

The two main strategies I can think of are to store this relation in a separate data store somehow or enforce that the relation is stored in the commit metadata (probably the commit message). The latter could be done, for example, by ensuring that the commit messages always had a link to a PR (thus forming the relation) or by having the commit message of the release branch version of the commit reference the trunk version of the commit. You'd have to commit to using tooling and processes to ensure that this relation was maintained.
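One existing convention for the second option is `git cherry-pick -x`, which appends a line of the form `(cherry picked from commit <sha>)` to the new commit's message. A small sketch of reading that relation back out, assuming the release-branch commits were all made with `-x` and the line was never edited (the function name `cherry_pick_sources` is made up):

```python
from __future__ import annotations

import re

# Matches the line that `git cherry-pick -x` appends to a commit message.
_CHERRY_PICK_LINE = re.compile(
    r"^\(cherry picked from commit ([0-9a-f]{7,64})\)$", re.MULTILINE
)


def cherry_pick_sources(message: str) -> list[str]:
    """Return the source commit hashes recorded in a commit message.

    A commit that was cherry-picked more than once can carry several
    such lines, so all matches are returned, in order of appearance.
    """
    return _CHERRY_PICK_LINE.findall(message)
```

An empty list means the message carries no such line, which is exactly the "commit without the link" failure that tooling would have to prevent.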

So the follow-on question is "how are you going to use this relation information?" Is it just for ad-hoc inspection, or do you need it to be 100% correct all the time to drive some process like automated release notes generation?

My personal preference is to just reference the PR in the commit message, since this is often useful information to anyone perusing the commit. When this isn't enforced by tooling, though, it is possible to commit without the link, and then it's over: unless you're willing to go through the pain of a history rewrite, that information is effectively lost.

[gravatar]

I think I kind of completely misinterpreted the need here. I was assuming that if only a small fraction of changes needed to be cherry-picked to release branches, this could just be done manually. The problem I wrote about above is that of figuring out where these changes came from after you cherry-pick them.

[gravatar]

You may already be aware, but FYI, these diagrams don't come through in the RSS feed... or at least Feedly doesn't render them.

[gravatar]

@Nick, no worries, I had a hard time laying out the problem, and I'm sure I skipped some context.

@Alex, the RSS feed has the SVG figures (Firefox displays them), but yeah, looks like Feedly just doesn't. :( Your comment got me to dig into the styling and make the SVG look reasonable in places like RSS feeds, so thanks.

[gravatar]

Hm, is this equivalent to PyPy's biggest reason for not using Git?

I don't know Git well enough to suggest anything that could be done now, but maybe some client-side commit hooks could require tagging commits with an issue number or something in the future?

[gravatar]

@mwchase: PyPy's concern is different: Mercurial records the branch in the commits, so you can tell what branch created each commit. Git does not.

[gravatar]

Hi Ned. I don't have any insights to offer re the issue at hand, but am curious what application/program you used to generate those figures. Thanks.

[gravatar]

@Zhengnan, I used some hand-rolled Python and the Cupid hack: https://nedbatchelder.com/blog/201401/svg_figures_with_cupid.html

Here is the code, though probably not in runnable form: https://gist.github.com/nedbat/271f24f2fb8f0c3832885f491f7a3ca1
