GitLab merge diffs for branches that rewrite commit history

For the development of the Wayland driver for Wine we have been maintaining a commit series on top of upstream Wine, which we want to keep clean and easily reviewable.

For this reason, most proposed changes to the code do not appear as new fix commits on top of the current series, but rather transform the existing series in place, producing an updated clean version. For example, if we have the commit series a-b-c-d-e, an update may fix something in c, leading to the updated series a-b-f-d'-e' (we use d' and e' to signify commits that introduce the same changes as d and e).

Although in this instance we have good reasons to use git in this manner, this public rewriting of commit history is not how git is typically used, and we have found that some tools do not deal too well with this scenario.

One particular pain point we have encountered is in our internal reviewing process. Although it doesn't seem possible to use GitLab to perform the actual merge of a branch that rewrites history, we would still like to get a sensible merge diff for reviews and discussion. However, when we try this we are presented with a seemingly nonsensical diff. To understand what's going on, and hopefully find a solution, we need to dive more deeply into how GitLab, and similar tools, produce merge diffs.

Before we start our journey it would be beneficial to establish a common vocabulary for commits and operations on them. Please read the Appendix for details about the notation used in this post.

GitLab merge diffs

GitLab provides two merge diff methods, base and HEAD.

base is the older method and works by using the ... diff operator. The ... diff operator aims to provide a diff that represents the intention of the developer at the time of branching, regardless of the current state of the target branch. It achieves this by finding the most recent CommonAncestor of target and source, also called the merge base, and diff-ing the source with that ancestor. In the example below:

    a---b---c---d        (target)
         \
          f---g---h      (source)

b is the most recent ancestor of target and source, so we get:

Diff target...source =
Diff (CommonAncestor target source)..source =
Diff b..h

Changes in the target branch after the source's branching point are effectively ignored by the ... diff.
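To make this concrete, here is a small sketch that rebuilds the example graph in a throwaway repository and checks that the ... diff matches a plain diff from the merge base. The file names, the commit helper and the temporary directory are our own illustrative assumptions; only the git commands themselves matter.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/target
git config user.email demo@example.com
git config user.name Demo

# Hypothetical helper: overwrite a file with one line and commit it.
commit() { printf '%s\n' "$2" > "$1"; git add "$1"; git commit -q -m "$3"; }

commit shared.txt v1 a
commit shared.txt v2 b
git branch source              # the branching point is b
commit target.txt c c          # target moves on with c and d
commit target.txt d d
git checkout -q source
commit source.txt f f          # source gets f, g and h
commit source.txt g g
commit source.txt h h

base=$(git merge-base target source)        # CommonAncestor target source
three_dot=$(git diff target...source)       # Diff target...source
explicit=$(git diff "$base" source)         # Diff b..h
[ "$three_dot" = "$explicit" ] && echo "three-dot diff equals diff from the merge base"

# Only the source-side file is listed; the changes in c and d are ignored.
git diff --name-only target...source
```

The final command lists only source.txt, even though target.txt differs between the two tips.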

The problem with the base method is that the produced diff contents tend to become obsolete as the merge target (which is typically the master/main branch) evolves. To make the diff more accurate the HEAD diff method was introduced.

In the HEAD method the target branch is merged into the source branch, and the resulting commit of that merge is compared with the target branch [1].

    a---b---c-------d
         \           \
          f---g---h---m

    (target = d, source = h, m = MergeInto target source)

In this case the produced diff comes from:

Diff target..(MergeInto target source) =
Diff d..m

This diff is a more accurate representation of what one would expect if the source branch was actually merged at that point in time.

However, the HEAD method may fail due to conflicts during merge, and, in this case, the current GitLab behavior is to fall back to base.
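The difference between the two methods can be reproduced locally. The sketch below is our own emulation, not GitLab's actual implementation: it assumes a throwaway repository with made-up file names in which target and source both carry the same fix to shared.txt, and it imitates the fallback to base with a plain if/else.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/target
git config user.email demo@example.com
git config user.name Demo

# Hypothetical helper: overwrite a file with one line and commit it.
commit() { printf '%s\n' "$2" > "$1"; git add "$1"; git commit -q -m "$3"; }

commit shared.txt old a
commit readme.txt notes b
git branch source
commit shared.txt new c        # target fixes shared.txt ...
commit target.txt d d
git checkout -q source
commit shared.txt new f        # ... and source carries the same fix
commit source.txt g g
commit source.txt h h

# base method: diff from the merge base, unaware of c.
base_files=$(git diff --name-only target...source)

# HEAD method: merge target into a disposable copy of source, then diff.
git checkout -q --detach source
if git merge -q --no-edit target >/dev/null 2>&1; then
  head_files=$(git diff --name-only target HEAD)     # Diff d..m
else
  git merge --abort
  head_files=$base_files       # emulate the fallback to base on conflicts
fi

echo "base: $base_files"
echo "HEAD: $head_files"
```

Here the base method still reports shared.txt as changed, although target already contains that change, while the HEAD method reports only source.txt.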

The problem

Now, let's consider what happens when we ask GitLab to provide the merge diff for our original example, trying to merge a branch fix that rewrites history, into master:

    a---b---c---d---e      (master)
         \
          f---d'--e'       (fix)

First GitLab will try the HEAD method. If the merge of master into fix succeeds then we will get a nice diff. Chances are, though, that this merge will fail with conflicts, because c and f change the same code in different ways, and GitLab will fall back to the base method. The base method will then give us git diff b..e', whereas we would like to see git diff e..e'.
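The following sketch reproduces this situation in a throwaway repository. The file names and the commit helper are hypothetical, and we assume that c and f touch the same lines, which is what makes the merge conflict.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.email demo@example.com
git config user.name Demo

# Hypothetical helper: append one line to a file and commit the result.
commit() { printf '%s\n' "$2" >> "$1"; git add "$1"; git commit -q -m "$3"; }

commit code.txt "line a" a
commit code.txt "line b" b
git branch fix                     # both series share a-b
commit code.txt "buggy line" c     # master: c, d, e
commit d.txt "d" d
commit e.txt "e" e
git checkout -q fix
commit code.txt "fixed line" f     # fix: f replaces c, followed by d', e'
commit d.txt "d" "d'"
commit e.txt "e" "e'"

# HEAD method: merging master into a disposable copy of fix conflicts.
git checkout -q --detach fix
if git merge -q --no-edit master >/dev/null 2>&1; then
  head_method=ok
else
  head_method=conflict
  git merge --abort
fi
echo "HEAD method: $head_method"

# base method: the diff starts at b, so the whole rewritten series shows up.
git diff --stat master...fix
# What a reviewer wants to see: only the change that turns c into f.
git diff --stat master fix
```

The first stat lists code.txt, d.txt and e.txt, whereas the second lists only code.txt.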

Making the HEAD method work

How can we trick GitLab into producing a sane diff for our reviewing purposes? One idea is to try to make the HEAD method succeed and provide the expected results. This can be expressed by the following pseudo-equations:

Diff master..(MergeInto master fix) = Diff master..fix
Tree (MergeInto master fix) = Tree fix

If we could change that pesky MergeInto operation to become MergeIntoKeepTarget we would get our result instantly. Sadly, changing GitLab itself is outside the scope of this effort.

Not all is lost, however. We can't change MergeInto or master, but we are free to change fix (since we are the ones proposing it) to fulfill the equation. The only requirement is that the changed fix, let's call it fix', has the same tree contents as fix itself (although it may have additional merge parents).

But which fix' can we use to fulfill the above equality? There is a notable property of merging that we can make use of: if the merge target already has the merge source as a merge parent then the merge operation is a no-op. In other words the following holds:

SomeMerge2 x (SomeMerge1 x y) = SomeMerge1 x y

This property allows us to simplify our previous equation to:

Tree (MergeInto master fix') = Tree fix'

as long as fix' is of the form SomeMerge master fix. Given that we want fix' to have the same contents as fix, the decision about which merge to use is easy:

fix' = MergeIntoKeepTarget master fix

    a---b---c---d---e
         \           \
          f---d'--e'--m

    (master = e, fix' = m)

Here is what happens if we use the GitLab HEAD method to produce the diff of merging fix' into master:

Diff master..(MergeInto master fix') = Diff e..(MergeInto e m)

Since m already has e as a merge parent, MergeInto e m just gives m, so we get:

... = Diff e..m

Because MergeIntoKeepTarget gave m the same tree contents as e' we can replace one for the other in the .. diff operation:

... = Diff e..e' = Diff master..fix

Making the base method work

How about trying to bend the base method to our will? This could be useful for systems that don't support HEAD (e.g., GitHub or older versions of GitLab). This is expressed by the following pseudo-equations:

Diff master...fix = Diff master..fix // Note three vs two dot diffs
Diff (CommonAncestor master fix)..fix = Diff master..fix

As in our previous attempt, we can't change master, but we can change fix, as long as we produce a new fix' with the same contents:

Diff (CommonAncestor master fix')..fix' = Diff master..fix
Tree (CommonAncestor master fix') = Tree master AND Tree fix' = Tree fix

But which fix' can we use to fulfill the above equality this time? Another interesting property comes to the rescue: if commit x is an ancestor of y, then CommonAncestor x y = x. This provides a possible way forward: we need to make master an ancestor of fix'. The only way to change commit connections is with a merge, and we have just the merge we need, MergeIntoKeepTarget:

fix' = MergeIntoKeepTarget master fix

    a---b---c---d---e
         \           \
          f---d'--e'--m

    (master = e, fix' = m)

By a happy coincidence (or is it?) we again get the same solution as with the HEAD method!

Here is what happens if we use the GitLab base method to produce the diff of merging fix' into master:

Diff master...fix' = Diff e...m

The ancestry relationship between m and e makes the ... diff operator between them act effectively like the .. diff operator, so we get:

... = Diff e..m

Because MergeIntoKeepTarget gave m the same tree contents as e' we can replace one for the other in the .. diff operation:

... = Diff e..e' = Diff master..fix

Putting theory into practice

The theory behind our solution may be somewhat involved, but it leads to a straightforward practical recipe. It all boils down to creating the fix' commit and using that as our merge request source. Since:

fix' = MergeIntoKeepTarget master fix

we can create a detached/disposable fix' by doing:

$ git checkout --detach fix
$ git merge -s ours master

Then we can push our detached fix' to our MR branch, for example by doing:

$ git push our-remote HEAD:fix

We can also verify that our solution works by locally getting the diff that GitLab will produce for the base method:

$ git diff master...HEAD

And for the HEAD method:

$ git merge master
Already up to date
$ git diff master..HEAD
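For completeness, the whole recipe can be tried end to end without touching a real project. The sketch below assumes a throwaway repository with hypothetical file names that mirrors the master/fix example, builds fix' on a detached HEAD, and checks that both methods now yield Diff master..fix. The push step is left out because the sketch has no remote.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.email demo@example.com
git config user.name Demo

# Hypothetical helper: append one line to a file and commit the result.
commit() { printf '%s\n' "$2" >> "$1"; git add "$1"; git commit -q -m "$3"; }

commit code.txt "line a" a
commit code.txt "line b" b
git branch fix
commit code.txt "buggy line" c     # master: c, d, e
commit d.txt "d" d
commit e.txt "e" e
git checkout -q fix
commit code.txt "fixed line" f     # fix: f, d', e'
commit d.txt "d" "d'"
commit e.txt "e" "e'"

# fix' = MergeIntoKeepTarget master fix, created on a detached HEAD.
git checkout -q --detach fix
git merge -q -s ours --no-edit master

wanted=$(git diff master fix)          # Diff master..fix
base_diff=$(git diff master...HEAD)    # what the base method would show

# HEAD method: merging master again must leave HEAD untouched.
before=$(git rev-parse HEAD)
git merge -q --no-edit master
after=$(git rev-parse HEAD)
head_diff=$(git diff master HEAD)

[ "$base_diff" = "$wanted" ] && echo "base method shows master..fix"
[ "$head_diff" = "$wanted" ] && [ "$before" = "$after" ] && echo "HEAD method shows master..fix"
```

Note that the fix branch itself is never moved; only the disposable merge commit would be pushed to the merge request branch.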

Appendix: A notation for commits and their operations

In the notation used in this post each commit is represented by a letter, denoting the changes introduced by the commit. A prime (') symbol (e.g., a') denotes a commit introducing the same changes as its non-prime counterpart.

A commit series is denoted with a graph of commits, each commit pointing to its parent(s). A fork in the graph is a branching point:

    a---b---c---d---e
             \
              f---g

A few interesting operations:

Tree x

The contents of the tree at commit x.

CommonAncestor x y

The most recent common ancestor of commits x and y. For example, in our sample commit graph we have CommonAncestor e g = c.

Diff x..y

The difference between the tree contents of commits x and y.

Diff x...y

The difference between the tree contents of commit y and base = CommonAncestor x y.

MergeInto x y

Merges tip x into tip y, producing a new commit. For example, MergeInto e g produces the following m:

    a---b---c---d---e
             \       \
              f---g---m

The contents of commit m are not uniquely defined. They are determined by the merge algorithm, and, in case of conflicts, by explicit user input.

MergeIntoKeepTarget x y

Merges tip x into tip y, producing a new commit which has the same tree contents as y. For example, MergeIntoKeepTarget e g gives:

    a---b---c---d---e
             \       \
              f---g---g'

In terms of git commands this operation corresponds to git merge -s ours e, assuming g is checked out before the merge.
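As a quick sanity check of this definition, the sketch below builds the sample graph in a throwaway repository and compares tree hashes before and after the merge. The branch names tip-e and tip-g and the one-file-per-commit layout are assumptions made purely for illustration.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/tip-e
git config user.email demo@example.com
git config user.name Demo

# Hypothetical helper: each commit adds one file named after the commit.
commit() { printf '%s\n' "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit a; commit b; commit c
git branch tip-g                   # fork at c
commit d; commit e                 # tip-e now ends at e
git checkout -q tip-g
commit f; commit g                 # tip-g now ends at g

g=$(git rev-parse HEAD)
g_tree=$(git rev-parse 'HEAD^{tree}')          # Tree g

git merge -q -s ours --no-edit tip-e           # g' = MergeIntoKeepTarget e g
merged_tree=$(git rev-parse 'HEAD^{tree}')     # Tree g'

[ "$g_tree" = "$merged_tree" ] && echo "Tree g' = Tree g"
git merge-base --is-ancestor tip-e HEAD && echo "e is an ancestor of g'"
[ ! -e d.txt ] && echo "the changes from d and e are not in the tree of g'"
```

The merge commit has both g and e as parents, yet none of the files introduced by d and e appear in its tree.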


1. Note that the GitLab documentation actually says that the target branch is artificially merged into the source branch, then the resulting merge ref is compared to the source branch, rather than the other way around. I believe this to be inaccurate since it contradicts both the article which inspired this feature, and my understanding of the GitLab code.