title: How to rewrite Git history while keeping message commit references
date: 2022-03-26 12:41
last_modified_at: 2022-03-29 19:17
url: how-to-rewrite-git-history-while-keeping-message-commit-references
layout: post
category: Tutorials
image: /img/blog/how-to-rewrite-git-history-while-keeping-message-commit-references_1.png
description: Mankind has a duty of memory

Introduction

Sometimes you want to clean up your Git history, for example to remove a production secret that was redacted later but is still present in history, or to change an old committer identity.

⚠️ As such operations are very dangerous, please read this post fully before running anything, and note that I hereby decline any responsibility (as always) if something bad happens to your project.

The problem

If you have reached this page, you already know the problem: rewriting Git history changes the identifier (SHA) of the first affected commit and of every commit that follows it, and there is nothing you can do about it.

If developers used to mention commit references in their own commit messages (like This commit follows 40d5014 [...]), those references won't mean anything once the rewrite is done.
Moreover, if some of your commits "revert" others, their messages are affected too (Git does not update them automatically).
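
To get a feel for how many messages are concerned before rewriting anything, here is a minimal sketch that lists the hash-like words found in commit messages and keeps only those that resolve to a real commit. The throwaway repository, the Demo identity and the commit messages are assumptions made up for this illustration; in a real project you would only run the loop, from inside your own repository.

```sh
#!/bin/sh
# Throwaway repository (hypothetical), so that the sketch can run anywhere.
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"

git commit -q --allow-empty -m "Initial commit"
first="$(git rev-parse --short=7 HEAD)"
git commit -q --allow-empty -m "This commit follows ${first}"

# Collect every 7 to 40 character hexadecimal word from all commit messages,
# then keep the ones that Git can resolve to an existing commit.
count=0
for candidate in $(git log --all --format=%B | LC_ALL=C grep -oE "\b[0-9a-fA-F]{7,40}\b" | sort -u); do
	if git rev-parse --verify --quiet "${candidate}^{commit}" >/dev/null; then
		count=$((count + 1))
		echo "reference found: ${candidate}"
	fi
done
echo "${count} commit reference(s) would break after a rewrite"
```

Every reference reported here is one that a naive rewrite would leave pointing at a commit that no longer exists.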

The workaround

So we somehow have to "update" commit references dynamically while rewriting the history, according to the new commit identifiers.

Below is a script implementing this, derived from one of the examples in the official GIT-FILTER-BRANCH(1) manual page. It replaces the root <root@localhost> identity with John Doe <john@example.com>:

{% highlight sh %}
git filter-branch \
	--env-filter '
		if test "$GIT_AUTHOR_NAME" = "root"
		then
			GIT_AUTHOR_NAME="John Doe"
		fi
		if test "$GIT_AUTHOR_EMAIL" = "root@localhost"
		then
			GIT_AUTHOR_EMAIL=john@example.com
		fi
		if test "$GIT_COMMITTER_NAME" = "root"
		then
			GIT_COMMITTER_NAME="John Doe"
		fi
		if test "$GIT_COMMITTER_EMAIL" = "root@localhost"
		then
			GIT_COMMITTER_EMAIL=john@example.com
		fi
	' \
	--commit-filter '
		# Append an "old SHA,new SHA" line to the mapping file
		printf "%s" "${GIT_COMMIT}," >> ../commits_mapping
		git commit-tree "$@" | tee -a ../commits_mapping
	' \
	--tag-name-filter cat \
	--msg-filter '
		message="$(cat)"
		# Every 7 to 40 character hexadecimal word is a candidate reference
		commit_refs="$(echo "$message" | LC_ALL=C grep -oE "\b[0-9a-fA-F]{7,40}\b")"
		for commit_ref in $commit_refs; do
			new_sha="$(grep "^${commit_ref}" ../commits_mapping | cut -d, -f2)"
			# Not in the mapping: leave the word untouched
			if test -z "$new_sha"
			then
				continue;
			fi
			# Keep the abbreviation length used in the original message
			commit_ref_len="$(printf "%s" "$commit_ref" | wc -m)"
			new_commit_ref="$(echo "$new_sha" | cut -c "1-${commit_ref_len}")"
			message="$(echo "$message" | sed "s/${commit_ref}/${new_commit_ref}/g")"
		done

		echo "$message"
	' \
	-- --all
{% endhighlight %}
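
If you want to see what the message filter does without touching a repository, the same logic can be exercised on its own. This is only a sketch: the rewrite_refs function name, the temporary mapping file and both SHAs are made up for the illustration, and the string length is taken with ${#commit_ref} instead of wc -m.

```sh
#!/bin/sh
# Hypothetical mapping file, in the "old SHA,new SHA" format written by the
# commit filter. Both SHAs are invented for this demo.
mapping="$(mktemp)"
cat > "$mapping" <<'EOF'
40d5014a1b2c3d4e5f60718293a4b5c6d7e8f901,9f8e7d6c5b4a39281706f5e4d3c2b1a098765432
EOF

# Reads a commit message on stdin and prints it with known references updated.
rewrite_refs() {
	message="$(cat)"
	commit_refs="$(echo "$message" | LC_ALL=C grep -oE "\b[0-9a-fA-F]{7,40}\b")"
	for commit_ref in $commit_refs; do
		new_sha="$(grep "^${commit_ref}" "$mapping" | cut -d, -f2)"
		if test -z "$new_sha"; then
			continue
		fi
		# Keep the same abbreviation length as in the original message.
		new_commit_ref="$(echo "$new_sha" | cut -c "1-${#commit_ref}")"
		message="$(echo "$message" | sed "s/${commit_ref}/${new_commit_ref}/g")"
	done
	echo "$message"
}

result="$(echo "This commit follows 40d5014 and fixes a typo" | rewrite_refs)"
echo "$result"
```

The abbreviated reference keeps its original length (7 characters here), and a hash-like word that is absent from the mapping is left as it is.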

You may have noticed that the filter scripts are fully POSIX-compatible, so they are supposed to work in most environments (maybe even yours 😉).

The script does a few other things too:

  • Committer identities are updated as well;

  • All branches are rewritten (this may not be what you want!);

  • Tags are updated too (they will point to the same effective version of the code).

A workaround pitfall

TL;DR: beware of word collisions across commit messages.

There is a caveat that we have to share though, because of the use of regular expressions in the msg-filter script:

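You might encounter collisions between commit references and real-life words existing in your language: any word of 7 letters or more made only of the letters a to f matches the regular expression, exactly like an abbreviated SHA would.

For a project with commit messages written in English, you can safely run the above Git migration, because there are none: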

{% highlight bash %}
LC_ALL=C grep -oE "\b[0-9a-fA-F]{7,40}\b" /usr/share/hunspell/en_US.dic
{% endhighlight %}

If you happen to use shorter SHAs (say, 6-character references), there are collisions in English:

{% highlight bash %}
LC_ALL=C grep -oE "\b[0-9a-fA-F]{6,40}\b" /usr/share/hunspell/en_US.dic
accede
bedded
cabbed
dabbed
decade
efface
facade
{% endhighlight %}

For an Italian project, there are collisions even with 7-character references (😨):

{% highlight bash %}
LC_ALL=C grep -oE "\b[0-9a-fA-F]{7,40}\b" /usr/share/hunspell/it_IT.dic
accadde
decadde
{% endhighlight %}
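
If no hunspell dictionary is installed for your language, the same check works against any word list. Here is a small sketch where the word list is a made-up temporary file standing in for a real dictionary, and hex_like_words is a hypothetical helper name; the minimum length is a parameter, so you can compare thresholds.

```sh
#!/bin/sh
# Hypothetical word list standing in for a real dictionary file.
words="$(mktemp)"
printf '%s\n' accede commit decade accadde rebase decadde > "$words"

# $1: minimum reference length, $2: file to scan.
# Prints every word that the msg-filter regular expression would take for a SHA.
hex_like_words() {
	LC_ALL=C grep -oE "\b[0-9a-fA-F]{$1,40}\b" "$2"
}

six="$(hex_like_words 6 "$words" | paste -s -d ' ' -)"
seven="$(hex_like_words 7 "$words" | paste -s -d ' ' -)"
echo "6 characters and more: $six"
echo "7 characters and more: $seven"
```

Pointing the second argument at a dump of your own commit messages instead of a dictionary gives the list of words that are actually at risk in your project.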

Last words

Please also note that git filter-branch has been deprecated since Git v2.24.0, and git filter-repo should be preferred.
If you manage to adapt the solution described in this post to that tool, feel free to post a comment below!

It actually turns out that git filter-repo supports this feature by default! 🎉
So it should definitely be preferred over git filter-branch, but sometimes only legacy tools are available...


Many thanks to the co-author of this script, who will know who he is 🙏