---
title: "How to rewrite Git history while keeping message commit references"
date: 2022-03-26 12:41
last_modified_at: 2022-03-29 19:17
url: how-to-rewrite-git-history-while-keeping-message-commit-references
layout: post
category: Tutorials
image: /img/blog/how-to-rewrite-git-history-while-keeping-message-commit-references_1.png
description: "Mankind has a duty of memory"
---
[![A missing blog post image](/img/blog/how-to-rewrite-git-history-while-keeping-message-commit-references_1.png)](/img/blog/how-to-rewrite-git-history-while-keeping-message-commit-references_1.png)
### Introduction
Sometimes, you would like to clean your Git history (let's say, to remove [a redacted production secret still present in history](https://www.root-me.org/en/Challenges/Web-Server/Insecure-Code-Management), or maybe change an old committer identity).
> :warning: As such operations are very dangerous, please read this post **fully** before running anything, and note that I hereby decline any responsibility (as always) if something bad happens to your project.
### The problem
If you have reached this page, you already know the problem: rewriting Git history changes the identifier (SHA) of every commit from the first affected one onwards, and [you cannot do a thing about it](https://stackoverflow.com/questions/64204804/dirty-trick-to-keep-commit-hashes-when-rewriting-git-history).
If developers cited commit references in their commit messages (like `This commit follows 40d5014 [...]`), those references will no longer point to anything once the rewrite is done.
Moreover, commits that "revert" others are affected in the same way (Git does not update them automatically).
### The workaround
So we somehow have to "update" commit references dynamically while rewriting the history, according to the new commit identifiers.
Below is a script implementing this, derived from one of the official GIT-FILTER-BRANCH(1) manual page examples, which replaces the `root <root@localhost>` identity with `John Doe <john@example.com>`:
{% highlight sh %}
git filter-branch \
    --env-filter '
        # Replace the old author identity
        if test "$GIT_AUTHOR_NAME" = "root"
        then
            GIT_AUTHOR_NAME="John Doe"
        fi
        if test "$GIT_AUTHOR_EMAIL" = "root@localhost"
        then
            GIT_AUTHOR_EMAIL=john@example.com
        fi
        # Replace the old committer identity
        if test "$GIT_COMMITTER_NAME" = "root"
        then
            GIT_COMMITTER_NAME="John Doe"
        fi
        if test "$GIT_COMMITTER_EMAIL" = "root@localhost"
        then
            GIT_COMMITTER_EMAIL=john@example.com
        fi
    ' \
    --commit-filter '
        # Record an "old SHA,new SHA" pair for every rewritten commit
        printf "%s" "${GIT_COMMIT}," >> ../commits_mapping
        git commit-tree "$@" | tee -a ../commits_mapping
    ' \
    --tag-name-filter cat \
    --msg-filter '
        message="$(cat)"
        # Candidate references: hexadecimal words of 7 to 40 characters
        commit_refs="$(echo "$message" | LC_ALL=C grep -oE "\b[0-9a-fA-F]{7,40}\b")"
        for commit_ref in $commit_refs; do
            # Look up the new SHA of the referenced (already rewritten) commit
            new_sha="$(grep "^${commit_ref}" ../commits_mapping | cut -d, -f2)"
            if test -z "$new_sha"
            then
                continue;
            fi
            # Abbreviate the new SHA to the length of the original reference
            commit_ref_len="$(printf "%s" "$commit_ref" | wc -m)"
            new_commit_ref="$(echo "$new_sha" | cut -c "1-${commit_ref_len}")"
            message="$(echo "$message" | sed "s/${commit_ref}/${new_commit_ref}/g")"
        done
        echo "$message"
    ' \
    -- --all
{% endhighlight %}
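
To see what the `msg-filter` step does without touching a repository, here is a minimal standalone sketch of the same substitution logic. The `rewrite_refs` function name, the temporary mapping file and both full SHAs are made up for illustration; only the `old SHA,new SHA` line format comes from the script above.

```sh
# Hypothetical mapping file, in the "old SHA,new SHA" format written by the commit filter.
mapping="$(mktemp)"
printf '%s\n' \
    "40d5014a1b2c3d4e5f60718293a4b5c6d7e8f901,9f8e7d6c5b4a39281706f5e4d3c2b1a098765432" \
    > "$mapping"

# Mirrors the msg-filter: reads a message on stdin, takes the mapping file as $1.
rewrite_refs() {
    message="$(cat)"
    commit_refs="$(printf '%s\n' "$message" | LC_ALL=C grep -oE "\b[0-9a-fA-F]{7,40}\b")"
    for commit_ref in $commit_refs; do
        new_sha="$(grep "^${commit_ref}" "$1" | cut -d, -f2)"
        # Unknown reference (or ordinary word): leave it untouched
        test -z "$new_sha" && continue
        # ${#var} is used here instead of "wc -m" to get the reference length
        new_commit_ref="$(printf '%s\n' "$new_sha" | cut -c "1-${#commit_ref}")"
        message="$(printf '%s\n' "$message" | sed "s/${commit_ref}/${new_commit_ref}/g")"
    done
    printf '%s\n' "$message"
}

result="$(printf '%s\n' "This commit follows 40d5014 and fixes a typo" | rewrite_refs "$mapping")"
printf '%s\n' "$result"
rm -f "$mapping"
```

The abbreviated reference keeps its original length: a 7-character reference is replaced by the first 7 characters of the new SHA.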
You may have noticed that the filtering scripts stick to POSIX shell syntax, so they are _supposed_ to work in most environments (maybe even yours :wink:). One caveat: `grep -o` and `\b` are extensions that POSIX does not specify, although GNU grep supports both.
The script does a few other things too:
* Committer identities are updated as well;
* **All** branches are rewritten (this may not be something that you want!);
* Tags are updated too (they will point to the same effective version of the code).
### A workaround pitfall
> **TL;DR**: beware of word collisions across commit messages.
There is a caveat that we have to share though, because of the regular expression used in the `msg-filter` script:
[![A missing blog post image](/img/blog/how-to-rewrite-git-history-while-keeping-message-commit-references_2.png)](https://www.explainxkcd.com/wiki/index.php?title=1171:_Perl_Problems)
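You _might_ encounter collisions between commit references and real words of your language: any word made only of the letters `a` to `f` and long enough to match the pattern is treated as a reference, and gets replaced if it also happens to be the prefix of a rewritten SHA.
For a project with commit messages written in English, you can safely run the above Git migration, because the dictionary contains no such word: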
{% highlight bash %}
LC_ALL=C grep -oE "\b[0-9a-fA-F]{7,40}\b" /usr/share/hunspell/en_US.dic
{% endhighlight %}
If you happened to use shorter SHAs (let's say, 6-character references), there **are** collisions in English:
{% highlight bash %}
LC_ALL=C grep -oE "\b[0-9a-fA-F]{6,40}\b" /usr/share/hunspell/en_US.dic
accede
bedded
cabbed
dabbed
decade
efface
facade
{% endhighlight %}
For an Italian project, there **are** collisions, even with 7-character references (:fearful:):
{% highlight bash %}
LC_ALL=C grep -oE "\b[0-9a-fA-F]{7,40}\b" /usr/share/hunspell/it_IT.dic
accadde
decadde
{% endhighlight %}
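
Dictionaries only tell part of the story, so it may be worth listing what the pattern catches in your own commit messages before migrating. The sketch below is one way to do it, assuming a hypothetical `list_candidate_refs` helper; the sample messages are invented, and on a real repository the input would rather come from `git log --all --format=%B`.

```sh
# List, once each, the words of a text that the msg-filter would treat as commit references.
list_candidate_refs() {
    LC_ALL=C grep -oE "\b[0-9a-fA-F]{7,40}\b" | sort -u
}

# Sample input; on a real repository, feed it with: git log --all --format=%B
candidates="$(printf '%s\n' "Fix 40d5014" "La torre decadde (see 40d5014)" | list_candidate_refs)"
printf '%s\n' "$candidates"
```

Any entry of that list that is a word rather than a reference deserves a manual check before the rewrite.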
### Last words
Please also note that `git filter-branch` has been [deprecated since Git v2.24.0](https://github.com/git/git/commit/9df53c5de6e687df9cd7b36e633360178b65a0ef), and [filter-repo](https://github.com/newren/git-filter-repo/) should be preferred.
~~If you managed to adapt the solution described in this post with this tool, feel free to post a comment below!~~
It actually turns out that [filter-repo supports this feature by default](https://github.com/newren/git-filter-repo/#design-rationale-behind-filter-repo)! :tada:
So it definitely should be preferred over `git filter-branch`, but sometimes, only legacy tools are available...
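
For reference, the identity change from this post could probably be expressed for filter-repo with a mailmap file. This is an untested sketch: the temporary file is an assumption, and unlike the `env-filter` above, a mailmap entry of this form matches the old name and the old email together rather than independently.

```sh
# Hypothetical mailmap file: "New Name <new email> Old Name <old email>".
mailmap="$(mktemp)"
cat > "$mailmap" <<'EOF'
John Doe <john@example.com> root <root@localhost>
EOF

# To be run from a fresh clone (left commented out here, as it rewrites history):
# git filter-repo --mailmap "$mailmap"
cat "$mailmap"
```

No mapping file or message filter is needed in that case, since filter-repo takes care of the commit references found in messages.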
---
> Many thanks to the co-author of this script, who will recognize himself :pray: