Documentation: add guides for people converting from filter-branch or BFG
Signed-off-by: Elijah Newren <newren@gmail.com>
parent 4cfc765eb1
commit 5c4637ff81
@ -0,0 +1,156 @@
# Cheat Sheet: Converting from BFG Repo Cleaner

This document is aimed at folks who are familiar with BFG Repo Cleaner
and want to learn how to convert over to using filter-repo.

## Table of Contents

* [Half-hearted conversions](#half-hearted-conversions)
* [Intention of "equivalent" commands](#intention-of-equivalent-commands)
* [Basic Differences](#basic-differences)
* [Cheat Sheet: Conversion of Examples from BFG](#cheat-sheet-conversion-of-examples-from-bfg)

## Half-hearted conversions

You can switch almost any BFG command to use filter-repo under the
covers by just replacing the `java -jar bfg.jar` part of the command
with [`bfg-ish`](../contrib/filter-repo-demos/bfg-ish).

bfg-ish is a reasonable tool, and provides a number of bug fixes and
features on top of bfg, but most of my focus is naturally on
filter-repo, which has a number of capabilities lacking in bfg-ish.

## Intention of "equivalent" commands

BFG and filter-repo have a few differences, highlighted in the Basic
Differences section below, that make it hard to get commands that
behave identically. Rather than focusing on matching BFG output as
exactly as possible, I treat the BFG examples as idiomatic ways to
solve a certain type of problem with BFG, and express how one would
idiomatically solve the same problem in filter-repo. Sometimes that
means the results are not identical, but they are largely the same in
each case.

## Basic Differences

BFG operates directly on tree objects, which have no notion of their
leading path. Thus, it has no way of differentiating between
'README.md' at the toplevel versus in some subdirectory. You simply
operate on the basename of files and directories. This precludes
doing things like renaming files and directories or other bigger
restructures. By directly operating on trees, it also runs into
problems with loose vs. packed objects, loose vs. packed refs, not
understanding replace refs or grafts, and not understanding the index
and working tree as another data source.

With `git filter-repo`, you are essentially given an editing tool to
operate on the [fast-export](https://git-scm.com/docs/git-fast-export)
serialization of a repo, which operates on filenames including their
full paths from the toplevel of the repo. Directories are not
separately specified, so any directory-related filtering is done by
checking the leading path of each file. Further, you aren't limited
to the pre-defined filtering types; Python callbacks which operate on
the data structures from the fast-export stream can be provided to do
just about anything you want. By leveraging fast-export and
fast-import, filter-repo gains automatic handling of objects and refs
whether they are packed or not, automatic handling of replace refs and
grafts, and future features that may appear. It also tries hard to
provide a full rewrite solution, so it takes care of additional
important concerns such as updating the index and working tree and
running an automatic gc for the user afterwards.
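
To make the contrast concrete, here is a minimal Python sketch. It is not code from either tool: the `paths` list and both helper functions are hypothetical, and they only illustrate why matching on basenames cannot tell a toplevel `README.md` from one in a subdirectory, while matching on full paths can.

```python
import posixpath

# Hypothetical file list; paths in git history always use '/' separators.
paths = ["README.md", "docs/README.md", "src/main.c"]


def match_by_basename(paths, name):
    """Basename-only matching, in the spirit of how BFG sees files."""
    return [p for p in paths if posixpath.basename(p) == name]


def match_by_full_path(paths, full_path):
    """Full-path matching, with paths taken from the toplevel of the repo."""
    return [p for p in paths if p == full_path]


print(match_by_basename(paths, "README.md"))
print(match_by_full_path(paths, "README.md"))
```

The basename match selects both `README.md` files, whereas the full-path match selects only the toplevel one.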

The "protection" and "privacy" defaults in BFG are something I
fundamentally disagreed with for a variety of reasons; see the
comments at the top of the
[bfg-ish](../contrib/filter-repo-demos/bfg-ish) script if you want
details. The bfg-ish script implemented these protection and privacy
options since it was designed to act like BFG, but still flipped the
default to the opposite of what BFG chose. This means a number of
things with filter-repo:

* any filters you specify will also be applied to HEAD, so that you
  don't have a weird disconnect from your history transformations
  only being applied to most commits
* `[formerly OLDHASH]` references are not munged into commit
  messages; the replace refs that filter-repo adds are a much
  cleaner way of looking up commits by old commit hashes.
* `Former-commit-id:` footers are not added to commit messages; the
  replace refs that filter-repo adds are a much cleaner way of
  looking up commits by old commit hashes.
* History is not littered with `<filename>.REMOVED.git-id` files.

BFG expects you to specify the repository to rewrite as its final
argument, whereas filter-repo expects you to cd into the repo and then
run filter-repo.

## Cheat Sheet: Conversion of Examples from BFG

### Stripping big blobs

```shell
java -jar bfg.jar --strip-blobs-bigger-than 100M some-big-repo.git
```

becomes

```shell
git filter-repo --strip-blobs-bigger-than 100M
```

### Deleting files

```shell
java -jar bfg.jar --delete-files id_{dsa,rsa} my-repo.git
```

becomes

```shell
git filter-repo --use-base-name --path id_dsa --path id_rsa --invert-paths
```

### Removing sensitive content

```shell
java -jar bfg.jar --replace-text passwords.txt my-repo.git
```

becomes

```shell
git filter-repo --replace-text passwords.txt
```

The `--replace-text` option was a really clever idea that the BFG came
up with and I just implemented mostly as-is within filter-repo. Sadly,
BFG didn't document the format of files passed to `--replace-text` very
well, but I added more detail in the filter-repo documentation.

There is one small but important difference between the two tools: if
you use both "regex:" and "==>" on a single line to specify a regex
search and replace, then filter-repo will use "\1", "\2", "\3",
etc. for replacement strings whereas BFG used "$1", "$2", "$3", etc.
The reason for this difference is simply that Python uses backslashes
in its regex format while Scala uses dollar signs, and both tools
wanted to just pass along the strings unmodified to the underlying
language. (Since bfg-ish attempts to emulate the BFG, it accepts
"$1", "$2" and so forth and translates them to "\1", "\2", etc. so
that filter-repo/Python will understand it.)
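
The translation between the two replacement styles can be sketched in a few lines of Python. This is an illustration only, not the code bfg-ish actually uses: the function name `bfg_to_python_replacement` and the sample rule are hypothetical, and plain `str` is used for readability even though filter-repo itself works with bytes.

```python
import re


def bfg_to_python_replacement(replacement):
    """Rewrite Scala-style group references ($1, $2, ...) as \\1, \\2, ..."""
    return re.sub(r"\$(\d+)", r"\\\1", replacement)


# Hypothetical rule: mask the value in "password=VALUE" but keep the key.
pattern = r"(password)=\w+"
bfg_style = "$1=***"
python_style = bfg_to_python_replacement(bfg_style)

print(python_style)
print(re.sub(pattern, python_style, "password=hunter2"))
```

The second `print` shows the translated replacement string being accepted directly by Python's `re.sub`.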

### Removing files and folders with a certain name

```shell
java -jar bfg.jar --delete-folders .git --delete-files .git --no-blob-protection my-repo.git
```

becomes

```shell
git filter-repo --invert-paths --path-glob '*/.git' --path .git
```

Yes, that glob will handle .git directories one or more directories
deep; it's a git-style glob rather than a shell-style glob. Also, the
`--path .git` was added because `--path-glob '*/.git'` won't match a
directory named .git in the toplevel directory since it has a '/'
character in the glob expression (though I would hope the repository
doesn't have a tracked .git toplevel directory in its history).
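
The behavior of that glob can be approximated with Python's `fnmatch` module, where `*` is also allowed to match `/`. The sketch below assumes those semantics for illustration; it is not filter-repo's implementation, and the helper `matches` and the sample paths are hypothetical.

```python
from fnmatch import fnmatchcase


def matches(path, glob="*/.git"):
    """Return True if `path` matches `glob`, letting '*' span '/' characters."""
    return fnmatchcase(path, glob)


# Hypothetical paths: nested .git entries versus one at the toplevel.
for path in ["sub/.git", "a/b/c/.git", ".git"]:
    print(path, matches(path))
```

The nested paths match because `*` can span several directory levels, while the toplevel `.git` does not because the glob requires a `/` before `.git`.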
@ -0,0 +1,310 @
# Cheat Sheet: Converting from filter-branch

This document is aimed at folks who are familiar with filter-branch and want
to learn how to convert over to using filter-repo.

## Table of Contents

* [Half-hearted conversions](#half-hearted-conversions)
* [Intention of "equivalent" commands](#intention-of-equivalent-commands)
* [Basic Differences](#basic-differences)
* [Cheat Sheet: Conversion of Examples from the filter-branch manpage](#cheat-sheet-conversion-of-examples-from-the-filter-branch-manpage)

## Half-hearted conversions

You can switch nearly any `git filter-branch` command to use
filter-repo under the covers by just replacing the `git filter-branch`
part of the command with
[`filter-lamely`](../contrib/filter-repo-demos/filter-lamely). The
git.git regression testsuite passes when I swap out the filter-branch
script with filter-lamely, for example. (However, the filter-branch
tests are not very comprehensive, so don't rely on that too much.)

Doing a half-hearted conversion has nearly all of the drawbacks of
filter-branch and nearly none of the benefits of filter-repo, but it
will make your command run a few times faster and makes for a very
simple conversion.

You'll get a lot more performance, safety, and features by just
switching to direct filter-repo commands.

## Intention of "equivalent" commands

filter-branch and filter-repo have different defaults, as highlighted
in the Basic Differences section below. As such, getting a command
which behaves identically is not possible. Also, sometimes the
filter-branch manpage lies, e.g. it says "suppose you want to...from
all commits" and then uses a command line like "git filter-branch
... HEAD", which only operates on commits in the current branch rather
than on all commits.

Rather than focusing on matching filter-branch output as exactly as
possible, I treat the filter-branch examples as idiomatic ways to
solve a certain type of problem with filter-branch, and express how
one would idiomatically solve the same problem in filter-repo.
Sometimes that means the results are not identical, but they are
largely the same in each case.

## Basic Differences

With `git filter-branch`, you have a git repository where every single
commit (within the branches or revisions you specify) is checked out
and then you run one or more shell commands to transform the working
copy into your desired end state.

With `git filter-repo`, you are essentially given an editing tool to
operate on the [fast-export](https://git-scm.com/docs/git-fast-export)
serialization of a repo. That means there is an input stream of all
the contents of the repository, and rather than specifying filters in
the form of commands to run, you usually employ a number of common
pre-defined filters that provide various ways to slice, dice, or
modify the repo based on its components (such as pathnames, file
content, user names or emails, etc.). That makes common operations
easier, even if it's not as versatile as shell callbacks. For cases
where more complexity or special casing is needed, filter-repo
provides Python callbacks that can operate on the data structures
populated from the fast-export stream to do just about anything you
want.

filter-branch defaults to working on a subset of the repository, and
requires you to specify a branch or branches, meaning you need to
specify `-- --all` to modify all commits. filter-repo by contrast
defaults to rewriting everything, and you need to specify
`--refs <rev-list-args>` if you want to limit to just a certain set of
branches or range of commits. (Though any `<rev-list-args>` that
begin with a hyphen are not accepted by filter-repo as they look like
the start of different options.)

filter-repo also takes care of additional concerns automatically, like
rewriting commit messages that reference old commit IDs to instead
reference the rewritten commit IDs, pruning commits which do not start
empty but become empty due to the specified filters, and automatically
shrinking and gc'ing the repo at the end of the filtering operation.

## Cheat Sheet: Conversion of Examples from the filter-branch manpage

### Removing a file

The filter-branch manual provided three different examples of removing
a single file, based on different levels of ease vs. carefulness and
performance:

```shell
git filter-branch --tree-filter 'rm filename' HEAD
```
```shell
git filter-branch --tree-filter 'rm -f filename' HEAD
```
```shell
git filter-branch --index-filter 'git rm --cached --ignore-unmatch filename' HEAD
```

All of these just become

```shell
git filter-repo --invert-paths --path filename
```

### Extracting a subdirectory

Extracting a subdirectory via

```shell
git filter-branch --subdirectory-filter foodir -- --all
```

is one of the easiest commands to convert; it just becomes

```shell
git filter-repo --subdirectory-filter foodir
```

### Moving the whole tree into a subdirectory

Keeping all files but placing them in a new subdirectory via

```shell
git filter-branch --index-filter \
    'git ls-files -s | sed "s-\t\"*-&newsubdir/-" |
        GIT_INDEX_FILE=$GIT_INDEX_FILE.new \
            git update-index --index-info &&
     mv "$GIT_INDEX_FILE.new" "$GIT_INDEX_FILE"' HEAD
```

(which happens to be GNU-specific and will fail with BSD userland in
very subtle ways) becomes

```shell
git filter-repo --to-subdirectory-filter newsubdir
```

(which works fine regardless of GNU vs BSD userland differences.)

### Re-grafting history

The filter-branch manual provided one example with three different
commands that could be used to achieve it, though the first of them
had limited applicability (only when the repo had a single initial
commit). These three commands were:

```shell
git filter-branch --parent-filter 'sed "s/^\$/-p <graft-id>/"' HEAD
```
```shell
git filter-branch --parent-filter \
    'test $GIT_COMMIT = <commit-id> && echo "-p <graft-id>" || cat' HEAD
```
```shell
git replace --graft $commit-id $graft-id
git filter-branch $graft-id..HEAD
```

git-replace did not exist when the original two examples were written,
but it is clear that the last example is far easier to understand. As
such, filter-repo just uses the same mechanism:

```shell
git replace --graft $commit-id $graft-id
git filter-repo --force
```

NOTE: --force should usually be avoided unless you have taken care to
make sure you have a backup of (or are running on a fresh clone of)
your repo. It is needed in this case because filter-repo errors out
when no arguments are specified, and because it usually first checks
whether you are in a fresh clone before irrecoverably rewriting your
repository (git-replace created a new graft and thus added something
to your previously fresh clone).

### Removing commits by a certain author

WARNING: This is a BAD example for BOTH filter-branch and filter-repo.
It does not remove the changes the user made from the repo, it just
removes the commit in question while smashing the changes from it into
any subsequent commits as though the subsequent authors had been
responsible for those changes as well. `git rebase` is likely to be a
better fit for what you really want if you are looking at this
example. (See also [this explanation of the differences between
rebase and
filter-repo](https://github.com/newren/git-filter-repo/issues/62#issuecomment-597725502).)

This filter-branch example

```shell
git filter-branch --commit-filter '
    if [ "$GIT_AUTHOR_NAME" = "Darl McBribe" ];
    then
        skip_commit "$@";
    else
        git commit-tree "$@";
    fi' HEAD
```

becomes

```shell
git filter-repo --commit-callback '
  if commit.author_name == b"Darl McBribe":
    commit.skip()
  '
```

### Rewriting commit messages -- removing text

Removing git-svn-id: lines from commit messages via

```shell
git filter-branch --msg-filter '
    sed -e "/^git-svn-id:/d"
'
```

becomes

```shell
git filter-repo --message-callback '
  return re.sub(b"^git-svn-id:.*\n", b"", message, flags=re.MULTILINE)
  '
```
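
To try the callback body outside of filter-repo, the same expression can be wrapped in an ordinary Python function. This is only a sketch for experimenting: the function name `strip_git_svn_id` and the sample commit message are hypothetical.

```python
import re


def strip_git_svn_id(message):
    """Same expression as the --message-callback body above, as a plain function."""
    return re.sub(b"^git-svn-id:.*\n", b"", message, flags=re.MULTILINE)


# Hypothetical commit message; the callback works on bytes, hence the b"" literal.
message = (
    b"Fix parser crash\n"
    b"\n"
    b"git-svn-id: https://svn.example.org/repo/trunk@42 0000-uuid\n"
)

print(strip_git_svn_id(message))
```

Because of `re.MULTILINE`, `^` matches at the start of every line, so the trailer is removed wherever it appears in the message and other lines are left alone.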

### Rewriting commit messages -- adding text

Adding Acked-by lines to the last ten commits via

```shell
git filter-branch --msg-filter '
    cat &&
    echo "Acked-by: Bugs Bunny <bunny@bugzilla.org>"
' master~10..master
```

becomes

```shell
git filter-repo --message-callback '
  return message + b"Acked-by: Bugs Bunny <bunny@bugzilla.org>\n"
  ' --refs master~10..master
```

### Changing author/committer(/tagger?) information

```shell
git filter-branch --env-filter '
    if test "$GIT_AUTHOR_EMAIL" = "root@localhost"
    then
        GIT_AUTHOR_EMAIL=john@example.com
    fi
    if test "$GIT_COMMITTER_EMAIL" = "root@localhost"
    then
        GIT_COMMITTER_EMAIL=john@example.com
    fi
' -- --all
```

becomes either

```shell
# Ensure '<john@example.com> <root@localhost>' is a line in .mailmap, then:
git filter-repo --use-mailmap
```

or

```shell
git filter-repo --email-callback '
  return email if email != b"root@localhost" else b"john@example.com"
  '
```

(and as a bonus this will fix tagger emails too, unlike the filter-branch one.)
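
The logic of the email callback is easy to check in isolation by wrapping it in a plain Python function. This is a sketch for experimenting, not part of filter-repo; the function name `fix_email` and the second sample address are hypothetical.

```python
def fix_email(email):
    """Same logic as the --email-callback body above, as a plain function."""
    return email if email != b"root@localhost" else b"john@example.com"


# Emails are bytes, matching the b"" literals used in the callback.
for email in [b"root@localhost", b"jane@example.com"]:
    print(email, fix_email(email))
```

Only the exact address `root@localhost` is rewritten; every other email passes through unchanged.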

### Restricting to a range

The partial examples

```shell
git filter-branch ... C..H
```
```shell
git filter-branch ... C..H ^D
```
```shell
git filter-branch ... D..H ^C
```

become

```shell
git filter-repo ... --refs C..H
```
```shell
git filter-repo ... --refs C..H ^D
```
```shell
git filter-repo ... --refs D..H ^C
```

Note that filter-branch accepts `--not` among the revision specifiers,
but to Python's argument parsing that looks like a flag name, which
breaks parsing. So, instead of e.g. `--not C` as we might use with
filter-branch, we can specify `^C` to filter-repo.
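
When converting an existing filter-branch command line, the `--not` rewrite can be done mechanically. The helper below is a hypothetical sketch, not something filter-repo provides. It relies on the rev-list rule that `--not` reverses the meaning of the `^` prefix for all following revisions up to the next `--not`, and it deliberately refuses ranges such as `A..B` after `--not` rather than guessing.

```python
def not_to_carets(rev_args):
    """Rewrite rev-list style arguments so that no `--not` remains.

    Only plain revisions and `^rev` exclusions are handled; a range
    after `--not` raises ValueError instead of being translated.
    """
    negate = False
    result = []
    for arg in rev_args:
        if arg == "--not":
            negate = not negate  # --not toggles until the next --not
        elif not negate:
            result.append(arg)
        elif ".." in arg:
            raise ValueError(f"cannot translate range after --not: {arg}")
        elif arg.startswith("^"):
            result.append(arg[1:])  # excluded becomes included
        else:
            result.append("^" + arg)  # included becomes excluded
    return result


print(not_to_carets(["H", "--not", "C", "D"]))
```

The resulting list can then be passed after `--refs`, since none of its entries begins with a hyphen.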