Documentation: add guides for people converting from filter-branch or BFG

Signed-off-by: Elijah Newren <newren@gmail.com>
pull/101/head
Elijah Newren 4 years ago
parent 4cfc765eb1
commit 5c4637ff81

@ -0,0 +1,156 @@
# Cheat Sheet: Converting from BFG Repo Cleaner
This document is aimed at folks who are familiar with BFG Repo Cleaner
and want to learn how to convert over to using filter-repo.
## Table of Contents
* [Half-hearted conversions](#half-hearted-conversions)
* [Intention of "equivalent" commands](#intention-of-equivalent-commands)
* [Basic Differences](#basic-differences)
* [Cheat Sheet: Conversion of Examples from BFG](#cheat-sheet-conversion-of-examples-from-bfg)
## Half-hearted conversions
You can switch most any BFG command to use filter-repo under the
covers by just replacing the `java -jar bfg.jar` part of the command
with [`bfg-ish`](../contrib/filter-repo-demos/bfg-ish).
bfg-ish is a reasonable tool, and provides a number of bug fixes and
features on top of bfg, but most of my focus is naturally on
filter-repo, which has a number of capabilities lacking in bfg-ish.
## Intention of "equivalent" commands
BFG and filter-repo have a few differences, highlighted in the Basic
Differences section below, that make it hard to get commands that
behave identically. Rather than focusing on matching BFG output as
exactly as possible, I treat the BFG examples as idiomatic ways to
solve a certain type of problem with BFG, and express how one would
idiomatically solve the same problem in filter-repo. Sometimes that
means the results are not identical, but they are largely the same in
each case.
## Basic Differences
BFG operates directly on tree objects, which have no notion of their
leading path. Thus, it has no way of differentiating between
'README.md' at the toplevel versus in some subdirectory. You simply
operate on the basename of files and directories. This precludes
doing things like renaming files and directories or other bigger
restructures. By directly operating on trees, it also runs into
problems with loose vs. packed objects, loose vs. packed refs, not
understanding replace refs or grafts, and not understanding the index
and working tree as another data source.
With `git filter-repo`, you are essentially given an editing tool to
operate on the [fast-export](https://git-scm.com/docs/git-fast-export)
serialization of a repo, which operates on filenames including their
full paths from the toplevel of the repo. Directories are not
separately specified, so any directory-related filtering is done by
checking the leading path of each file. Further, you aren't limited
to the pre-defined filtering types: Python callbacks that operate on
the data structures from the fast-export stream can be provided to do
just about anything you want. By leveraging fast-export and
fast-import, filter-repo gains automatic handling of objects and refs
whether they are packed or not, automatic handling of replace refs and
grafts, and future features that may appear. It also tries hard to
provide a full rewrite solution, so it takes care of additional
important concerns such as updating the index and working tree and
running an automatic gc for the user afterwards.
The "protection" and "privacy" defaults in BFG are something I
fundamentally disagreed with for a variety of reasons; see the
comments at the top of the
[bfg-ish](../contrib/filter-repo-demos/bfg-ish) script if you want
details. The bfg-ish script implemented these protection and privacy
options since it was designed to act like BFG, but still flipped the
default to the opposite of what BFG chose. This means a number of
things with filter-repo:
* any filters you specify will also be applied to HEAD, so that you
  don't have a weird disconnect from your history transformations
  only being applied to most commits
* `[formerly OLDHASH]` references are not munged into commit
  messages; the replace refs that filter-repo adds are a much
  cleaner way of looking up commits by old commit hashes.
* `Former-commit-id:` footers are not added to commit messages; the
  replace refs that filter-repo adds are a much cleaner way of
  looking up commits by old commit hashes.
* History is not littered with `<filename>.REMOVED.git-id` files.
BFG expects you to specify the repository to rewrite as its final
argument, whereas filter-repo expects you to cd into the repo and then
run filter-repo.
## Cheat Sheet: Conversion of Examples from BFG
### Stripping big blobs
```shell
java -jar bfg.jar --strip-blobs-bigger-than 100M some-big-repo.git
```
becomes
```shell
git filter-repo --strip-blobs-bigger-than 100M
```
### Deleting files
```shell
java -jar bfg.jar --delete-files id_{dsa,rsa} my-repo.git
```
becomes
```shell
git filter-repo --use-base-names --path id_dsa --path id_rsa --invert-paths
```
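
To make the effect of matching on basenames concrete, here is a minimal Python sketch of the idea behind the command above. It is only an illustration: the function `keep_path` and the sample paths are hypothetical, and this is not how filter-repo is actually implemented.

```python
import os

# Hypothetical set of basenames to drop, mirroring
# `--use-base-names --path id_dsa --path id_rsa --invert-paths`.
UNWANTED_BASENAMES = {b"id_dsa", b"id_rsa"}

def keep_path(full_path: bytes) -> bool:
    """Return True if a file should survive the filter.

    Only the final path component is compared, so a file is dropped
    no matter which directory it lives in.
    """
    return os.path.basename(full_path) not in UNWANTED_BASENAMES

# Sample paths (made up for this sketch).
paths = [b"id_rsa", b"home/alice/.ssh/id_dsa", b"docs/id_rsa.md", b"README.md"]
kept = [p for p in paths if keep_path(p)]
```

Here `kept` contains only `docs/id_rsa.md` and `README.md`: the comparison is against the whole basename, so `id_rsa.md` is not affected.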
### Removing sensitive content
```shell
java -jar bfg.jar --replace-text passwords.txt my-repo.git
```
becomes
```shell
git filter-repo --replace-text passwords.txt
```
The `--replace-text` option was a really clever idea that the BFG came
up with and I just implemented mostly as-is within filter-repo. Sadly,
BFG didn't document the format of files passed to `--replace-text` very
well, but I added more detail in the filter-repo documentation.
There is one small but important difference between the two tools: if
you use both "regex:" and "==>" on a single line to specify a regex
search and replace, then filter-repo will use "\1", "\2", "\3",
etc. for replacement strings whereas BFG used "$1", "$2", "$3", etc.
The reason for this difference is simply that Python uses backslashes
for group references in its regex replacement strings while Scala uses
dollar signs, and both tools just pass the strings along unmodified to the underlying
language. (Since bfg-ish attempts to emulate the BFG, it accepts
"$1", "$2" and so forth and translates them to "\1", "\2", etc. so
that filter-repo/python will understand it.)
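
The difference in group references can be sketched in plain Python. The helper names `translate_bfg_groups` and `apply_regex_rule` and the sample rule are hypothetical, the rule parsing is deliberately simplified, and the sketch works on `str` for readability; it is not the code that bfg-ish or filter-repo actually run.

```python
import re

def translate_bfg_groups(replacement: str) -> str:
    """Turn BFG-style "$1" group references into Python-style "\\1"."""
    # In the replacement template, "\\\\" emits one literal backslash
    # and "\\1" re-emits the captured digits.
    return re.sub(r"\$(\d+)", r"\\\1", replacement)

def apply_regex_rule(rule: str, text: str) -> str:
    """Apply one 'regex:PATTERN==>REPLACEMENT' rule to text (simplified)."""
    if not rule.startswith("regex:"):
        raise ValueError("this sketch only handles regex: rules")
    pattern, replacement = rule[len("regex:"):].split("==>", 1)
    return re.sub(pattern, replacement, text)

# A rule written the way Python expects it.
python_rule = r"regex:(token)=\w+==>\1=REDACTED"
redacted = apply_regex_rule(python_rule, "token=abc123")

# The same replacement written BFG-style, then translated.
translated = translate_bfg_groups("$1=REDACTED")
```

After running this, `redacted` is `token=REDACTED` and `translated` is the string `\1=REDACTED`, which is the form Python's `re.sub` understands.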
### Removing files and folders with a certain name
```shell
java -jar bfg.jar --delete-folders .git --delete-files .git --no-blob-protection my-repo.git
```
becomes
```shell
git filter-repo --invert-paths --path-glob '*/.git' --path .git
```
Yes, that glob will handle .git directories one or more directories
deep; it's a git-style glob rather than a shell-style glob. Also, the
`--path .git` was added because `--path-glob '*/.git'` won't match a
directory named .git in the toplevel directory since it has a '/'
character in the glob expression (though I would hope the repository
doesn't have a tracked .git toplevel directory in its history).
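
The two claims about the glob can be checked with a short Python sketch. It uses the standard `fnmatch` module as a stand-in for the glob matching described above; that the two agree on these cases is an assumption of the sketch, and the sample paths are made up.

```python
from fnmatch import fnmatchcase

pattern = "*/.git"

# "*" is not stopped by "/" here, so any depth of leading directories matches.
one_deep = fnmatchcase("vendor/.git", pattern)
two_deep = fnmatchcase("vendor/lib/.git", pattern)

# The pattern requires a "/" before ".git", so a toplevel entry does not match.
toplevel = fnmatchcase(".git", pattern)
```

Both `one_deep` and `two_deep` are `True`, while `toplevel` is `False`, which is why the separate `--path .git` is needed to cover the toplevel case.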

@ -0,0 +1,310 @@
# Cheat Sheet: Converting from filter-branch
This document is aimed at folks who are familiar with filter-branch and want
to learn how to convert over to using filter-repo.
## Table of Contents
* [Half-hearted conversions](#half-hearted-conversions)
* [Intention of "equivalent" commands](#intention-of-equivalent-commands)
* [Basic Differences](#basic-differences)
* [Cheat Sheet: Conversion of Examples from the filter-branch manpage](#cheat-sheet-conversion-of-examples-from-the-filter-branch-manpage)
## Half-hearted conversions
You can switch nearly any `git filter-branch` command to use
filter-repo under the covers by just replacing the `git filter-branch`
part of the command with
[`filter-lamely`](../contrib/filter-repo-demos/filter-lamely). The
git.git regression testsuite passes when I swap out the filter-branch
script with filter-lamely, for example. (However, the filter-branch
tests are not very comprehensive, so don't rely on that too much.)
Doing a half-hearted conversion has nearly all of the drawbacks of
filter-branch and nearly none of the benefits of filter-repo, but it
will make your command run a few times faster and makes for a very
simple conversion.
You'll get a lot more performance, safety, and features by just
switching to direct filter-repo commands.
## Intention of "equivalent" commands
filter-branch and filter-repo have different defaults, as highlighted
in the Basic Differences section below. As such, getting a command
which behaves identically is not possible. Also, sometimes the
filter-branch manpage lies, e.g. it says "suppose you want to...from
all commits" and then uses a command line like "git filter-branch
... HEAD", which only operates on commits in the current branch rather
than on all commits.
Rather than focusing on matching filter-branch output as exactly as
possible, I treat the filter-branch examples as idiomatic ways to
solve a certain type of problem with filter-branch, and express how
one would idiomatically solve the same problem in filter-repo.
Sometimes that means the results are not identical, but they are
largely the same in each case.
## Basic Differences
With `git filter-branch`, you have a git repository where every single
commit (within the branches or revisions you specify) is checked out
and then you run one or more shell commands to transform the working
copy into your desired end state.
With `git filter-repo`, you are essentially given an editing tool to
operate on the [fast-export](https://git-scm.com/docs/git-fast-export)
serialization of a repo. That means there is an input stream of all
the contents of the repository, and rather than specifying filters in
the form of commands to run, you usually employ a number of common
pre-defined filters that provide various ways to slice, dice, or
modify the repo based on its components (such as pathnames, file
content, user names or emails, etc.). That makes common operations
easier, even if it's not as versatile as shell callbacks. For cases
where more complexity or special casing is needed, filter-repo
provides python callbacks that can operate on the data structures
populated from the fast-export stream to do just about anything you
want.
filter-branch defaults to working on a subset of the repository, and
requires you to specify a branch or branches, meaning you need to
specify `-- --all` to modify all commits. filter-repo by contrast
defaults to rewriting everything, and you need to specify `--refs
<rev-list-args>` if you want to limit to just a certain set of
branches or range of commits. (Though any `<rev-list-args>` that
begin with a hyphen are not accepted by filter-repo as they look like
the start of different options.)
filter-repo also takes care of additional concerns automatically, like
rewriting commit messages that reference old commit IDs to instead
reference the rewritten commit IDs, pruning commits which do not start
empty but become empty due to the specified filters, and automatically
shrinking and gc'ing the repo at the end of the filtering operation.
## Cheat Sheet: Conversion of Examples from the filter-branch manpage
### Removing a file
The filter-branch manual provided three different examples of removing
a single file, based on different levels of ease vs. carefulness and
performance:
```shell
git filter-branch --tree-filter 'rm filename' HEAD
```
```shell
git filter-branch --tree-filter 'rm -f filename' HEAD
```
```shell
git filter-branch --index-filter 'git rm --cached --ignore-unmatch filename' HEAD
```
All of these just become
```shell
git filter-repo --invert-paths --path filename
```
### Extracting a subdirectory
Extracting a subdirectory via
```shell
git filter-branch --subdirectory-filter foodir -- --all
```
is one of the easiest commands to convert; it just becomes
```shell
git filter-repo --subdirectory-filter foodir
```
### Moving the whole tree into a subdirectory
Keeping all files but placing them in a new subdirectory via
```shell
git filter-branch --index-filter \
'git ls-files -s | sed "s-\t\"*-&newsubdir/-" |
GIT_INDEX_FILE=$GIT_INDEX_FILE.new \
git update-index --index-info &&
mv "$GIT_INDEX_FILE.new" "$GIT_INDEX_FILE"' HEAD
```
(which happens to be GNU-specific and will fail with BSD userland in
very subtle ways) becomes
```shell
git filter-repo --to-subdirectory-filter newsubdir
```
(which works fine regardless of GNU vs BSD userland differences.)
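
Conceptually, moving the whole tree into a subdirectory is just a per-path rename, which is why no `sed` pipeline is needed once paths are available as plain values. A minimal Python sketch of that rename follows; the function `to_subdirectory` is hypothetical and only illustrates the idea, not filter-repo's implementation.

```python
def to_subdirectory(filename: bytes, subdir: bytes = b"newsubdir") -> bytes:
    """Prefix a repository-relative path with a new top-level directory."""
    # Strip any trailing slash so we never produce "newsubdir//file".
    return subdir.rstrip(b"/") + b"/" + filename

# Sample paths (made up for this sketch).
moved = [to_subdirectory(p) for p in (b"README.md", b"src/main.c")]
```

Here `moved` is `[b"newsubdir/README.md", b"newsubdir/src/main.c"]`; no quoting or tab handling is involved, so nothing depends on GNU or BSD tool behavior.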
### Re-grafting history
The filter-branch manual provided one example with three different
commands that could be used to achieve it, though the first of them
had limited applicability (only when the repo had a single initial
commit). These three examples were:
```shell
git filter-branch --parent-filter 'sed "s/^\$/-p <graft-id>/"' HEAD
```
```shell
git filter-branch --parent-filter \
'test $GIT_COMMIT = <commit-id> && echo "-p <graft-id>" || cat' HEAD
```
```shell
git replace --graft $commit-id $graft-id
git filter-branch $graft-id..HEAD
```
git-replace did not exist when the original two examples were written,
but it is clear that the last example is far easier to understand. As
such, filter-repo just uses the same mechanism:
```shell
git replace --graft $commit-id $graft-id
git filter-repo --force
```
NOTE: `--force` should usually be avoided unless you have made sure
you have a backup of your repo (or are running on a fresh clone of
it). It is needed in this case because filter-repo errors out when
no arguments are specified, and because it usually first checks
whether you are in a fresh clone before irreversibly rewriting your
repository (git-replace created a new graft and thus added something
to your previously fresh clone).
### Removing commits by a certain author
WARNING: This is a BAD example for BOTH filter-branch and filter-repo.
It does not remove the changes the user made from the repo, it just
removes the commit in question while smashing the changes from it into
any subsequent commits as though the subsequent authors had been
responsible for those changes as well. `git rebase` is likely to be a
better fit for what you really want if you are looking at this
example. (See also [this explanation of the differences between
rebase and
filter-repo](https://github.com/newren/git-filter-repo/issues/62#issuecomment-597725502))
This filter-branch example
```shell
git filter-branch --commit-filter '
if [ "$GIT_AUTHOR_NAME" = "Darl McBribe" ];
then
skip_commit "$@";
else
git commit-tree "$@";
fi' HEAD
```
becomes
```shell
git filter-repo --commit-callback '
  if commit.author_name == b"Darl McBribe":
    commit.skip()
'
```
### Rewriting commit messages -- removing text
Removing git-svn-id: lines from commit messages via
```shell
git filter-branch --msg-filter '
sed -e "/^git-svn-id:/d"
'
```
becomes
```shell
git filter-repo --message-callback '
return re.sub(b"^git-svn-id:.*\n", b"", message, flags=re.MULTILINE)
'
```
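
The body of that callback is ordinary Python operating on bytes, so it can be tried outside of filter-repo. The sketch below wraps the same `re.sub` call in a hypothetical function `strip_git_svn_id` and runs it on a made-up commit message.

```python
import re

def strip_git_svn_id(message: bytes) -> bytes:
    """Drop every line that starts with 'git-svn-id:' from a commit message."""
    # re.MULTILINE makes "^" match at the start of each line, and ".*"
    # stops at the end of the line because "." does not match "\n".
    return re.sub(b"^git-svn-id:.*\n", b"", message, flags=re.MULTILINE)

# Made-up commit message for illustration.
original = (
    b"Fix off-by-one in parser\n"
    b"\n"
    b"git-svn-id: https://svn.example.org/repo/trunk@123 0000-0000\n"
)
cleaned = strip_git_svn_id(original)
```

Here `cleaned` is `b"Fix off-by-one in parser\n\n"`. Note that the pattern requires a trailing newline, so a `git-svn-id:` line at the very end of a message that lacks a final newline would be left in place.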
### Rewriting commit messages -- adding text
Adding Acked-by lines to the last ten commits via
```shell
git filter-branch --msg-filter '
cat &&
echo "Acked-by: Bugs Bunny <bunny@bugzilla.org>"
' master~10..master
```
becomes
```shell
git filter-repo --message-callback '
return message + b"Acked-by: Bugs Bunny <bunny@bugzilla.org>\n"
' --refs master~10..master
```
### Changing author/committer(/tagger?) information
```shell
git filter-branch --env-filter '
if test "$GIT_AUTHOR_EMAIL" = "root@localhost"
then
GIT_AUTHOR_EMAIL=john@example.com
fi
if test "$GIT_COMMITTER_EMAIL" = "root@localhost"
then
GIT_COMMITTER_EMAIL=john@example.com
fi
' -- --all
```
becomes either
```shell
# Ensure '<john@example.com> <root@localhost>' is a line in .mailmap, then:
git filter-repo --use-mailmap
```
or
```shell
git filter-repo --email-callback '
return email if email != b"root@localhost" else b"john@example.com"
'
```
(and as a bonus, this will fix tagger emails too, unlike the filter-branch one).
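
The email rewrite is a plain bytes-to-bytes function, so the same logic can be applied uniformly to every kind of identity. A small Python sketch, in which the function `fix_email` and the sample identities (including `jane@example.com`) are hypothetical:

```python
def fix_email(email: bytes) -> bytes:
    """Map the placeholder address to the real one; leave others untouched."""
    return email if email != b"root@localhost" else b"john@example.com"

# Made-up identities; one function covers authors, committers, and taggers.
identities = {
    "author": b"root@localhost",
    "committer": b"jane@example.com",
    "tagger": b"root@localhost",
}
fixed = {role: fix_email(addr) for role, addr in identities.items()}
```

After this, the author and tagger entries in `fixed` are `b"john@example.com"` and the committer entry is unchanged.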
### Restricting to a range
The partial examples
```shell
git filter-branch ... C..H
```
```shell
git filter-branch ... C..H ^D
```
```shell
git filter-branch ... D..H ^C
```
become
```shell
git filter-repo ... --refs C..H
```
```shell
git filter-repo ... --refs C..H ^D
```
```shell
git filter-repo ... --refs D..H ^C
```
Note that filter-branch accepts `--not` among the revision specifiers,
but to Python's argument parsing that looks like an option name, which
breaks parsing. So, instead of e.g. `--not C` as we might use with
filter-branch, we can specify `^C` to filter-repo.
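
The parsing problem can be reproduced with a toy `argparse` parser. This is a sketch under the assumption that `--refs` is declared as an option taking one or more values; the parser below is hypothetical and is not filter-repo's real option table.

```python
import argparse

# Toy parser with a single multi-valued option.
parser = argparse.ArgumentParser(prog="toy-filter")
parser.add_argument("--refs", nargs="+")

# "^D" does not start with a hyphen, so it is consumed as a value of --refs.
with_caret, extras_caret = parser.parse_known_args(["--refs", "C..H", "^D"])

# "--not" looks like another option, so --refs stops consuming values there
# and the remaining words are left over as unrecognized arguments.
with_not, extras_not = parser.parse_known_args(["--refs", "C..H", "--not", "D"])
```

Here `with_caret.refs` is `["C..H", "^D"]` with nothing left over, whereas `with_not.refs` is only `["C..H"]` and `extras_not` is `["--not", "D"]`; a parser using `parse_args` would exit with an error on those leftovers.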

@ -100,6 +100,11 @@ but some highlights for the main competitors:
more performant (though not nearly as fast or safe as
filter-repo).
* a [cheat
sheet](Documentation/converting-from-filter-branch.md#cheat-sheet-conversion-of-examples-from-the-filter-branch-manpage)
is available showing how to convert example commands from the manual of
filter-branch into filter-repo commands.
## BFG Repo Cleaner
* great tool for its time, but while it makes some things simple, it
@ -116,6 +121,11 @@ but some highlights for the main competitors:
based on filter-repo which includes several new features and bugfixes
relative to bfg.
* a [cheat
sheet](Documentation/converting-from-bfg-repo-cleaner.md#cheat-sheet-conversion-of-examples-from-bfg)
is available showing how to convert example commands from the manual of
BFG Repo Cleaner into filter-repo commands.
# Simple example, with comparisons
Let's say that we want to extract a piece of a repository, with the intent
