Quickly rewrite git repository history (filter-branch replacement)
Elijah Newren a38eb1c3a3 Rename t9302 to t9391 5 years ago

README.md

git repo-filter is intended to be a tool similar to git filter-branch for rewriting repository history. While filter-branch is relatively quick to learn and invoke and is relatively versatile, it has a few glaring deficiencies. repo-filter tries to copy filter-branch's good qualities, while bringing a significant performance boost and a different taste in usability.

Background

Why git-repo-filter?

None of the existing repository filtering tools do what I want. They're all good in their own way, but come up short for my needs. Here is what I want, in no particular order:

  1. [Starting report] Provide the user with an analysis of their repo to help them get started on what to prune or rename, instead of expecting them to guess or find other tools to figure it out. (Triggered, e.g., by running the first time with a special flag, such as --analyze.)

  2. [Keep vs. remove] Instead of just providing a way for users to easily remove selected paths, also provide flags for users to only keep certain paths. Sure, users could work around this by specifying to remove all paths other than the ones they want to keep, but the need to specify all paths that ever existed in any version of the repository could sometimes be quite painful. For filter-branch, using pipelines like git ls-files | grep -v ... | xargs -r git rm might be a reasonable workaround, but they can get unwieldy and aren't as straightforward for users.

  3. [Renaming] It should be easy to rename paths. For example, in addition to allowing one to treat some subdirectory as the root of the repository, also provide options for users to make the root of the repository just become a subdirectory. And more generally allow files and directories to be easily renamed. Provide sanity checks if renaming causes multiple files to exist at the same path. (And add special handling so that if a commit merely renamed oldname->newname, then filtering oldname->newname doesn't trigger the sanity check and die on that commit.)

  4. [More intelligent safety] Writing copies of the original refs to a special namespace within the repo does not provide a user-friendly recovery mechanism. Many would struggle to recover using that. Almost everyone I've ever seen do a repository filtering operation has done so with a fresh clone, because wiping out the clone in case of error is a vastly easier recovery mechanism. Strongly encourage that workflow by detecting and bailing if we're not in a fresh clone, unless the user overrides with --force. (Allow the old filter-branch workflow if a special --store-backup flag is provided.)

  5. [Auto shrink] Automatically remove old cruft and repack the repository for the user after filtering (unless overridden)

  6. [Clean separation] Avoid confusing users (and prevent accidental re-pushing of old stuff) due to mixing old repo and rewritten repo together. (This is particularly a problem with filter-branch when using the --tag-name-filter option, and sometimes also an issue when only filtering a subset of branches.)

  7. [Commit message consistency] If commit messages refer to other commits by ID (e.g. "this reverts commit 01234567890abcdef", "In commit 0013deadbeef9a..."), those commit messages should be rewritten to refer to the new commit IDs.

  8. [Empty pruning] Commits which become empty due to filtering should be pruned. That includes merge commits which become empty (e.g. when grabbing the history of a single directory that hasn't always existed within the repo; I don't want thousands of unrelated commits that pre-dated the introduction of that directory). However, I do not want commits which were already empty in the original repository to be pruned.

  9. [Speed] Filtering should be reasonably fast
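As a rough illustration of items 2 and 3 above, the sketch below applies a keep-list and a prefix rename to the paths of a single commit, and dies if two files would end up at the same path. The function name `filter_paths`, its parameters, and the simple prefix-matching rule are assumptions made up for this example; they are not repo-filter's actual interface. The special case of a commit that itself renamed oldname->newname is not handled here.

```python
def filter_paths(paths, keep_prefixes=None, renames=None):
    """Return a dict mapping each kept original path to its new path.

    paths         -- iterable of file paths present in one commit
    keep_prefixes -- if given, only paths under one of these prefixes survive
    renames       -- dict of {old_prefix: new_prefix}; the first match wins
    (All names here are hypothetical, for illustration only.)
    """
    renames = renames or {}
    result = {}
    seen = {}  # new path -> original path, used to detect collisions
    for path in paths:
        if keep_prefixes is not None and not any(
            path.startswith(prefix) for prefix in keep_prefixes
        ):
            continue  # path was not selected for keeping
        new_path = path
        for old, new in renames.items():
            if path.startswith(old):
                new_path = new + path[len(old):]
                break
        if new_path in seen:
            # Sanity check: two different files would land on the same path.
            raise ValueError(
                f"{seen[new_path]!r} and {path!r} both map to {new_path!r}"
            )
        seen[new_path] = path
        result[path] = new_path
    return result
```

With this sketch, `renames={"src/": ""}` treats a subdirectory as the root of the repository, while `renames={"": "sub/"}` moves the root into a subdirectory.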

Warnings: Not yet ready for external usage

This repository is still under heavy construction. Some caveats:

  • It will not work without a specially compiled version of git:
    • git clone --branch fast-export-import-improvements https://github.com/newren/git/
    • Build according to normal git.git build instructions. You can find 'em.
  • I have a list of known bugs, conveniently mostly tracked in my head. I'll fix that, but the fact that you're reading this sentence means I haven't yet.
  • Actually, there are a couple of exceptions to the bugs being tracked only in my head. In particular, the following bugs are tracked here:
    • Multiple unimplemented placeholder option flags exist. Just because it shows up in --help doesn't mean it does anything.
    • Usage instructions and examples at the end of this document are rather lacking.
    • Random debugging code or extraneous files might be checked in at any given time; I'll probably rewrite history to remove them...eventually.
  • I reserve the right to:
    • Rename the tool altogether (filter-repo to be like filter-branch?)
    • Rename or redefine any command line options
    • Rewrite the history of this repository at any time
  • and possibly more...but do you really need any more reasons than the above? This isn't ready for widespread use.

Why not $FAVORITE_COMPETITOR?

Here are some of the prominent competitors I know of, and why I think they don't meet my needs:

  • git_fast_filter.py:

    • This was actually the basis for repo-filter, though it required lots of additional work.
    • Was meant as a library more than a tool, and had too high of an activation energy.
    • empty commit pruning was not as thorough as it should have been
    • had no provision for commit message rewriting for commit message consistency.
    • missing lots of little conveniences
  • reposurgeon

    • focused on converting repositories between version control systems, and handles all the crazy impedance mismatches inherent in such conversions. I only care about rewriting history that starts in git and ends in git. If you care about converting between version control systems, though, reposurgeon is a much better tool.
    • might be general enough for other uses, but I can't find any documentation or examples on anything other than huge repository conversions between version control systems.
    • way too much effort for many simple repository rewrites that many users want to perform
  • BFG repo cleaner

    • Very focused on just removing crazy big files and sensitive data. Probably the best tool if that's all you want. But lacks the ability to handle anything outside this special (but important!) use case.
    • Has useful options for helping you remove the N biggest blobs, but nothing to help you know how big N should be.
    • Doesn't prune commits which become empty due to filtering; if you just want to extract a directory added 3 months ago and its history, you'd be stuck with years of commits touching other directories, all empty.
    • The refusal to rewrite HEAD, while it makes sense when trying to remove a few crazy big files and sensitive data (users tend to re-add and re-commit bad files if you don't manually remove them and have the users update), is totally misaligned with more general rewrite cases (e.g. the desire to turn a subdirectory into the root of a repository, or move the root of the repository into a subdirectory for merging into some other bigger repo).
    • Telling the user how to shrink the repo afterwards seems lame since that was the whole point; just do it for them by default.
  • git filter-branch

    • Fundamental design flaw causing it to be orders of magnitude slower than it should be for most repo rewriting jobs. So slow that it becomes a major usability impediment, if not a deal breaker. However, it is extremely versatile.
    • Generally quick for users to invoke (quick one-liners with lots of examples), just missing some useful capabilities like selecting wanted paths (as opposed to unwanted paths) and providing easier path renaming (also, e.g. no --to-subdirectory-filter as the opposite of --subdirectory-filter)
    • Doesn't rewrite commit hashes in commit messages, causing commit messages to refer to phantom commits instead.
    • Mixes old repository information (original tags, unrewritten branches) with new, risking re-pushing the old stuff
    • Lame defaults
      • --prune-empty should be default (although only commits which become empty, not ones which started empty)
      • lets users mess with repos which aren't a clean clone, without requiring an override
      • Makes it very difficult to actually get rid of unwanted objects and shrink the repository. The manpage has long multi-step instructions for this, which are incomplete when --tag-name-filter is in use.
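To make the point about commit hashes in commit messages concrete, here is a minimal sketch of how such references could be rewritten once a mapping from old to new commit IDs is known. The function `rewrite_commit_refs`, the `old_to_new` mapping, and the choice to treat any 7 to 40 character hex string as a candidate ID are all assumptions for this example, not a description of how repo-filter does it.

```python
import re

# Hex strings of 7 to 40 characters that look like (possibly abbreviated) commit IDs.
HASH_RE = re.compile(r"\b[0-9a-f]{7,40}\b")


def rewrite_commit_refs(message, old_to_new):
    """Replace old commit IDs mentioned in message with their new IDs.

    old_to_new maps full 40-character old IDs to full new IDs.
    An abbreviated ID is rewritten only when it is a prefix of exactly one
    old ID, and the replacement is abbreviated to the same length.
    (Hypothetical helper, for illustration only.)
    """
    def replace(match):
        ref = match.group(0)
        candidates = [old for old in old_to_new if old.startswith(ref)]
        if len(candidates) != 1:
            return ref  # unknown or ambiguous reference: leave it untouched
        return old_to_new[candidates[0]][:len(ref)]

    return HASH_RE.sub(replace, message)
```

The linear scan over `old_to_new` keeps the sketch short; a real tool working on a large history would presumably want a faster prefix lookup.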

Usage

Run git repo-filter --help and figure it out from there. Good luck.