320 Commits (main)

Author SHA1 Message Date
Elijah Newren cbc6535694 filter-repo: pass raw bytestring to regex compilation
Signed-off-by: Elijah Newren <newren@gmail.com>
2 years ago
Elijah Newren 4a416be87b Merge branch 'ri/mailmapping-empty-email-addresses'
Signed-off-by: Elijah Newren <newren@gmail.com>
2 years ago
Riley Iverson 91f16fd5ed
Correct mailmapping of empty email addresses
`not old_email` doesn't distinguish between `None` and an empty string,
causing old emails specified as `<>` to apply to every single commit.

Signed-off-by: Riley Iverson <blepabyte@proton.me>
2 years ago
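
The pitfall described in 91f16fd5ed can be shown in isolation. This is a minimal sketch, not filter-repo's own code: `matches_buggy` and `matches_fixed` are hypothetical helpers, and it assumes `None` stands for "the mailmap entry gives no old email" while `b""` stands for an entry written as `<>`.

```python
def matches_buggy(old_email, commit_email):
    # `not old_email` is True for both None and b"", so an entry written
    # as `<>` behaves like "no email constraint" and matches every commit.
    return not old_email or old_email == commit_email


def matches_fixed(old_email, commit_email):
    # Only a missing email (None) skips the comparison; b"" must match exactly.
    return old_email is None or old_email == commit_email


print(matches_buggy(b"", b"someone@example.com"))  # True
print(matches_fixed(b"", b"someone@example.com"))  # False
```

With the explicit `is None` test, an empty old email only matches commits whose email is itself empty.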
Markus Heidelberg 3fe2b5c3c9 filter-repo: prepend the header line to the "ref-map" file
The existence of a header has already been specified in the documentation.
Further adapt it to the real text implemented now.

Signed-off-by: Markus Heidelberg <markus.heidelberg@web.de>
2 years ago
Elijah Newren 0cd8a1fd39 filter-repo: fix blob count when analyzing
Reported-by: Li Linchao
Signed-off-by: Elijah Newren <newren@gmail.com>
3 years ago
Elijah Newren 933475ecf1 Make it clearer that --path* do not follow renames
The wording "exact paths" appears to not be clear enough for folks and I
keep repeatedly getting bug reports about filter-repo not following
renames.  Make it very explicit.

Signed-off-by: Elijah Newren <newren@gmail.com>
3 years ago
Elijah Newren 05e3548b67 Merge branch 'rnd/add-report-dir-option'
Signed-off-by: Elijah Newren <newren@gmail.com>
3 years ago
rndbit e9d5ab3529 filter-repo: add option --report-dir to set custom analysis dir
--analyze is hardcoded to write to a subdirectory inside GIT_DIR.

When practicing filtering runs on a large repo it is desirable to keep
an unchanged copy read-only to reduce chance of user error. It is
desirable to be able to analyze a read-only repo without having to clone
it. This would save a lot of time and space.

Add --report-dir option to set a non-default destination directory for
writing analysis output to.

Signed-off-by: rndbit <rndbit@filter.bitman.net>
[en: fixed existing regression test broken by now not overwriting the
     analysis directory unconditionally, and also added a new test of
     the new behavior for code coverage.]
Signed-off-by: Elijah Newren <newren@gmail.com>
3 years ago
rndbit 9cfe2b4090 filter-repo: fix detection of binary blobs for --replace-text
Detecting whether a blob is binary for the purpose of --replace-text always
fails, so text replacement is applied to all blobs. This changed with the
move to python3. With python2 the same code would still be wrong but
would manifest differently.

In the construct 'for x in b"..."' the x is
 - of type <int> in python3
 - of type <str> in python2
thus in python3 the condition 'x == b"\0"' can never be true for any x,
due to the type difference.

Further, the search was supposed to look for NUL byte and not 0
character, thus change to b"\0" instead of b"0".

Signed-off-by: rndbit <rndbit@filter.bitman.net>
3 years ago
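
The type difference described in 9cfe2b4090 is easy to reproduce under Python 3. The snippet below is an illustrative sketch only; `looks_binary` is a hypothetical helper and is not claimed to be the check filter-repo uses.

```python
data = b"abc\0def"

# Python 3: iterating over bytes yields ints, so comparing each element
# to the bytes object b"\0" never matches.
print(any(x == b"\0" for x in data))  # False

# Comparing against the integer 0 works, as does a containment test.
print(any(x == 0 for x in data))  # True
print(b"\0" in data)  # True

# b"0" is the ASCII digit zero (byte value 48), not the NUL byte.
print(b"0" in data)  # False


def looks_binary(blob: bytes) -> bool:
    """Hypothetical helper: treat a blob as binary if it contains a NUL byte."""
    return b"\0" in blob
```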
Elijah Newren d8e858aeca Merge branch 'sr/fix-file-used-in-version-calculation'
Signed-off-by: Elijah Newren <newren@gmail.com>
3 years ago
Gwyneth Morgan 129a3bcb8b filter-repo: add new --replace-message option
Like --replace-text, add an option --replace-message which replaces text
in commit/tag message bodies, so that users can easily replace text
without constructing a --message-callback.

Signed-off-by: Gwyneth Morgan <gwymor@tilde.club>
Signed-off-by: Elijah Newren <newren@gmail.com>
3 years ago
Stefano Rivera e7728c38ae Calculate the version from the module, not the entry_point
When git-filter-repo is installed, sys.argv[0] will be an entry-point
stub, not the relevant Python module.

Signed-off-by: Stefano Rivera <stefano@rivera.za.net>
3 years ago
Benjamin Motz 4ff15cd422 Use setup.py entry_points for installation
This should make the installation via pip more robust.

On Windows the usage of entry_points will install a wrapper executable
for the script that chooses the proper python executable. This
essentially makes the script run correctly when called via `git
filter-repo` (direct execution via `git-filter-repo` was already fine
before).

This fixes an issue on Windows, where the git-installation will choose a
different python executable than the one indicated by the installation
via `pip{x,3} install`.

Signed-off-by: Benjamin Motz <benjamin.motz@mailbox.org>
3 years ago
Elijah Newren 7ceb213f04 filter-repo: ensure we close files so they get written
It appears that python will usually write out files even if we do not
explicitly close them, but other tweaks to the code can make this not
happen.  Explicitly close the files to be safe.

Signed-off-by: Elijah Newren <newren@gmail.com>
3 years ago
Stefan Lietzau c9a9dcc886 filter-repo: ignore case for email address with mailmap
`git shortlog` ignores the case when matching the email address. As
such, `git filter-repo` should do the same.

Signed-off-by: Stefan Lietzau <lietzaustefan@gmail.com>
[en: fixed a small logic error, tweaked the commit message, and rebased]
Signed-off-by: Elijah Newren <newren@gmail.com>
3 years ago
Shezan Baig 5256c99e49 Allow callback body to be loaded from a file
For anything more complicated than a few lines, it's easier to write the
callback body in a file and let filter-repo load the file as a string.

Signed-off-by: Shezan Baig <sbaig1@bloomberg.net>
[en: added a testcase for code coverage]
Signed-off-by: Elijah Newren <newren@gmail.com>
3 years ago
Elijah Newren d2fdc89ff3 filter-repo: avoid depending on `wc` binary being present
rev-list already has --count option anyway, so piping output to wc -l to
count the number of lines was a total waste of time.  Plus, it might
cause failures for the testsuite on some Windows boxes.

Signed-off-by: Elijah Newren <newren@gmail.com>
3 years ago
Elijah Newren cf67ccd978 filter-repo: improve invalid repository error message
Even though the repository is encoded as a bytestring, we want error
messages to be UTF-8.

Signed-off-by: Elijah Newren <newren@gmail.com>
3 years ago
Elijah Newren cf84943982 Merge branch 'lk/path-rename-colon-count'
Signed-off-by: Elijah Newren <newren@gmail.com>
3 years ago
Lassi Kortela 28b479b79d Fix bug in --path-rename argument without colon
The --path-rename flag expected an argument with a colon
character (':') in it, which it assumed without checking. If the user
gave an argument with no colon in it, this backtrace would be shown:

  File "/usr/local/bin/git-filter-repo", line 1626, in __call__
    if values[0] and values[1] and not (
IndexError: list index out of range

Add a real error message in place of the backtrace.

Also check that there's exactly one colon; show an error message if
there's more than one, as that syntax has no interpretation that is
obviously the right one.

Signed-off-by: Lassi Kortela <lassi@lassi.io>
3 years ago
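
The validation described in 28b479b79d could look roughly like the following. This is a sketch under assumptions: `parse_path_rename` is a hypothetical name, the error text is invented for illustration, and a `ValueError` stands in for whatever error reporting the tool actually uses.

```python
def parse_path_rename(value: bytes):
    """Split an OLD:NEW argument, rejecting zero or multiple colons."""
    if value.count(b":") != 1:
        # Checking the count up front avoids the IndexError from values[1].
        raise ValueError("--path-rename expects exactly one colon, as in OLD:NEW")
    old, new = value.split(b":")
    return old, new


print(parse_path_rename(b"src/:lib/"))  # (b'src/', b'lib/')
```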
Elijah Newren 4987e0f6e3 filter-repo: fix --use-mailmap
--use-mailmap was defined as `--mailmap .mailmap` except that it would
set args.mailmap to ".mailmap" rather than b".mailmap" (in other words,
it accidentally set it to a string rather than a bytestring).  Since
the --mailmap parameter is always passed as a bytestring, we ran into
errors with calling unknown functions due to the type mismatch.

Signed-off-by: Elijah Newren <newren@gmail.com>
3 years ago
Elijah Newren 93ee4ae907 Merge branch 'mw/empty-author-name' into main
Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Martin Wilck 282f8ddb9b filter-repo: only set author from committer if author email not set
Some commits may have a valid author email, but no valid author name.
Old versions of git didn't enforce a non-empty name.
Setting the author data from the committer is wrong in this case.

Also add a test case for this to t9390.

Example: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c6295cdf656de63d6d1123def71daba6cd91939c

(en: replaced with a dedicated test instead of tweaking existing ones)

Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren 7eaaf191de filter-repo: correctly prune nested tags not matching filtering criteria
When the user specifies some kind of criteria to filter commits by (e.g.
--subdirectory-filter mysubdir), we rewrite parent commits that are
entirely filtered out to the most recent ancestor that still exists, or
just prune the parent if there isn't one.  That works great when the
parent is a commit, but nested tags have parents that are tags.  If we
only prune the first tag (i.e. the tag of a commit), then letting any
tags through that had that tag as a parent will result in a fast-import
crash with a message of the form

   fatal: mark :35390 not declared

Ensure that when a tag gets pruned, the pruning is recorded as such...so
that any children tags will get pruned as well.

Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren d79ea709b7 filter-repo: fix crash from assuming parent is an int
When filtering with --refs, parents can be a hash rather than an
integer.  There was a code path in RepoFilter._prunable() that was
written assuming the first parent would always be an integer; fix it to
handle a hash as well.

Reported-by: Niklas Hambüchen <mail@nh2.me>
Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren e4960a53f8 Fix undefined variable names
Reported-by: Christian Clauss <cclauss@me.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren cefeef1c0a filter-repo: use new --date-format=raw-permissive fast-import option
fast-import gained a new raw-permissive date format explicitly for
allowing people to import repositories as-is.  Make use of the flag, and
stop rewriting the bogus timezone found in rails.git.

If users do not like these bogus times, they can of course write a
filter to fix them (or even make them bogus in a different way).  For
example:

    git filter-repo ... --commit-callback '
      if commit.author_date.endswith(b"+051800"):
        commit.author_date = commit.author_date.replace(b"+051800", b"+0261")
    '

Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren 1e0c3ab3ae filter-repo: make fresh clone warning scarier
Apparently, despite the fact that *overwrite* *repo* *history* are three
important words that each individually convey a lot of important
meaning, people ignore it and instinctively add --force.  Insert the
word "destructively" to get people to pause.

Further, change the end of the warning so that, rather than explaining how
to get around the warning in the current repository, it suggests
operating on a fresh clone instead, and only then mentions as a side
comment that the --force flag can be used to override.

Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren 8abf8faec8 git-filter-repo.txt: be more forceful on the wording of --force
Online blogs/articles/Q&A as well as direct feedback suggests that
people use the --force flag rather cavalierly.  Add words like
"irreversible" and "immediate pruning" to discourage such blithe
application of this flag.  I hope this encourages folks to either learn
the ramifications of irreversible full-repository entire history
rewrites first, or to follow the recommendation of only operating on a
fresh clone.

Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren 38e70b69e8 filter-repo: ignore comment lines in --paths-from-file
Allow lines starting with '#' to be treated as a comment and be ignored.
Update the documentation to note that both blank lines and comment lines
are ignored, and mention how filenames starting with '#' can be matched
(namely, the same way that filenames starting with 'regex:', 'glob:',
or 'literal:' can be -- by prefixing the filename with 'literal:').

Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
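
The line handling described in 38e70b69e8 can be sketched as follows. `parse_paths_file` is a hypothetical function, not the tool's parser; it assumes unprefixed lines are literal paths and ignores any other syntax the real file format may support.

```python
def parse_paths_file(lines):
    """Yield (match_type, pattern) pairs, skipping blank and comment lines."""
    for raw in lines:
        line = raw.rstrip(b"\n")
        if not line or line.startswith(b"#"):
            continue  # blank line or comment
        for prefix in (b"regex:", b"glob:", b"literal:"):
            if line.startswith(prefix):
                yield prefix[:-1].decode(), line[len(prefix):]
                break
        else:
            # No prefix: assumed here to mean a literal path.
            yield "literal", line


entries = list(parse_paths_file([b"# comment\n", b"\n", b"literal:#notes\n"]))
print(entries)  # [('literal', b'#notes')]
```

Prefixing with `literal:` is what lets a filename that begins with `#` survive the comment check.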
Elijah Newren 771404d656 filter-repo: allow globs to match file or directory names
I added special code to filter-repo so that --path expressions could
match filenames or some leading directory name.  --path-regex, since it
does not implicitly add anchorings, can also match a leading path, and
can thus be used to match against directories.  --path-glob could not be
used to match a leading directory of a path, since fnmatch.fnmatch()
requires the full string to match.  But users like being able to specify
directory names, such as '*/bin', so let's take any glob expression and
treat it as two: '<glob>' and '<glob>/*' and try to match against either
one; this will allow it to match against file or directory names like
the other two types of path matching.

Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
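
The "treat one glob as two" idea from 771404d656 can be illustrated with the standard `fnmatch` module. `glob_matches` is a hypothetical helper written for this sketch, not the function filter-repo defines.

```python
import fnmatch


def glob_matches(path: bytes, glob: bytes) -> bool:
    """Match the full path against '<glob>' or against '<glob>/*'."""
    # fnmatch requires the whole string to match, so the second pattern is
    # what lets a glob such as '*/bin' select files beneath that directory.
    return fnmatch.fnmatch(path, glob) or fnmatch.fnmatch(path, glob + b"/*")


print(glob_matches(b"tools/bin", b"*/bin"))          # True
print(glob_matches(b"tools/bin/run.sh", b"*/bin"))   # True
print(glob_matches(b"tools/binary/x", b"*/bin"))     # False
```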
Elijah Newren 49d6f02ff8 filter-repo: clarify interactions between path filtering and path renaming
Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren 3e1bff264c Revert "filter-repo: fix ugly bug with mixing path filtering and renaming"
This reverts commit df6c8652a2.  The
motivating example was wrong; path renaming should not be involved in
path filtering, it only says how paths should be renamed if they happen
to be selected.  A subsequent commit will improve the documentation.

Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren caa05d15b4 filter-repo: make default replacement text a variable
Allow external scripts that import git-filter-repo to change the value
of the default replacement text instead of having it hardcoded within
some function.

Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren 31f00a9ff8 filter-repo: avoid applying --replace-text to binary files
--replace-text is meant to replace _text_ throughout the repository, not
binary data.  Use the same scheme as the lint-history script uses to
avoid applying the changes to binary blob data.

Reported-by: Tobias Gruetzmacher <tobias-git@23.gs>
Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren cdb7b77f07 filter-repo: repack with --source or --target
When using --source or --target in combination with filtering paths,
users were surprised at how large the resulting repository was.  The
usage of --source and --target were turning off repacking; while we
don't want repacking for partial history rewrites and --source and
--target turn on some of the other features we want with partial history
rewrites, repacking is something that we still want turned on.

Reported-by: Alexey Volkov <alexey.volkov@ark-kun.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren 7b18e6d7f5 filter-repo: fix --prune-degenerate=never with path filtering
When combining `--prune-degenerate never` with a `--path` specification,
we could end up trying to write a parent out to the fast-import stream
whose value was actually None.  The problem occurs when the parents of
a merge commit are filtered out by the path specification, leaving us
only with no-longer-extant parents.  In such a case, we need to filter
out these 'None' (i.e. invalid) parents.  The point of
`--prune-degenerate never` is to avoid removing parents that are either
the same as or an ancestor of another parent, not to avoid removing
non-existent parents.  Remove the non-existent parent(s).

Reported-by: Gaurav Kanoongo (@gauravkanoongo on GitHub)
Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
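
The step described in 7b18e6d7f5, dropping parents that no longer exist while leaving the rest untouched, amounts to something like this sketch. `drop_missing_parents` is a hypothetical name, and parents are assumed to be integers (marks) or `None`.

```python
def drop_missing_parents(parents):
    """Remove parents that were filtered away (None), preserving order."""
    # Duplicate or ancestor parents are deliberately kept; only None goes.
    return [p for p in parents if p is not None]


print(drop_missing_parents([None, 12, None, 12]))  # [12, 12]
```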
Elijah Newren df6c8652a2 filter-repo: fix ugly bug with mixing path filtering and renaming
There's also a fix in here to make sure to throw an error if users are
trying to rename paths and use --invert-paths; it's not clear at all
what that would even mean.  But that also becomes important later...

Due to the ability to either filter wanted paths (default), or to just
specify unwanted paths (with --invert-paths), I keep a special
args.inclusive variable to track whether a "match" means we want the
path or not.  There are some special cases, notably when there are no
filters present (meaning e.g. no --path specifications, at most there
are some --path-rename values provided).  When there are no filters
present, that means we should keep paths even if we don't "find a match"
against any of the filters.

Now, since the rename code was embedded in the same loop as the filter
checks, it unfortunately was also being checked against the
args.inclusive setting despite never setting whether it found a match.
That happened to work in the special case that there were no filtering
paths but only because of the special logic for that case.  Since
renaming only makes sense if --invert-paths is not specified, any path
we rename is one we always want to keep.  Make sure we do.

Reported-by: Nadège (@nagreme on GitHub)
Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren 0375758806 filter-repo: fix possible deadlock in sanity_check_args
I'm a little surprised that stdout buffers must have filled up on MacOS X, but
either way we don't have to wait for the '-h' processes to finish before
attempting to read stdout.  In fact, since we weren't storing the returncode
attribute from calling p.wait(), there wasn't much point in doing so.  Trying
to read all of stdout at once is going to implicitly take until the process
finishes anyway, so just do that.

Reported-by: Benoit Lefèvre <contact@benoit-lefevre.org>
Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
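
The deadlock pattern described in 0375758806 can be sketched with the standard `subprocess` module. The child command here is a stand-in chosen for illustration (a Python one-liner that writes a large amount of output), not the `-h` invocations the commit refers to.

```python
import subprocess
import sys

# Child process that writes more output than a typical pipe buffer holds.
cmd = [sys.executable, "-c", "import sys; sys.stdout.write('x' * 1_000_000)"]

# Risky pattern (left commented out): waiting first can block forever, because
# the child stalls once the pipe is full and nobody is reading from it.
#   p = subprocess.Popen(cmd, stdout=subprocess.PIPE)
#   p.wait()
#   output = p.stdout.read()

# Safer: read stdout to EOF, which implicitly lasts until the child is done
# writing, and only then collect the exit status.
p = subprocess.Popen(cmd, stdout=subprocess.PIPE)
output = p.stdout.read()
p.stdout.close()
returncode = p.wait()
print(len(output), returncode)
```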
Elijah Newren 15494bba8a filter-repo: make git version requirement error message more direct
Users won't know which versions of git have --mark-tags, --reencode, or
--combined-all-paths options for fast-export and diff-tree.  I didn't
either when I wrote those messages because it wasn't in a released
version of git.  Now that they are in released versions and have been
for a while, we can simplify the messages to just state which git
version is needed.

Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren 3dfaf3874e filter-repo: fix --no-local error when there is no remote
Commit 011c646ee8 (filter-repo: suggest --no-local when cloning local
repos, 2020-05-15) added an additional message to the error to make it
more clear what to do when cloning local repos.  However, if there was
no remote, then the code path would run os.path.isdir(None), triggering
a traceback.  Fix the logic.

Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren 4cfc765eb1 filter-repo: allow removing .git directories from history
Commit 7cfef09e9b (filter-repo: warn users who try to use invalid path
components, 2019-12-26) attempted to protect against using invalid path
components, but also added a check against a path that has sometimes
been valid in the past and which users might want to be able to remove
from their history.  Relax the check so that users can remove '.git'
directories in subdirectories (or even at the toplevel) from their
history.

Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren 2833ef275f filter-repo: throw an error if user specifies any path starting with a slash
All paths are intended to be relative paths, relative to the project
root, not to the filesystem root.  There have been a few people who
didn't understand this, and then ended up with fast-import crashes that
are not very clear.  Check for it early and throw a simple error message
instead.

Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
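
An early check of the kind described in 2833ef275f might look like the sketch below. `check_relative_path` and its error text are hypothetical, and a `ValueError` stands in for the tool's real error reporting.

```python
def check_relative_path(path: bytes) -> bytes:
    """Reject paths that start at the filesystem root."""
    if path.startswith(b"/"):
        raise ValueError(
            "paths are relative to the project root and must not start with '/'"
        )
    return path


print(check_relative_path(b"src/module.py"))  # b'src/module.py'
```

Failing here, before anything is sent to fast-import, gives the user a direct message instead of a crash further down the pipeline.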
Elijah Newren e834379254 filter-repo: clarify usage of --use-base-name
fast-export/fast-import only work with filenames (using full path from
the root of the repository); thus that's all that filter-repo works
with.  Full pathnames implicitly include all leading directories as part
of the pathname, which is what allows us to match against directories.
However, it obviously means --use-base-name can't be used to match paths
against directories.

Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren 7c877cd750 filter-repo: make --version more robust against modified shebangs
Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren e9c2d9adb5 filter-repo: ensure we write final newline after final progress update
We try to write 'Parsed %d commits' messages only after enough time has
passed to avoid writing to stdout becoming a bottleneck.  However, there
was a slight logic error that would cause it to only print the final
newline if there was a new message since the last progress update,
leaving a small race condition where we might miss it.

Reported-by: Valentyn Shtronda (@valiko-ua)
Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
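
The behavior fixed in e9c2d9adb5 can be illustrated with a small throttled progress writer. This is a sketch under assumptions, not filter-repo's class: `ProgressWriter`, its `interval` parameter, and the injectable `clock` are all invented for the example.

```python
import io
import time


class ProgressWriter:
    """Write progress messages at most once per `interval` seconds."""

    def __init__(self, out, interval=0.1, clock=time.monotonic):
        self._out = out
        self._interval = interval
        self._clock = clock
        self._last_print = None  # time of the most recent write, if any

    def show(self, msg):
        now = self._clock()
        if self._last_print is None or now - self._last_print >= self._interval:
            self._out.write("\r" + msg)
            self._last_print = now

    def finish(self, final_msg):
        # Always emit the final message and newline, whether or not the
        # throttle suppressed the most recent update.
        self._out.write("\r" + final_msg + "\n")


out = io.StringIO()
progress = ProgressWriter(out, interval=10, clock=lambda: 0.0)
progress.show("Parsed 1 commits")
progress.show("Parsed 2 commits")  # suppressed by the throttle
progress.finish("Parsed 2 commits")
print(repr(out.getvalue()))
```

Making `finish` unconditional is the design choice that matters: the final newline no longer depends on whether a new message arrived since the last update.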
Elijah Newren 011c646ee8 filter-repo: suggest --no-local when cloning local repos
Cloning local repos by default makes a bunch of hardlinks, giving you a
non-packed repository, and leading folks to use and suggest --force.
That, of course, bypasses the important fresh clone checks to prevent
people from accidentally and irrecoverably deleting their non-backed-up
data.  Let's make it easier for people to avoid (and suggest) that
mistake.

Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren c0c37a7656 filter-repo: fix bitrotted documentation links
Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren e11343e504 filter-repo: handle typechange modifications when first parent is pruned
Commit 509a624b (filter-repo: fix issue with pruning of empty commits,
2019-10-03) added code to get a new list of file changes when the first
parent was pruned.  However, this logic did not handle cases where one
of the file modifications was a typechange.  Add the necessary logic to
handle that case.

Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren 4f84a74ada filter-repo: use more expensive prunability checks when needed
When users are inserting new objects into the stream, we cannot make as
many assumptions and need to do more careful checks for whether commits
become empty or not.

Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago