forked from git/git
-
Notifications
You must be signed in to change notification settings - Fork 185
Batch prefetching #2089
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
newren
wants to merge
3
commits into
gitgitgadget:master
Choose a base branch
from
newren:batch-prefetching
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+321
−1
Open
Batch prefetching #2089
Changes from all commits
Commits
Show all changes
3 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -21,10 +21,12 @@ | |
| #include "color.h" | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Phillip Wood wrote on the Git mailing list (how to reply to this email): Hi Elijah
On 18/04/2026 01:32, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>
> > In partial clones, `git cherry` fetches necessary blobs on-demand one
> at a time, which can be very slow. We would like to prefetch all
> necessary blobs upfront. To do so, we need to be able to first figure
> out which blobs are needed.
"git rebase" without "--reapply-cherry-picks" suffers from this problem as well as it does the equivalent of "git log --cherry-pick". Is there any way to share prefetch_cherry_blobs() with the cherry-pick detection in revision.c?
Thanks
Phillip
> `git cherry` does its work in a two-phase approach: first computing
> header-only IDs (based on file paths and modes), then falling back to
> full content-based IDs only when header-only IDs collide -- or, more
> accurately, whenever the oidhash() of the header-only object_ids
> collide.
> > patch-ids.c handles this by creating an ids->patches hashmap that has
> all the data we need, but the problem is that any attempt to query the
> hashmap will invoke the patch_id_neq() function on any colliding objects,
> which causes the on-demand fetching.
> > Insert a new prefetch_cherry_blobs() function before checking for
> collisions. Use a temporary replacement on the ids->patches.cmpfn
> in order to enumerate the blobs that would be needed without yet
> fetching them, and then fetch them all at once, then restore the old
> ids->patches.cmpfn.
> > Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
> builtin/log.c | 125 ++++++++++++++++++++++++++++++++++++++++++++++
> t/t3500-cherry.sh | 18 +++++++
> 2 files changed, 143 insertions(+)
> > diff --git a/builtin/log.c b/builtin/log.c
> index 8c0939dd42..df19876be6 100644
> --- a/builtin/log.c
> +++ b/builtin/log.c
> @@ -21,10 +21,12 @@
> #include "color.h"
> #include "commit.h"
> #include "diff.h"
> +#include "diffcore.h"
> #include "diff-merges.h"
> #include "revision.h"
> #include "log-tree.h"
> #include "oid-array.h"
> +#include "oidset.h"
> #include "tag.h"
> #include "reflog-walk.h"
> #include "patch-ids.h"
> @@ -43,9 +45,11 @@
> #include "utf8.h"
> > #include "commit-reach.h"
> +#include "promisor-remote.h"
> #include "range-diff.h"
> #include "tmp-objdir.h"
> #include "tree.h"
> +#include "userdiff.h"
> #include "write-or-die.h"
> > #define MAIL_DEFAULT_WRAP 72
> @@ -2602,6 +2606,125 @@ static void print_commit(char sign, struct commit *commit, int verbose,
> }
> }
> > +/*
> + * Enumerate blob OIDs from a single commit's diff, inserting them into blobs.
> + * Skips files whose userdiff driver explicitly declares binary status
> + * (drv->binary > 0), since patch-ID uses oid_to_hex() for those and
> + * never reads blob content. Use userdiff_find_by_path() since
> + * diff_filespec_load_driver() is static in diff.c.
> + *
> + * Clean up with diff_queue_clear() (from diffcore.h).
> + */
> +static void collect_diff_blob_oids(struct commit *commit,
> + struct diff_options *opts,
> + struct oidset *blobs)
> +{
> + struct diff_queue_struct *q;
> +
> + /*
> + * Merge commits are filtered out by patch_id_defined() in patch-ids.c,
> + * so we'll never be called with one.
> + */
> + assert(!commit->parents || !commit->parents->next);
> +
> + if (commit->parents)
> + diff_tree_oid(&commit->parents->item->object.oid,
> + &commit->object.oid, "", opts);
> + else
> + diff_root_tree_oid(&commit->object.oid, "", opts);
> + diffcore_std(opts);
> +
> + q = &diff_queued_diff;
> + for (int i = 0; i < q->nr; i++) {
> + struct diff_filepair *p = q->queue[i];
> + struct userdiff_driver *drv;
> +
> + /* Skip binary files */
> + drv = userdiff_find_by_path(opts->repo->index, p->one->path);
> + if (drv && drv->binary > 0)
> + continue;
> +
> + if (DIFF_FILE_VALID(p->one))
> + oidset_insert(blobs, &p->one->oid);
> + if (DIFF_FILE_VALID(p->two))
> + oidset_insert(blobs, &p->two->oid);
> + }
> + diff_queue_clear(q);
> +}
> +
> +static int always_match(const void *cmp_data UNUSED,
> + const struct hashmap_entry *entry1 UNUSED,
> + const struct hashmap_entry *entry2 UNUSED,
> + const void *keydata UNUSED)
> +{
> + return 0;
> +}
> +
> +/*
> + * Prefetch blobs for git cherry in partial clones.
> + *
> + * Called between the revision walk (which builds the head-side
> + * commit list) and the has_commit_patch_id() comparison loop.
> + *
> + * Uses a cmpfn-swap trick to avoid reading blobs: temporarily
> + * replaces the hashmap's comparison function with a trivial
> + * always-match function, so hashmap_get()/hashmap_get_next() match
> + * any entry with the same oidhash bucket. These are the set of oids
> + * that would trigger patch_id_neq() during normal lookup and cause
> + * blobs to be read on demand, and we want to prefetch them all at
> + * once instead.
> + */
> +static void prefetch_cherry_blobs(struct repository *repo,
> + struct commit_list *list,
> + struct patch_ids *ids)
> +{
> + struct oidset blobs = OIDSET_INIT;
> + hashmap_cmp_fn original_cmpfn;
> +
> + /* Exit if we're not in a partial clone */
> + if (!repo_has_promisor_remote(repo))
> + return;
> +
> + /* Save original cmpfn, replace with always_match */
> + original_cmpfn = ids->patches.cmpfn;
> + ids->patches.cmpfn = always_match;
> +
> + /* Find header-only collisions, gather blobs from those commits */
> + for (struct commit_list *l = list; l; l = l->next) {
> + struct commit *c = l->item;
> + bool match_found = false;
> + for (struct patch_id *cur = patch_id_iter_first(c, ids);
> + cur;
> + cur = patch_id_iter_next(cur, ids)) {
> + match_found = true;
> + collect_diff_blob_oids(cur->commit, &ids->diffopts,
> + &blobs);
> + }
> + if (match_found)
> + collect_diff_blob_oids(c, &ids->diffopts, &blobs);
> + }
> +
> + /* Restore original cmpfn */
> + ids->patches.cmpfn = original_cmpfn;
> +
> + /* If we have any blobs to fetch, fetch them */
> + if (oidset_size(&blobs)) {
> + struct oid_array to_fetch = OID_ARRAY_INIT;
> + struct oidset_iter iter;
> + const struct object_id *oid;
> +
> + oidset_iter_init(&blobs, &iter);
> + while ((oid = oidset_iter_next(&iter)))
> + oid_array_append(&to_fetch, oid);
> +
> + promisor_remote_get_direct(repo, to_fetch.oid, to_fetch.nr);
> +
> + oid_array_clear(&to_fetch);
> + }
> +
> + oidset_clear(&blobs);
> +}
> +
> int cmd_cherry(int argc,
> const char **argv,
> const char *prefix,
> @@ -2673,6 +2796,8 @@ int cmd_cherry(int argc,
> commit_list_insert(commit, &list);
> }
> > + prefetch_cherry_blobs(the_repository, list, &ids);
> +
> for (struct commit_list *l = list; l; l = l->next) {
> char sign = '+';
> > diff --git a/t/t3500-cherry.sh b/t/t3500-cherry.sh
> index 78c3eac54b..17507d9a28 100755
> --- a/t/t3500-cherry.sh
> +++ b/t/t3500-cherry.sh
> @@ -78,4 +78,22 @@ test_expect_success 'cherry ignores whitespace' '
> test_cmp expect actual
> '
> > +# Reuse the expect file from the previous test, in a partial clone
> +test_expect_success 'cherry in partial clone does bulk prefetch' '
> + test_config uploadpack.allowfilter 1 &&
> + test_config uploadpack.allowanysha1inwant 1 &&
> + test_when_finished "rm -rf copy" &&
> +
> + git clone --bare --filter=blob:none file://"$(pwd)" copy &&
> + (
> + cd copy &&
> + GIT_TRACE2_EVENT="$(pwd)/trace.output" git cherry upstream-with-space feature-without-space >actual &&
> + test_cmp ../expect actual &&
> +
> + grep "child_start.*fetch.negotiationAlgorithm" trace.output >fetches &&
> + test_line_count = 1 fetches &&
> + test_trace2_data promisor fetch_count 4 <trace.output
> + )
> +'
> +
> test_done |
||
| #include "commit.h" | ||
| #include "diff.h" | ||
| #include "diffcore.h" | ||
| #include "diff-merges.h" | ||
| #include "revision.h" | ||
| #include "log-tree.h" | ||
| #include "oid-array.h" | ||
| #include "oidset.h" | ||
| #include "tag.h" | ||
| #include "reflog-walk.h" | ||
| #include "patch-ids.h" | ||
|
|
@@ -43,9 +45,11 @@ | |
| #include "utf8.h" | ||
|
|
||
| #include "commit-reach.h" | ||
| #include "promisor-remote.h" | ||
| #include "range-diff.h" | ||
| #include "tmp-objdir.h" | ||
| #include "tree.h" | ||
| #include "userdiff.h" | ||
| #include "write-or-die.h" | ||
|
|
||
| #define MAIL_DEFAULT_WRAP 72 | ||
|
|
@@ -2602,6 +2606,125 @@ static void print_commit(char sign, struct commit *commit, int verbose, | |
| } | ||
| } | ||
|
|
||
| /* | ||
| * Enumerate blob OIDs from a single commit's diff, inserting them into blobs. | ||
| * Skips files whose userdiff driver explicitly declares binary status | ||
| * (drv->binary > 0), since patch-ID uses oid_to_hex() for those and | ||
| * never reads blob content. Use userdiff_find_by_path() since | ||
| * diff_filespec_load_driver() is static in diff.c. | ||
| * | ||
| * Clean up with diff_queue_clear() (from diffcore.h). | ||
| */ | ||
| static void collect_diff_blob_oids(struct commit *commit, | ||
| struct diff_options *opts, | ||
| struct oidset *blobs) | ||
| { | ||
| struct diff_queue_struct *q; | ||
|
|
||
| /* | ||
| * Merge commits are filtered out by patch_id_defined() in patch-ids.c, | ||
| * so we'll never be called with one. | ||
| */ | ||
| assert(!commit->parents || !commit->parents->next); | ||
|
|
||
| if (commit->parents) | ||
| diff_tree_oid(&commit->parents->item->object.oid, | ||
| &commit->object.oid, "", opts); | ||
| else | ||
| diff_root_tree_oid(&commit->object.oid, "", opts); | ||
| diffcore_std(opts); | ||
|
|
||
| q = &diff_queued_diff; | ||
| for (int i = 0; i < q->nr; i++) { | ||
| struct diff_filepair *p = q->queue[i]; | ||
| struct userdiff_driver *drv; | ||
|
|
||
| /* Skip binary files */ | ||
| drv = userdiff_find_by_path(opts->repo->index, p->one->path); | ||
| if (drv && drv->binary > 0) | ||
| continue; | ||
|
|
||
| if (DIFF_FILE_VALID(p->one)) | ||
| oidset_insert(blobs, &p->one->oid); | ||
| if (DIFF_FILE_VALID(p->two)) | ||
| oidset_insert(blobs, &p->two->oid); | ||
| } | ||
| diff_queue_clear(q); | ||
| } | ||
|
|
||
| static int always_match(const void *cmp_data UNUSED, | ||
| const struct hashmap_entry *entry1 UNUSED, | ||
| const struct hashmap_entry *entry2 UNUSED, | ||
| const void *keydata UNUSED) | ||
| { | ||
| return 0; | ||
| } | ||
|
|
||
| /* | ||
| * Prefetch blobs for git cherry in partial clones. | ||
| * | ||
| * Called between the revision walk (which builds the head-side | ||
| * commit list) and the has_commit_patch_id() comparison loop. | ||
| * | ||
| * Uses a cmpfn-swap trick to avoid reading blobs: temporarily | ||
| * replaces the hashmap's comparison function with a trivial | ||
| * always-match function, so hashmap_get()/hashmap_get_next() match | ||
| * any entry with the same oidhash bucket. These are the set of oids | ||
| * that would trigger patch_id_neq() during normal lookup and cause | ||
| * blobs to be read on demand, and we want to prefetch them all at | ||
| * once instead. | ||
| */ | ||
| static void prefetch_cherry_blobs(struct repository *repo, | ||
| struct commit_list *list, | ||
| struct patch_ids *ids) | ||
| { | ||
| struct oidset blobs = OIDSET_INIT; | ||
| hashmap_cmp_fn original_cmpfn; | ||
|
|
||
| /* Exit if we're not in a partial clone */ | ||
| if (!repo_has_promisor_remote(repo)) | ||
| return; | ||
|
|
||
| /* Save original cmpfn, replace with always_match */ | ||
| original_cmpfn = ids->patches.cmpfn; | ||
| ids->patches.cmpfn = always_match; | ||
|
|
||
| /* Find header-only collisions, gather blobs from those commits */ | ||
| for (struct commit_list *l = list; l; l = l->next) { | ||
| struct commit *c = l->item; | ||
| bool match_found = false; | ||
| for (struct patch_id *cur = patch_id_iter_first(c, ids); | ||
| cur; | ||
| cur = patch_id_iter_next(cur, ids)) { | ||
| match_found = true; | ||
| collect_diff_blob_oids(cur->commit, &ids->diffopts, | ||
| &blobs); | ||
| } | ||
| if (match_found) | ||
| collect_diff_blob_oids(c, &ids->diffopts, &blobs); | ||
| } | ||
|
|
||
| /* Restore original cmpfn */ | ||
| ids->patches.cmpfn = original_cmpfn; | ||
|
|
||
| /* If we have any blobs to fetch, fetch them */ | ||
| if (oidset_size(&blobs)) { | ||
| struct oid_array to_fetch = OID_ARRAY_INIT; | ||
| struct oidset_iter iter; | ||
| const struct object_id *oid; | ||
|
|
||
| oidset_iter_init(&blobs, &iter); | ||
| while ((oid = oidset_iter_next(&iter))) | ||
| oid_array_append(&to_fetch, oid); | ||
|
|
||
| promisor_remote_get_direct(repo, to_fetch.oid, to_fetch.nr); | ||
|
|
||
| oid_array_clear(&to_fetch); | ||
| } | ||
|
|
||
| oidset_clear(&blobs); | ||
| } | ||
|
|
||
| int cmd_cherry(int argc, | ||
| const char **argv, | ||
| const char *prefix, | ||
|
|
@@ -2673,6 +2796,8 @@ int cmd_cherry(int argc, | |
| commit_list_insert(commit, &list); | ||
| } | ||
|
|
||
| prefetch_cherry_blobs(the_repository, list, &ids); | ||
|
|
||
| for (struct commit_list *l = list; l; l = l->next) { | ||
| char sign = '+'; | ||
|
|
||
|
|
||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.