Batch prefetching by newren · Pull Request #2089 · gitgitgadget/git

newren · 2026-04-15T19:20:52Z

Changes since v1:

Remove stray file that should have never been added. So embarrassing that I didn't catch that before submitting.

Partial clones provide a trade-off for users: avoid downloading blobs upfront, at the expense of needing to download them later as they run other commands. This tradeoff can sometimes incur a more severe cost than expected, particularly if needed blobs are discovered as they are accessed, resulting in downloading blobs one at a time. Some commands like checkout, diff, and merge do batch prefetches of necessary blobs, since that can dramatically reduce the pain of on-demand loading. Extend this ability to two more commands: cherry and grep.

This series was spurred by a report where git cherry jobs were each doing hundreds of single-blob fetches, at a cost of 3s each. Batching those downloads should dramatically speed up their jobs. (And I decided to fix up git grep similarly while at it.)

I'll also note that git backfill with revisions and/or pathspecs could also improve things for these users, but since backfill is a manual command users would have to run and requires users to try to figure out which data is needed (a challenge in the case of cherry), it still makes sense to provide smarter behavior for folks who don't choose to manually run backfill.

Also, correct a documentation typo I noticed in patch-ids.h (related to code I was using for the git cherry fixes) as a preparatory fixup.

cc: Elijah Newren newren@gmail.com
cc: Phillip Wood phillip.wood123@gmail.com

newren · 2026-04-16T22:47:13Z

/submit

gitgitgadget · 2026-04-16T22:48:43Z

Submitted as pull.2089.git.1776379694.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-2089/newren/batch-prefetching-v1

To fetch this version to local tag pr-2089/newren/batch-prefetching-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-2089/newren/batch-prefetching-v1

Signed-off-by: Elijah Newren <newren@gmail.com>

In partial clones, `git cherry` fetches necessary blobs on-demand one at a time, which can be very slow. We would like to prefetch all necessary blobs upfront. To do so, we need to be able to first figure out which blobs are needed. `git cherry` does its work in a two-phase approach: first computing header-only IDs (based on file paths and modes), then falling back to full content-based IDs only when header-only IDs collide -- or, more accurately, whenever the oidhash() of the header-only object_ids collide. patch-ids.c handles this by creating an ids->patches hashmap that has all the data we need, but the problem is that any attempt to query the hashmap will invoke the patch_id_neq() function on any colliding objects, which causes the on-demand fetching. Insert a new prefetch_cherry_blobs() function before checking for collisions. Use a temporary replacement on the ids->patches.cmpfn in order to enumerate the blobs that would be needed without yet fetching them, and then fetch them all at once, then restore the old ids->patches.cmpfn. Signed-off-by: Elijah Newren <newren@gmail.com>

In partial clones, `git grep` fetches necessary blobs on-demand one at a time, which can be very slow. In partial clones, add an extra preliminary walk over the tree similar to grep_tree() which collects the blobs of interest, and then prefetches them. Signed-off-by: Elijah Newren <newren@gmail.com>

gitgitgadget · 2026-04-17T22:39:52Z

User Elijah Newren <newren@gmail.com> has been added to the cc: list.

newren · 2026-04-18T00:31:27Z

/submit

gitgitgadget · 2026-04-18T00:32:56Z

Submitted as pull.2089.v2.git.1776472347.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-2089/newren/batch-prefetching-v2

To fetch this version to local tag pr-2089/newren/batch-prefetching-v2:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-2089/newren/batch-prefetching-v2

gitgitgadget · 2026-04-18T05:18:07Z

This branch is now known as en/batch-prefetch.

gitgitgadget · 2026-04-19T14:26:38Z

@@ -21,10 +21,12 @@
 #include "color.h"


Phillip Wood wrote on the Git mailing list (how to reply to this email):

Hi Elijah On 18/04/2026 01:32, Elijah Newren via GitGitGadget wrote: > From: Elijah Newren <newren@gmail.com> > > In partial clones, `git cherry` fetches necessary blobs on-demand one > at a time, which can be very slow. We would like to prefetch all > necessary blobs upfront. To do so, we need to be able to first figure > out which blobs are needed. "git rebase" without "--reapply-cherry-picks" suffers from this problem as well as it does the equivalent of "git log --cherry-pick". Is there any way to share prefetch_cherry_blobs() with the cherry-pick detection in revision.c? Thanks Phillip > `git cherry` does its work in a two-phase approach: first computing > header-only IDs (based on file paths and modes), then falling back to > full content-based IDs only when header-only IDs collide -- or, more > accurately, whenever the oidhash() of the header-only object_ids > collide. > > patch-ids.c handles this by creating an ids->patches hashmap that has > all the data we need, but the problem is that any attempt to query the > hashmap will invoke the patch_id_neq() function on any colliding objects, > which causes the on-demand fetching. > > Insert a new prefetch_cherry_blobs() function before checking for > collisions. Use a temporary replacement on the ids->patches.cmpfn > in order to enumerate the blobs that would be needed without yet > fetching them, and then fetch them all at once, then restore the old > ids->patches.cmpfn. > > Signed-off-by: Elijah Newren <newren@gmail.com> > --- > builtin/log.c | 125 ++++++++++++++++++++++++++++++++++++++++++++++ > t/t3500-cherry.sh | 18 +++++++ > 2 files changed, 143 insertions(+) > > diff --git a/builtin/log.c b/builtin/log.c > index 8c0939dd42..df19876be6 100644 > --- a/builtin/log.c > +++ b/builtin/log.c > @@ -21,10 +21,12 @@ > #include "color.h" > #include "commit.h" > #include "diff.h" > +#include "diffcore.h" > #include "diff-merges.h" > #include "revision.h" > #include "log-tree.h" > #include "oid-array.h" > +#include "oidset.h" > #include "tag.h" > #include "reflog-walk.h" > #include "patch-ids.h" > @@ -43,9 +45,11 @@ > #include "utf8.h" > > #include "commit-reach.h" > +#include "promisor-remote.h" > #include "range-diff.h" > #include "tmp-objdir.h" > #include "tree.h" > +#include "userdiff.h" > #include "write-or-die.h" > > #define MAIL_DEFAULT_WRAP 72 > @@ -2602,6 +2606,125 @@ static void print_commit(char sign, struct commit *commit, int verbose, > } > } > > +/* > + * Enumerate blob OIDs from a single commit's diff, inserting them into blobs. > + * Skips files whose userdiff driver explicitly declares binary status > + * (drv->binary > 0), since patch-ID uses oid_to_hex() for those and > + * never reads blob content. Use userdiff_find_by_path() since > + * diff_filespec_load_driver() is static in diff.c. > + * > + * Clean up with diff_queue_clear() (from diffcore.h). > + */ > +static void collect_diff_blob_oids(struct commit *commit, > + struct diff_options *opts, > + struct oidset *blobs) > +{ > + struct diff_queue_struct *q; > + > + /* > + * Merge commits are filtered out by patch_id_defined() in patch-ids.c, > + * so we'll never be called with one. > + */ > + assert(!commit->parents || !commit->parents->next); > + > + if (commit->parents) > + diff_tree_oid(&commit->parents->item->object.oid, > + &commit->object.oid, "", opts); > + else > + diff_root_tree_oid(&commit->object.oid, "", opts); > + diffcore_std(opts); > + > + q = &diff_queued_diff; > + for (int i = 0; i < q->nr; i++) { > + struct diff_filepair *p = q->queue[i]; > + struct userdiff_driver *drv; > + > + /* Skip binary files */ > + drv = userdiff_find_by_path(opts->repo->index, p->one->path); > + if (drv && drv->binary > 0) > + continue; > + > + if (DIFF_FILE_VALID(p->one)) > + oidset_insert(blobs, &p->one->oid); > + if (DIFF_FILE_VALID(p->two)) > + oidset_insert(blobs, &p->two->oid); > + } > + diff_queue_clear(q); > +} > + > +static int always_match(const void *cmp_data UNUSED, > + const struct hashmap_entry *entry1 UNUSED, > + const struct hashmap_entry *entry2 UNUSED, > + const void *keydata UNUSED) > +{ > + return 0; > +} > + > +/* > + * Prefetch blobs for git cherry in partial clones. > + * > + * Called between the revision walk (which builds the head-side > + * commit list) and the has_commit_patch_id() comparison loop. > + * > + * Uses a cmpfn-swap trick to avoid reading blobs: temporarily > + * replaces the hashmap's comparison function with a trivial > + * always-match function, so hashmap_get()/hashmap_get_next() match > + * any entry with the same oidhash bucket. These are the set of oids > + * that would trigger patch_id_neq() during normal lookup and cause > + * blobs to be read on demand, and we want to prefetch them all at > + * once instead. > + */ > +static void prefetch_cherry_blobs(struct repository *repo, > + struct commit_list *list, > + struct patch_ids *ids) > +{ > + struct oidset blobs = OIDSET_INIT; > + hashmap_cmp_fn original_cmpfn; > + > + /* Exit if we're not in a partial clone */ > + if (!repo_has_promisor_remote(repo)) > + return; > + > + /* Save original cmpfn, replace with always_match */ > + original_cmpfn = ids->patches.cmpfn; > + ids->patches.cmpfn = always_match; > + > + /* Find header-only collisions, gather blobs from those commits */ > + for (struct commit_list *l = list; l; l = l->next) { > + struct commit *c = l->item; > + bool match_found = false; > + for (struct patch_id *cur = patch_id_iter_first(c, ids); > + cur; > + cur = patch_id_iter_next(cur, ids)) { > + match_found = true; > + collect_diff_blob_oids(cur->commit, &ids->diffopts, > + &blobs); > + } > + if (match_found) > + collect_diff_blob_oids(c, &ids->diffopts, &blobs); > + } > + > + /* Restore original cmpfn */ > + ids->patches.cmpfn = original_cmpfn; > + > + /* If we have any blobs to fetch, fetch them */ > + if (oidset_size(&blobs)) { > + struct oid_array to_fetch = OID_ARRAY_INIT; > + struct oidset_iter iter; > + const struct object_id *oid; > + > + oidset_iter_init(&blobs, &iter); > + while ((oid = oidset_iter_next(&iter))) > + oid_array_append(&to_fetch, oid); > + > + promisor_remote_get_direct(repo, to_fetch.oid, to_fetch.nr); > + > + oid_array_clear(&to_fetch); > + } > + > + oidset_clear(&blobs); > +} > + > int cmd_cherry(int argc, > const char **argv, > const char *prefix, > @@ -2673,6 +2796,8 @@ int cmd_cherry(int argc, > commit_list_insert(commit, &list); > } > > + prefetch_cherry_blobs(the_repository, list, &ids); > + > for (struct commit_list *l = list; l; l = l->next) { > char sign = '+'; > > diff --git a/t/t3500-cherry.sh b/t/t3500-cherry.sh > index 78c3eac54b..17507d9a28 100755 > --- a/t/t3500-cherry.sh > +++ b/t/t3500-cherry.sh > @@ -78,4 +78,22 @@ test_expect_success 'cherry ignores whitespace' ' > test_cmp expect actual > ' > > +# Reuse the expect file from the previous test, in a partial clone > +test_expect_success 'cherry in partial clone does bulk prefetch' ' > + test_config uploadpack.allowfilter 1 && > + test_config uploadpack.allowanysha1inwant 1 && > + test_when_finished "rm -rf copy" && > + > + git clone --bare --filter=blob:none file://"$(pwd)" copy && > + ( > + cd copy && > + GIT_TRACE2_EVENT="$(pwd)/trace.output" git cherry upstream-with-space feature-without-space >actual && > + test_cmp ../expect actual && > + > + grep "child_start.*fetch.negotiationAlgorithm" trace.output >fetches && > + test_line_count = 1 fetches && > + test_trace2_data promisor fetch_count 4 <trace.output > + ) > +' > + > test_done

gitgitgadget · 2026-04-19T14:26:39Z

User Phillip Wood <phillip.wood123@gmail.com> has been added to the cc: list.

gitgitgadget bot reviewed Apr 17, 2026

View reviewed changes

Comment thread builtin/log.c

newren added 3 commits April 17, 2026 15:04

patch-ids.h: add missing trailing parenthesis in documentation comment

663816a

Signed-off-by: Elijah Newren <newren@gmail.com>

newren force-pushed the batch-prefetching branch from 6dbfc76 to 8fbfe69 Compare April 17, 2026 22:05

gitgitgadget bot reviewed Apr 19, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Batch prefetching#2089

Batch prefetching#2089
newren wants to merge 3 commits intogitgitgadget:masterfrom
newren:batch-prefetching

newren commented Apr 15, 2026 •

edited by gitgitgadget bot

Loading

Uh oh!

newren commented Apr 16, 2026

Uh oh!

gitgitgadget bot commented Apr 16, 2026

Uh oh!

Uh oh!

gitgitgadget bot commented Apr 17, 2026

Uh oh!

newren commented Apr 18, 2026

Uh oh!

gitgitgadget bot commented Apr 18, 2026

Uh oh!

gitgitgadget bot commented Apr 18, 2026

Uh oh!

gitgitgadget bot Apr 19, 2026

Uh oh!

gitgitgadget bot commented Apr 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

newren commented Apr 15, 2026 • edited by gitgitgadget bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

newren commented Apr 16, 2026

Uh oh!

gitgitgadget bot commented Apr 16, 2026

Uh oh!

Uh oh!

gitgitgadget bot commented Apr 17, 2026

Uh oh!

newren commented Apr 18, 2026

Uh oh!

gitgitgadget bot commented Apr 18, 2026

Uh oh!

gitgitgadget bot commented Apr 18, 2026

Uh oh!

gitgitgadget bot Apr 19, 2026

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot commented Apr 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

newren commented Apr 15, 2026 •

edited by gitgitgadget bot

Loading