When running the garbage collection for a repository it is often
interesting to compare the repository statistics from before and after
the garbage collection to understand the effect of the garbage
collection. This is why it makes sense that the
GarbageCollectionCommand provides a method to retrieve the repository
statistics before running the garbage collection.
So far, without running the garbage collection, the repository statistics
can only be retrieved by using JGit internal classes. This is what EGit
and Gerrit do at the moment, but it would be better to have an API for
this.
Change-Id: Id7e579157e9fbef5cfd1fc9f97ada45f0ca8c379
Signed-off-by: Edwin Kempin <edwin.kempin@sap.com>
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
Only on Windows, the rename operation which renames temporary Packfiles
(and index-files and bitmap-files) sometimes fails. This happens only
when renaming a temporary Packfile to a Packfile which already exists.
Such situations occur if you run GC twice on a repo without modifying
the repo in between.
In such situations there was a bug in GC which led to a corrupted repo
without any packfiles. This commit fixes the problem by
introducing a utility method which renames a file and throws an
IOException if it fails. This method also takes care to repeat a
failing rename if our FS class has found out we are running on a
platform with an unreliable File.renameTo() method.
I am searching for a better solution because even with this utility
method in hand a GC on an already GC'ed repo will fail on Windows. But
at least with this fix we will not produce corrupted repos anymore.
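The retrying rename described above could be sketched roughly as follows. This is only an illustration: the class name, the attempt count and the sleep interval are assumptions, not the utility method this commit actually adds.

```java
import java.io.File;
import java.io.IOException;

class RenameWithRetry {
    /**
     * Hypothetical helper: rename src to dst, retrying up to `attempts`
     * times, and throw an IOException if every attempt fails.
     */
    static void rename(File src, File dst, int attempts, long sleepMillis)
            throws IOException {
        for (int i = 0; i < attempts; i++) {
            if (src.renameTo(dst)) {
                return; // rename succeeded
            }
            if (i + 1 < attempts) {
                try {
                    Thread.sleep(sleepMillis); // wait before the next attempt
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IOException("Interrupted while renaming " + src, e);
                }
            }
        }
        throw new IOException("Could not rename " + src + " to " + dst);
    }

    public static void main(String[] args) throws IOException {
        File src = File.createTempFile("gc_", ".pack_tmp");
        File dst = new File(src.getParentFile(), src.getName() + ".renamed");
        rename(src, dst, 3, 50);
        System.out.println("renamed: " + dst.exists());
        dst.delete();
    }
}
```

The point of throwing instead of returning false is that a caller cannot silently continue after a failed rename and delete the old packs.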
Bug: 389305
Change-Id: Iac1ab3e0b8c419c90404f2e2f3559672eb8f6d28
Signed-off-by: Christian Halstrick <christian.halstrick@sap.com>
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
With JGit it is possible to write reflog entries where the new objectid
and the old objectid are null. Such reflogs cause FileRepository GC to crash
because it doesn't expect the new objectid to be null. One case where
this happened is in Gerrit's allProjects repo. In the same way as we
expect the old objectid to be potentially null we should also ignore
null values in the new objectid column.
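A minimal sketch of the null handling described above. The Entry class and the hex-string ids are hypothetical stand-ins, not JGit's reflog types:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class ReflogIds {
    /** Hypothetical reflog entry; either id may be null. */
    static final class Entry {
        final String oldId;
        final String newId;

        Entry(String oldId, String newId) {
            this.oldId = oldId;
            this.newId = newId;
        }
    }

    /** Collect every id mentioned in the reflog, skipping null columns. */
    static Set<String> collectIds(List<Entry> entries) {
        Set<String> ids = new LinkedHashSet<>();
        for (Entry e : entries) {
            if (e.oldId != null) {
                ids.add(e.oldId);
            }
            if (e.newId != null) { // treated the same way as the old id
                ids.add(e.newId);
            }
        }
        return ids;
    }
}
```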
Change-Id: Icf666c7ef803179b84306ca8deb602369b8df16e
This breaks all existing callers once. Applications are not supposed
to build against the internal storage API unless they can accept API
churn and make necessary updates as versions change.
Change-Id: I2ab1327c202ef2003565e1b0770a583970e432e9
* changes:
Remove cached_packs support in favor of bitmaps
Remove objects before optimization from DfsGarbageCollector
Simplify caching of DfsPackDescription from PackWriter.Statistics
As noticed by Robin Rosenberg in review of
I4eb87c850078ca187b38b81cc91c92afb1176945.
Change-Id: If96d66b6c025ad8f2f47829c933f3c65ab6cbeef
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
Continuing is trickier, as .git/rebase-apply contains no message file
and no git-rebase-todo.
Bug: 336820
Change-Id: I4eb87c850078ca187b38b81cc91c92afb1176945
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
Treat first parent traversals as 1 and higher parents as MERGE_COST,
to match git name-rev. Allow overriding the merge cost during tests to
avoid creating 2^16 commits on the fly.
Change-Id: I0175e0c3ab1abe6722e4241abe2f106d1fe92a69
This is helpful for writing the pack configuration into a log file.
Change-Id: I5e7f5ff7e01c9538ca12a1860844ba9b467bdf05
Signed-off-by: Edwin Kempin <edwin.kempin@sap.com>
This is helpful for writing the repository statistics into a log file.
Change-Id: I0e8cd9ad05f123ab3851960890a50213f353a373
Signed-off-by: Edwin Kempin <edwin.kempin@sap.com>
The bitmap code in PackWriter knows exactly when to use a pack as
a "cached pack". It enables cached pack usage only when the pack
has a bitmap and its entire closure of objects needs to be sent.
This is a much simpler code path to maintain, and JGit actually
has a way to write the necessary index.
Change-Id: I2645d482f8733fdf0c4120cc59ba9aa4d4ba6881
Just counting objects is not sufficient. There are some race
conditions with receive packs and delta base completion that
may confuse such a simple algorithm.
Instead always do the larger set computations, and rely on the
PackWriter having no objects pending as the way to avoid creating
an empty pack file.
Change-Id: Ic81fefb158ed6ef8d6522062f2be0338a49f6bc4
Let the pack description copy the relevant stats values. This
moves it out of the garbage collector and compactor algorithms,
co-locating with something that might care.
Remove some unnecessary code from the DfsPackCompactor; the stats
track the same information and can supply it.
Change-Id: Id64ab38d507c0ed19ae0d106862d175b7364eba3
Prefer ~(N+1) to ^1~N. Although both are correct, the former is
cleaner and matches "git name-rev".
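To illustrate the preferred form, here is a small sketch that builds a suffix from a list of parent steps and collapses runs of first-parent steps into a single ~N. The representation of a path as an array of zero-based parent indexes is an assumption made for this example only.

```java
class NameRevSuffix {
    /**
     * Build a name-rev style suffix from parent steps, where 0 means
     * "first parent" and k > 0 means "parent number k + 1".
     * Consecutive first-parent steps are collapsed into one ~N.
     */
    static String suffix(int[] parentSteps) {
        StringBuilder sb = new StringBuilder();
        int run = 0; // length of the current run of first-parent steps
        for (int p : parentSteps) {
            if (p == 0) {
                run++;
                continue;
            }
            if (run > 0) {
                sb.append('~').append(run);
                run = 0;
            }
            sb.append('^').append(p + 1);
        }
        if (run > 0) {
            sb.append('~').append(run);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Three first-parent steps become "~3" rather than "^1~2".
        System.out.println("v1.0" + suffix(new int[] {0, 0, 0}));
    }
}
```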
Change-Id: I772001a219e5eb346f5552c92e6d98c70b2cfa98
The walk logic does not use RevWalk because it needs to walk all paths
to each of the requested commits, keeping track of each path along which
the commit was found in the RevCommit subclass. From these paths, a
single "best" path is chosen based on the total path length, with a
penalty applied for paths that traverse merges.
This functionality parallels "git name-rev".
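The "best path" choice can be sketched as a simple cost comparison. The sketch assumes a path is an array of zero-based parent indexes and that the merge penalty is passed in as a parameter; the value 65535 used in main is only an illustrative assumption, and none of these names come from the actual implementation.

```java
import java.util.Arrays;
import java.util.List;

class NameRevPathCost {
    /** Each first-parent step costs 1, any other parent costs mergeCost. */
    static long cost(int[] parentSteps, long mergeCost) {
        long total = 0;
        for (int p : parentSteps) {
            total += (p == 0) ? 1 : mergeCost;
        }
        return total;
    }

    /** Return the cheapest of the candidate paths (first one wins ties). */
    static int[] best(List<int[]> paths, long mergeCost) {
        int[] best = null;
        long bestCost = Long.MAX_VALUE;
        for (int[] path : paths) {
            long c = cost(path, mergeCost);
            if (c < bestCost) {
                best = path;
                bestCost = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int[] linear = {0, 0, 0, 0, 0}; // five first-parent steps
        int[] viaMerge = {1, 0};        // shorter, but crosses a merge
        List<int[]> paths = Arrays.asList(linear, viaMerge);
        System.out.println(Arrays.toString(best(paths, 65535)));
        System.out.println(Arrays.toString(best(paths, 2)));
    }
}
```

With a large penalty the longer linear path wins; with a small penalty the path through the merge wins, which is why a test can exercise both outcomes without building a very long history.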
Change-Id: I92bfb47dd16c898313d2ee525395609c3bf72ebe
This fixes two cases:
- A folder without tracked content exists in both the workdir and the
merged commit, as long as the names within that folder do not conflict.
- An empty folder structure exists with the same name as a file in the
merged commit.
Bug: 402834
Change-Id: I4c5b9f11313dd1665fcbdae2d0755fdb64deb3ef
Clients send a bunch of unknown objects to UploadPack on each round
of negotiation. Many of these are not known to the server, which
leads the implementation to look at the indexes of garbage packs.
Disable examining the index of a garbage pack, allowing servers to
avoid reading them from disk during negotiation.
The effect of this change is the server will only ACK a have line
if the object was reachable during the last garbage collection,
or was recently added to the repository. For most repositories
this behavior change has no impact.
If a repository rewinds a branch, runs GC, and then resets the
branch back to where it was before, the now current tip is going to
be skipped by this change. A client that has the commit may wind up
getting a slightly larger data transfer from the server as an older
common ancestor will be chosen during negotiation. This is fixable
on the server side by running GC again to correct the layout of
objects in pack files.
Change-Id: Icd550359ef70fc7b701980f9b13d923fd13c744b
The DHT backend was very slow at parsing objects. To work around
that performance limitation I obfuscated UploadPack by folding both
the want and have sets together in a single parse queue. Since DHT
was removed the complexity is no longer constructive to JGit.
Doing this refactoring prepares the code for a future
change where the have lines need to be handled differently from the
want lines. Splitting the parsing up into two phases makes such
a modification trivial.
Change-Id: If7aad533b82448bbb688278e21f709282e5ccf4b
Garbage is unlikely to be used by a reader. Ensure garbage packs always
cluster at the end of the search list, no matter what timestamp
was used on the pack files.
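One way to express such an ordering is a comparator that checks the garbage flag before the timestamp. The Pack class and the newest-first ordering among non-garbage packs are assumptions for this sketch, not the actual pack list code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class PackSearchOrder {
    /** Hypothetical pack description with only the fields needed here. */
    static final class Pack {
        final String name;
        final long lastModified;
        final boolean garbage;

        Pack(String name, long lastModified, boolean garbage) {
            this.name = name;
            this.lastModified = lastModified;
            this.garbage = garbage;
        }
    }

    /** Garbage packs sort last; otherwise newer packs come first. */
    static final Comparator<Pack> SEARCH_ORDER = (a, b) -> {
        if (a.garbage != b.garbage) {
            return a.garbage ? 1 : -1;
        }
        return Long.compare(b.lastModified, a.lastModified);
    };

    static List<String> sortedNames(List<Pack> packs) {
        List<Pack> copy = new ArrayList<>(packs);
        copy.sort(SEARCH_ORDER);
        List<String> names = new ArrayList<>();
        for (Pack p : copy) {
            names.add(p.name);
        }
        return names;
    }
}
```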
Change-Id: I3bed89e9569ee3363c36bb3f73fcd34057a3883f
If a repository has significant amounts of unreachable garbage the
final phase to coalesce it can take longer than any other part of the
garbage collection phase. Provide a setting for applications to tweak
the threshold where coalescing ends and files just remain on disk.
Change-Id: I5f11a998a7185c75ece3271d8bc6181bb83f54c1
Rebase computes the list of commits that are included in
the merges, just like Git does, so do not try to include
the merge commits. Re-creating merges during rebase is
a bit more complicated and might be a useful future extension,
but for now just linearize during rebase.
Change-Id: I61239d265f395e5ead580df2528e46393dc6bdbd
Signed-off-by: Robin Stocker <robin@nibor.org>
The new option EMPTY_DIRECTORIES_ONLY will make delete() only delete
empty directories. Any attempt to delete files will fail. Can be
combined with RECURSIVE to wipe out entire tree structures and
IGNORE_ERRORS to silently ignore any files or non-empty directories.
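The interaction of the three options can be sketched with plain java.io.File. The enum and the method signature below are hypothetical and chosen for readability; they are not the signature of the method this commit changes.

```java
import java.io.File;
import java.io.IOException;
import java.util.Set;

class EmptyDirDelete {
    /** Hypothetical option set mirroring the names used above. */
    enum Option { RECURSIVE, EMPTY_DIRECTORIES_ONLY, IGNORE_ERRORS }

    static void delete(File f, Set<Option> opts) throws IOException {
        if (f.isDirectory() && opts.contains(Option.RECURSIVE)) {
            File[] children = f.listFiles();
            if (children != null) {
                for (File child : children) {
                    delete(child, opts);
                }
            }
        }
        // With EMPTY_DIRECTORIES_ONLY, plain files are never deleted.
        boolean allowed = f.isDirectory()
                || !opts.contains(Option.EMPTY_DIRECTORIES_ONLY);
        // File.delete() refuses non-empty directories, so those fail too.
        boolean deleted = allowed && f.delete();
        if (!deleted && !opts.contains(Option.IGNORE_ERRORS)) {
            throw new IOException("Could not delete " + f);
        }
    }
}
```

Calling delete with all three options on a tree removes every empty directory chain and leaves any directory that still contains a file untouched.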
Change-Id: Icaa9a30e5302ee5c0ba23daad11c7b93e26b7445
Signed-off-by: Robin Stocker <robin@nibor.org>
This ensures that OSGi consumers can retrieve this dependency from the
JGit or EGit p2 repository.
Change-Id: I6f88a4914a19e4e18aa60d59b0cc8a33b61f7fc2
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
If a pack file has been marked invalid due to a prior IOException
accessing its contents, do not offer its bitmap index to callers.
The pack cannot be used so its bitmap should be off limits from
any reader trying to work from a bitmap.
Change-Id: Ia44e46558abdddee560bb184158b1e0af9437eee
Bitmaps provide a huge performance boost for counting objects and they
play nice with the cgit implementation.
Change-Id: I33b05a6c8f1ee2df7770f0b9fdc50d0b4bbf1029
Update the dfs and file GC implementations to prepare and write
bitmaps on the packs that contain the full closure of the object
graph. Update the DfsPackDescription to include the index version.
Change-Id: I3f1421e9cd90fe93e7e2ef2b8179ae2f1ba819ed
Update the PackWriter to support writing out pack bitmap indexes,
a parallel ".bitmap" file to the ".pack" file.
Bitmaps are selected at commits every 1 to 5,000 commits for
each unique path from the start. The most recent 100 commits are
all bitmapped. The next 19,000 commits have a bitmap every 100
commits. The remaining commits have a bitmap every 5,000 commits.
Commits with more than 1 parent are preferred over ones
with 1 or fewer. Furthermore, previously computed bitmaps are reused,
if the previous entry had the reuse flag set, which is set when the
bitmap was placed at the max allowed distance.
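The spacing rule can be illustrated for the simplest case, a purely linear history. The sketch below deliberately ignores branching, the preference for merge commits and bitmap reuse, and the exact alignment of the selected commits is an assumption.

```java
class BitmapSpacing {
    /**
     * Decide whether the commit at the given distance from the tip
     * (0 = tip) would get a bitmap in a linear history.
     */
    static boolean isSelected(int distanceFromTip) {
        if (distanceFromTip < 100) {
            return true; // most recent 100 commits: all bitmapped
        }
        if (distanceFromTip < 100 + 19_000) {
            return (distanceFromTip - 100) % 100 == 0; // every 100 commits
        }
        return (distanceFromTip - 19_100) % 5_000 == 0; // every 5,000 commits
    }

    /** Count the selected commits in a linear history of the given length. */
    static int countSelected(int commits) {
        int n = 0;
        for (int d = 0; d < commits; d++) {
            if (isSelected(d)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(countSelected(300_000));
    }
}
```

Under these assumptions only a few hundred of 300,000 commits carry a bitmap, which keeps the index small while still placing a bitmap close to any recent commit.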
Bitmaps are used to speed up the counting phase when packing, for
requests that are not shallow. The PackWriterBitmapWalker uses
a RevFilter to proactively mark commits with RevFlag.SEEN, when
they appear in a bitmap. The walker produces the full closure
of reachable ObjectIds, given the collection of starting ObjectIds.
For fetch requests, two ObjectWalks are executed to compute the
ObjectIds reachable from the haves and from the wants. The
ObjectIds needed to be written are determined by taking all the
resulting wants AND NOT the haves.
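The set operation itself can be shown with java.util.BitSet as an uncompressed stand-in for the compressed bitmaps. Bit positions stand for objects; the class and method names are illustrative only.

```java
import java.util.BitSet;

class WantsAndNotHaves {
    /** Objects to write: reachable from the wants but not from the haves. */
    static BitSet toSend(BitSet wants, BitSet haves) {
        BitSet result = (BitSet) wants.clone(); // leave the inputs untouched
        result.andNot(haves);
        return result;
    }

    public static void main(String[] args) {
        BitSet wants = new BitSet();
        wants.set(0, 5); // objects 0..4 are reachable from the wants
        BitSet haves = new BitSet();
        haves.set(0, 3); // objects 0..2 are already on the client
        System.out.println(toSend(wants, haves)); // prints {3, 4}
    }
}
```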
For clone requests, we get cached pack support for "free" since
it is possible to determine if all of the ObjectIds in a pack file
are included in the resulting list of ObjectIds to write.
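The containment test behind this can again be sketched with java.util.BitSet as a stand-in for the compressed bitmaps. The names are hypothetical.

```java
import java.util.BitSet;

class CachedPackCheck {
    /**
     * True if every object stored in the pack is also in the set of
     * objects to write, so the pack could be sent as a whole.
     */
    static boolean packFullyIncluded(BitSet packObjects, BitSet toWrite) {
        BitSet missing = (BitSet) packObjects.clone();
        missing.andNot(toWrite); // pack objects that are NOT requested
        return missing.isEmpty();
    }
}
```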
On my machine, the best times for clones and fetches of the linux
kernel repository (with about 2.6M objects and 300K commits) are
tabulated below:
Operation                     Index V2                Index VE003
Clone                         37530ms (524.06 MiB)       82ms (524.06 MiB)
Fetch (1 commit back)            75ms                   107ms
Fetch (10 commits back)         456ms (269.51 KiB)      341ms (265.19 KiB)
Fetch (100 commits back)        449ms (269.91 KiB)      337ms (267.28 KiB)
Fetch (1000 commits back)      2229ms ( 14.75 MiB)      189ms ( 14.42 MiB)
Fetch (10000 commits back)     2177ms ( 16.30 MiB)      254ms ( 15.88 MiB)
Fetch (100000 commits back)   14340ms (185.83 MiB)     1655ms (189.39 MiB)
Change-Id: Icdb0cdd66ff168917fb9ef17b96093990cc6a98d
A pack bitmap index is an additional index of compressed
bitmaps of the object graph. Furthermore, a logical API of the index
functionality is included, as it is expected to be used by the
PackWriter.
Compressed bitmaps are created using the javaewah library, which is a
word-aligned compressed variant of the Java bitset class based on
run-length encoding. The library only works with positive integer
values. Thus, the maximum number of ObjectIds in a pack file that
this index can currently support is limited to Integer.MAX_VALUE.
Every ObjectId is given an integer mapping. The integer is the
position of the ObjectId in the complete ObjectId list, sorted
by offset, for the pack file. That integer is what the bitmaps
use to reference the ObjectId. Currently, the new index format can
only be used with pack files that contain a complete closure of the
object graph e.g. the result of a garbage collection.
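The integer mapping can be sketched as follows, with hex strings standing in for ObjectIds and a plain map standing in for the pack index. Both are assumptions made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class OffsetPositions {
    /**
     * Assign each id its position in the list of ids sorted by pack offset.
     * The returned position is the bit index a bitmap would use for that id.
     */
    static Map<String, Integer> positionsByOffset(Map<String, Long> offsetById) {
        List<Map.Entry<String, Long>> entries =
                new ArrayList<>(offsetById.entrySet());
        entries.sort(Map.Entry.comparingByValue());
        Map<String, Integer> positions = new LinkedHashMap<>();
        for (int i = 0; i < entries.size(); i++) {
            positions.put(entries.get(i).getKey(), i);
        }
        return positions;
    }
}
```

Because positions are plain ints, this scheme shares the Integer.MAX_VALUE limit mentioned above.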
The index file includes four bitmaps for the Git object types i.e.
commits, trees, blobs, and tags. In addition, a collection of
bitmaps keyed by an ObjectId is also included. The bitmap for each entry
in the collection represents the full closure of ObjectIds reachable
from the keyed ObjectId (including the keyed ObjectId itself). The
bitmaps are further compressed by XORing the current bitmaps against
prior bitmaps in the index, and selecting the smallest representation.
The XOR'd bitmap and offset from the current entry to the position
of the bitmap to XOR against is the actual representation of the entry
in the index file. Each entry contains one byte, which is currently
used to note whether the bitmap should be blindly reused.
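A rough sketch of the XOR selection follows. It uses java.util.BitSet instead of compressed bitmaps and takes the number of set bits as a stand-in for the compressed size, which is only an approximation; the lookback window and all names are assumptions.

```java
import java.util.BitSet;
import java.util.List;

class XorSelect {
    /** Stored form: XOR'd bits plus the offset back to the base (0 = none). */
    static final class Stored {
        final int xorOffset;
        final BitSet bits;

        Stored(int xorOffset, BitSet bits) {
            this.xorOffset = xorOffset;
            this.bits = bits;
        }
    }

    /** Pick the smallest representation among the raw bitmap and XORs. */
    static Stored encode(BitSet current, List<BitSet> prior, int maxLookback) {
        BitSet best = (BitSet) current.clone();
        int bestOffset = 0;
        for (int off = 1; off <= maxLookback && off <= prior.size(); off++) {
            BitSet candidate = (BitSet) current.clone();
            candidate.xor(prior.get(prior.size() - off));
            if (candidate.cardinality() < best.cardinality()) {
                best = candidate;
                bestOffset = off;
            }
        }
        return new Stored(bestOffset, best);
    }

    /** Undo the XOR to recover the original bitmap. */
    static BitSet decode(Stored stored, List<BitSet> prior) {
        BitSet out = (BitSet) stored.bits.clone();
        if (stored.xorOffset > 0) {
            out.xor(prior.get(prior.size() - stored.xorOffset));
        }
        return out;
    }
}
```

Neighbouring commits reach almost the same objects, so the XOR against a nearby bitmap usually has very few set bits and compresses well.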
Change-Id: Id328724bf6b4c8366a088233098c18643edcf40f
Update the ObjectReuseAsIs API to support creating new
ObjectToPack with only the AnyObjectId and Git object type. This is
needed to support the future pack index bitmaps, which only contain
this information and do not want the overhead of creating a temporary
object for every ObjectId.
Change-Id: I906360b471412688bf429ecef74fd988f47875dc