The FollowFilter can be installed on a RevWalk to cause the path
to be updated through rename detection when the affected file is
found to be added to the project.
The filter works reasonably well, for example we can follow the
history of the fsck command in git-core:
$ jgit log --name-status --follow builtin/fsck.c | grep ^R
R100 builtin-fsck.c builtin/fsck.c
R099 fsck.c builtin-fsck.c
R099 fsck-objects.c fsck.c
R099 fsck-cache.c fsck-objects.c
Change-Id: I4017bcfd150126aa342fdd423a688493ca660a1f
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This way we don't have to reparse for the rename limit every time
we create a new rename detector for a repository.
Change-Id: I669d031690b85ef4da5e39189be7173fb773fc56
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Similar to what we did with diff, implement whitespace ignore options
for log too. This requires us to define some means of creating any
RawText object type at will inside of DiffFormatter, so we define a
new factory interface to construct RawText instances on demand.
Unfortunately we have to copy the entire block of common options.
args4j only processes the options/arguments on the one command class
and Java doesn't support multiple inheritance.
Change-Id: Ia16cd3a11b850fffae9fbe7b721d7e43f1d0e8a5
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Instead of crashing, output a submodule link with the simple
"Subproject commit $fullid\n" syntax used by C Git.
Change-Id: Iae8646941683fb19b73fb038217d2e3bf5f77fa9
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Passing around the OutputStream and the Repository is crazy. Instead
put the stream in the constructor, since this formatter exists only to
output to the stream, and put the repository as a member variable that
can be optionally set.
Change-Id: I2bad012fee7f40dc1346700ebd19f1e048982878
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Implement rename detection in the command line diff and log commands.
Also support --name-status, -p and -U flags, as these can be quite
useful to view more detail.
All of the Git patch file formatting code is now moved over to the
DiffFormatter class. This permits us to reuse it in any context,
including inside of IDEs.
Change-Id: I687ccba34e18105a07e0a439d2181c323209d96c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Content similarity based rename detection is performed only after
a linear time detection is performed using exact content match on
the ObjectIds. Any names which were paired up during that exact
match phase are excluded from the inexact similarity based rename,
which reduces the space that must be considered.
During rename detection two entries cannot be marked as a rename
if they are different types of files. This prevents a symlink from
being renamed to a regular file, even if their blob content appears
to be similar, or is identical.
Efficiently comparing two files is performed by building up two
hash indexes and hashing lines or short blocks from each file,
counting the number of bytes that each line or block represents.
Instead of using a standard java.util.HashMap, we use a custom
open hashing scheme similiar to what we use in ObjecIdSubclassMap.
This permits us to have a very light-weight hash, with very little
memory overhead per cell stored.
As we only need two ints per record in the map (line/block key and
number of bytes), we collapse them into a single long inside of
a long array, making very efficient use of available memory when
we create the index table. We only need object headers for the
index structure itself, and the index table, but not per-cell.
This offers a massive space savings over using java.util.HashMap.
The score calculation is done by approximating how many bytes are
the same between the two inputs (which for a delta would be how much
is copied from the base into the result). The score is derived by
dividing the approximate number of bytes in common into the length
of the larger of the two input files.
Right now the SimilarityIndex table should average about 1/2 full,
which means we waste about 50% of our memory on empty entries
after we are done indexing a file and sort the table's contents.
If memory becomes an issue we could discard the table and copy all
records over to a new array that is properly sized.
Building the index requires O(M + N log N) time, where M is the
size of the input file in bytes, and N is the number of unique
lines/blocks in the file. The N log N time constraint comes
from the sort of the index table that is necessary to perform
linear time matching against another SimilarityIndex created for
a different file.
To actually perform the rename detection, a SxD matrix is created,
placing the sources (aka deletions) along one dimension and the
destinations (aka additions) along the other. A simple O(S x D)
loop examines every cell in this matrix.
A SimilarityIndex is built along the row and reused for each
column compare along that row, avoiding the costly index rebuild
at the row level. A future improvement would be to load a smaller
square matrix into SimilarityIndexes and process everything in that
sub-matrix before discarding the column dimension and moving down
to the next sub-matrix block along that same grid of rows.
An optional ProgressMonitor is permitted to be passed in, allowing
applications to see the progress of the detector as it works through
the matrix cells. This provides some indication of current status
for very long running renames.
The default line/block hash function used by the SimilarityIndex
may not be optimal, and may produce too many collisions. It is
borrowed from RawText's hash, which is used to quickly skip out of
a longer equality test if two lines have different hash functions.
We may need to refine this hash in the future, in order to minimize
the number of collisions we get on common source files.
Based on a handful of test commits in JGit (especially my own
recent rename repository refactoring series), this rename detector
produces output that is very close to C Git. The content similarity
scores are sometimes off by 1%, which is most probably caused by
our SimilarityIndex type using a different hash function than C
Git uses when it computes the delta size between any two objects
in the rename matrix.
Bug: 318504
Change-Id: I11dff969e8a2e4cf252636d857d2113053bdd9dc
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Alex pointed out that my description of a bare repository might be
confusing for some readers. Reword the description of the error,
and make it consistent throughout the Repository class's API.
Change-Id: I87929ddd3005f578a7022f363270952d1f7f8664
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
During code review, Alex raised a few comments about commit
532421d989 ("Refactor repository construction to builder class").
Due to the size of the related series we aren't going to go back
and rebase in something this minor, so resolve them as a follow-up
commit instead.
Change-Id: Ied52f7a8f7252743353c58d20bfc3ec498933e00
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Now that any large objects are forced through a streaming loader
when its bigger than getStreamFileThreshold(), and that threshold
is pegged at Integer.MAX_VALUE as its largest size, we will never
be able to reach this code path where we threw OutOfMemoryError.
Robin pointed out that we probably should include a message here,
but the code is effectively unreachable, so there isn't any value
in adding a message at this point.
So remove it.
Change-Id: Ie611d005622e38a75537f1350246df0ab89dd500
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Since we don't know the type of object we are parsing, we don't
know if its a massive blob, or some small commit or annotated tag.
Avoid pulling the cached bytes until we have checked the type and
decided if we actually need them to continue parsing right now.
This way large blobs which won't fit in memory and would throw
a LargeObjectException don't abort parsing.
Change-Id: Ifb70df5d1c59f616aa20ee88898cb69524541636
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Callers don't necessarily need the getSize() result from a large
delta. They instead should be always using openStream() or copyTo()
for blobs going to local files, or they should be checking the
result of the constant-time isLarge() method to determine the type
of access they can use on the ObjectLoader. Avoid inflating the
delta instruction stream twice by delaying the decoding of the size
until after we have created the DeltaStream and decoded the header.
Likewise with the type, callers don't necessarily always need it
to be present in an ObjectLoader. Delay looking at it as late as
we can, thereby avoiding an ugly O(N^2) loop looking up the type
for every single object in the entire delta chain.
Change-Id: I6487b75b52a5d201d811a8baed2fb4fcd6431320
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
We default this to 1 MiB for now, but we allow users to modify
it through the Repository's configuration file to be a different
value. A new repository listener is used to identify when the
setting has been updated and trigger a reconfiguration of any
active ObjectReaders.
To prevent a horrible explosion we cap core.streamFileThreshold
at no more than 1/4 of the maximum JVM heap size. We do this
because we need at least 2 byte arrays equal in size to the
stream threshold for the worst case delta inflation scenario,
and our host application probably also needs some amount of the
heap for their working set size.
Change-Id: I103b3a541dc970bbf1a6d92917a12c5a1ee34d6c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Very large delta instruction streams, or deltas which use very large
base objects, are now streamed through as large objects rather than
being inflated into a byte array.
This isn't the most efficient way to access delta encoded content, as
we may need to rewind and reprocess the base object when there was a
block moved within the file, but it will at least prevent the JVM from
having its heap explode.
When streaming a delta we have an inflater open for each level in the
delta chain, to inflate the instruction set of the delta, as well as
an inflater for the base level object. The base object is buffered,
as is the top level delta requested by the application, but we do not
buffer the intermediate delta streams. This keeps memory usage lower,
so its closer to 1024 bytes per level in the chain, without having an
adverse impact on raw throughput as the top-level buffer gets pushed
down to the lowest stream that has the next region.
Delta instructions transparently collapse here, if the top level does
not copy a region from its base, the base won't materialize that part
from its own base, etc. This allows us to avoid copying around a lot
of segments which have been deleted from the final version.
Change-Id: I724d45245cebb4bad2deeae7b896fc55b2dd49b3
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Similar to the loose object support, whole packed objects can
now be streamed back to the caller. The streaming is less
efficient as we copy the data from the cached window array
into the InflaterInputStream's internal buffer, then inflate
it there before returning to the application.
Like with unpacked objects, there is plenty of room for some
optimization, especially for the copyTo method, where we don't
necessarily need so much buffering to exist.
Change-Id: Ie23be81289e37e24b91d17b0891e47b9da988008
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The class is identical, but ObjectLoader.SmallObject is part of our
public API for storage implementations to build on top of.
Change-Id: I381a3953b14870b6d3d74a9c295769ace78869dc
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Big loose objects can now be streamed if they are over the large
object size threshold. This prevents the JVM heap from exploding
with a very large byte array to hold the slurped file, and then
again with its uncompressed copy.
We may have slightly slowed down the simple case for small
loose objects, as the loader no longer slurps the entire thing
and decompresses in memory. To try and keep good performance
for the very common small objects that are below 8 KiB in size,
buffers are set to 8 KiB, causing the reader to slurp most of the
file anyway. However the data has to be copied at least once,
from the BufferedInputStream into the InflaterInputStream.
New unit tests are supplied to get nearly 100% code coverage on the
unpacked code paths, for both standard and pack style loose objects.
We tested a fair chunk of the code elsewhere, but these new tests
are better isolated to the specific branches in the code path.
Change-Id: I87b764ab1b84225e9b5619a2a55fd8eaa640e1fe
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
JGit does not currently do rename detection during diffs. I added
a class that, given a TreeWalk to iterate over, can output a list
of DiffEntry's for that TreeWalk, taking into account renames. This
class only detects renames by SHA1's. More complex rename detection,
along the lines of what C Git does will be added later.
Change-Id: I93606ce15da70df6660651ec322ea50718dd7c04
Assume that the argument of compareTo won't be mutated while we
are doing the compare, and support the wider AnyObjectId type so
MutableObjectId is suitable on either side of the compareTo call.
Change-Id: I2a63a496c0a7b04f0e5f27d588689c6d5e149d98
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This way we can stream a large file through memory, rather than
loading the entire thing into a single contiguous byte array.
Change-Id: I3ada2856af2bf518f072edec242667a486fb0df1
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Instead of loading the entire object as a byte array and passing
that into the deflater, let the ObjectLoader copy the object onto
the DeflaterOutputStream. This has the nice side effect of using
some sort of stride hack in the Sun implementation that may improve
compression performance.
Change-Id: I3f3d681b06af0da93ab96c75468e00e183ff32fe
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Only allocate the Deflater if we can't reuse everything, but also
make sure we release it when we release the PackWriter's resources.
Change-Id: I16a32b94647af0778658eda87acbafc9a25b314a
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Blobs that are too large to read as a single byte array should be
accessed through an InputStream based interface instead, allowing
the application to walk through the data stream incrementally.
Define the basic interface to support streaming contents, but don't
implement it yet for the file based backend.
Change-Id: If9e4442e9ef4ed52c3e0f1af9398199a73145516
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Refactored a superclass out of FileHeader called DiffEntry that holds
the more general data from FileHeader that is useful in rename
detection (old/new Ids, modes, names, as well as changeType and
score). FileHeader is now a DiffEntry that adds Hunks, parsing
abilities, etc.
Change-Id: I8398728cd218f8c6e98f7a4a7f2f342391d865e4
It is possible that StreamCopyThread will not flush everything
from it's src to it's dst. In most cases StreamCopyThread works
like this:
in loop:
n = src.read(buf);
dst.write(buf, 0, n);
and when we want to flush, we interrupt() StreamCopyThread and it
flushes everything it wrote to dst.
The problem is that our interrupt() could interrupt reading. In this
case we will flush everything we wrote to dst, but not everything
we wrote to src.
Change-Id: Ifaf4d8be87535c7364dd59b217dfc631460018ff
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Instead of creating the DirCache from a static factory method, use
an instance method on Repository, permitting the implementation to
override the method with a completely different type of DirCache
reading and writing. This would better support a repository in the
cloud strategy, or even just an in-memory unit test environment.
Change-Id: I6399894b12d6480c4b3ac84d10775dfd1b8d13e7
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Using a custom exception type makes it easire for an application
developer to understand why an exception was thrown out of a method
we declare. To remain compatiable with existing callers, we still
extend off IllegalStateException.
Change-Id: Ideeef2399b11ca460a2dbb3cd80eb76aa0a025ba
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Added a check in Diff to ensure that files that are most likely
not text are not line-by-line diffed. Files are determined to be
binary by checking the first 8000 bytes for a null character. This
is a similar heuristic to what C Git uses.
Change-Id: I2b6f05674c88d89b3f549a5db483f850f7f46c26
Update a number of calling sites of RevWalk to ensure the walker's
internal ObjectReader is released after the walk is no longer used.
Because the ObjectReader is likely to hold onto a native resource
like an Inflater, we don't want to leak them outside of their
useful scope.
Where possible we also try to share ObjectReaders across several
walk pools, or between a walker and a PackWriter. This permits
the ObjectReader to actually do some caching if it felt inclined
to do so.
Not everything was updated, we'll probably need to come back and
update even more call sites, but these are some of the biggest
offenders. Test cases in particular aren't updated. My plan is to
move most storage-agnostic tests onto some purely in-memory storage
solution that doesn't do compression.
Change-Id: I04087ec79faeea208b19848939898ad7172b6672
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Rather than building a custom reader, have the caller supply us one.
Change-Id: Ief2b5a6b1b75f05c8a6bc732a60d4d1041dd8254
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Instead of creating new ObjectReader for each walker, use one for
the entire connection and delegate reads through it.
Change-Id: I7f0a2ec8c9fe60b095a7be77dc423a2ff8b443a3
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
We don't actually need a Repository object here, just an ObjectReader
that can load content for us. So change the API to depend on that.
However, this breaks the asCommit and asTag legacy translation methods
on RevCommit and RevTag, so we still have to keep the Repository
inside of RevWalk for those two types. Hopefully we can drop those in
the future, and then drop the Repository off the RevWalk.
Change-Id: Iba983e48b663790061c43ae9ffbb77dfe6f4818e
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Added code to support ignoring leading, trailing, and changed
whitespace when performing a diff operation. I also added command
line options to Diff to enable the various whitespace ignoring
methods. These match the flags for git diff.
Change-Id: Ie56301aafad59ee3f0fe5de62719f5023cd702c8
We don't need this field to be volatile. Events are delivered by
the same thread that created the RepositoryEvent object, and thus
any cross-thread operations would need to be handled by some other
type of synchronization in the listener, and that would protect
both the repository field and any other per-event data.
Change-Id: Iefe345959e1a2d4669709dbf82962bcc1b8913e3
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Similar to what we did on Repository, the openObject method
already implied we wanted to open an object, given its main
argument was of type AnyObjectId. Simplify the method name
to just the action, has or open.
Change-Id: If055e5e0d8de0e2424c18a773f6d2bc2f66054f4
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
We drop the "Object" suffix, because its pretty clear here that
we want to open an object, given that we pass in AnyObjectId as
the main parameter. We also fix the calling convention to throw
a MissingObjectException or IncorrectObjectTypeException, so that
callers don't have to do this error checking themselves.
Change-Id: I72c43353cea8372278b032f5086d52082c1eee39
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Rather than taking the ProgressMonitor objects in our constructor and
carrying them around as instance fields, take them as arguments to the
actual time consuming operations we need to run.
Change-Id: I2b230d07e277de029b1061c807e67de5428cc1c4
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Rather than storing this in an instance member, pass it down the
calling stack. Its cleaner, we don't have to poke the stream as
a temporary field, and then unset it.
Change-Id: I0fd323371bc12edb10f0493bf11885d7057aeb13
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Going through ObjectReader.openObject(AnyObjectId) is faster, but
also produces cleaner application level code. The error checking
is done inside of the openObject method, which means it can be
removed from the application code.
Change-Id: Ia927b448d128005e1640362281585023582b1a3a
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
If the type hint isn't OBJ_ANY and it doesn't match the actual type
observed from the object store, define the reader to throw back an
IncorrectObjectTypeException. This way the caller doesn't have to
perform this check itself before it evaluates the object data, and
we can simplify quite a few call sites.
Change-Id: I9f0dfa033857f439c94245361fcae515bc0a6533
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
JGit did not have support for skipping whitespace when comparing
lines in RawText objects. I added a subclass of RawText that skips
whitespace in its equals and hashCode methods. I used a subclass
rather than adding functionality into RawText so that performance
would not be impacted by extra logic.
This class only supports ignoring all whitespace. Others will follow
that allow other forms of whitespace ignoring.
Change-Id: Ic2f79e85215e48d3fd53ec1b4ad13373dd183a4a
The ObjectReader API demands that we release the reader when we are
done with it. PackWriter contains a reader, which it uses for the
entire packing session. Expose the release of the reader through
a release method on the writer.
This still doesn't address the RevWalk and TreeWalk users, who
don't correctly release their reader. But its a small step in the
right direction.
Change-Id: I5cb0b5c1b432434a799fceb21b86479e09b84a0a
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
I don't want to play games with the order of release here, its
probably safer to release the reader before the database, just
in case the one depends on the other.
Change-Id: I2394c7d2477eaf7a7e1556fc3393c59d3b31e764
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
By doing the release at the higher level class, we can ensure
the release occurs if the inserter was allocated, even if the
implementation forgets to do this. Since the higher level class
is what allocated it, it makes sense to have it also do the release.
Change-Id: Id617b2db864c3208ed68cba4eda80e51612359ad
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Everyone else does. This must have been a spot I missed during
some sort of squash while developing the series.
Change-Id: I62eae50b618f47ee33ad7cf71fc05b724f603201
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Similar to what we did with the file code, move the pack writer
into its own package so the related classes and their package
private methods are hidden from the rest of the library.
Change-Id: Ic1b5c7c8c8d266e90c910d8d68dfc8e93586854f
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
We no longer need an ObjectLoader to be lazy and try to delay
the materialization of the object content. That was done only
to support PackWriter searching for a good reuse candidate.
Instead, simplify the code base by doing the materialization
immediately when the loader asks for it, because any caller
asking for the loader is going to need the content.
Change-Id: Id867b1004529744f234ab8f9cfab3d2c52ca3bd0
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
These were only used by PackWriter to help it filter object
representations. Their only user disappeared when we rewrote the
object selection code path to use the new representation type.
Change-Id: I9ed676bfe4f87fcf94aa21e53bda43115912e145
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>