HistogramDiff outperforms it for any case where PatienceDiff needs to
fallback to another algorithm. Consequently it's not worth keeping
around, because we would always want a fallback enabled.
Change-Id: I39b99cb1db4b3be74a764dd3d68cd4c9ecd91481
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Its behavior is similar to PatienceDiff, and runs nearly as fast,
often beating the performance of MyersDiff.
Change-Id: I43c3faefa8109f1a68ef57522bec9cf27b5df252
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Large objects stored as deltas get unpacked by JGit into a loose
object, so they are cheaper to access later on. This unpacking was
broken because TeeInputStream copied the wrong length into the loose
object, sometimes copying too many bytes into the result. This
created a loose object that did not have the correct content, and
whose length did not match the length denoted in the object header.
Change-Id: I3ce1fd9f3dc5bd195249c7872b3bec49570424a2
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
When passing to a fallback algorithm, we can avoid creating a new copy
of the hash codes for each sequence by passing in the hashed sequences
directly. This makes it cheaper to switch from HistogramDiff down to
MyersDiff in a single pass.
Change-Id: Ibf2e81be57c083862eeb134279aed676653bf9b5
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
There is a corner case where we get an EMPTY region during recursion,
but we didn't expect to receive that. Its harmless to ignore the
region since the region is empty and has no content, so do so rather
than throwing an exception
Change-Id: I50dcec81ecba763072bb739adfab5879fb48b23a
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Certain inputs caused an infinite loop because the prior match data
couldn't be used as expected. Rather than incrementing the match
pointer before looking at an element, do it after, so the loop breaks
when we wrap around to the starting point.
Change-Id: Ieab28bb3485a914eeddc68aa38c256f255dd778c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The need for branching becomes more pressing with pull
support: we need to make sure the upstream configuration entries
are written correctly when creating and renaming branches
(and of course are cleaned up when deleting them).
This adds support for listing, adding, deleting and renaming
branches including the more common options.
Bug: 326938
Change-Id: I00bcc19476e835d6fd78fd188acde64946c1505c
Signed-off-by: Mathias Kinzler <mathias.kinzler@sap.com>
Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
In bug 323571 it is mentioned that if you call
'toURI().toURL().toString()' on a java.io.File you cannot pass
that string to jgit as an URIish. Problem is that the passed
URI looks like 'file:/C:/a/b.txt' and that we where expecting
double slashes after scheme':'. This fix adds support for this
single-slash file URLs.
Bug: 323571
Change-Id: I866a76a4fcd0c3b58e0d26a104fc4564e7ba5999
Signed-off-by: Christian Halstrick <christian.halstrick@sap.com>
This is the minimal implementation of a "Pull" command. It does not
have any parameters besides the generic progress monitor and timeout.
It works on the currently checked-out branch and assumes that the
configuration contains the keys "branch.<branch name>.remote" and
"branch.<branch name>.merge" to determine the remote configuration
for the fetch and the remote branch name for the merge.
Bug: 303404
Change-Id: I7fe09029996d0cfc09a7d8f097b5d6af1488fa93
Signed-off-by: Mathias Kinzler <mathias.kinzler@sap.com>
Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
There where quite some bugs regarding wrong URI parsing. In order
to solve them the parsing has to be refactored. We now have
specialized regexps for 'scheme://host/...', scp URIs and local
file names. Now we can detect problems while parsing 'git://host:/abc' which
was previously not possible.
Bug: 315571
Bug: 292897
Bug: 307017
Bug: 323571
Bug: 317388
Change-Id: If72576576ebb6b9d9dc8b7e51ddd87c9909e8b62
Signed-off-by: Christian Halstrick <christian.halstrick@sap.com>
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
The regular expression which should handle the
user/password part in an URI was potentially
processing too many chars. This led to problems
when user/pwd and port was specified
Change-Id: I87db02494c4b367283e1d00437b1c06d2c8fdd28
Signed-off-by: Christian Halstrick <christian.halstrick@sap.com>
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
The regular expressions used to parse URI's are constructed by
concatenating different segments to a big String. Introduce
String constants for these segements and document them.
Change-Id: If8b9dbaaf57ca333ac0b6c9610c3d3a515c540f9
Signed-off-by: Christian Halstrick <christian.halstrick@sap.com>
Some strings were not externalized. Also use them in HTTP tests to
ensure that they will also succeed when message bundles are
translated.
Change-Id: Id02717176557e7d57e676e1339cd89f2be88d330
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
The strings used to construct the regex to parse
URIs are split differently. This makes it easier
to introduce meaningful String constants later on.
Change-Id: I9355fd42e57e0983204465c5d6fe5b6b93655074
Signed-off-by: Christian Halstrick <christian.halstrick@sap.com>
Projects like org.eclipse.mdt contain large XML files about 6 MiB
in size. So does the Android project platform/frameworks/base.
Doing a clone of either project with JGit takes forever to checkout
the files into the working directory, because delta decompression
tends to be very expensive as we need to constantly reposition the
base stream for each copy instruction. This can be made worse by
a very bad ordering of offsets, possibly due to an XML editor that
doesn't preserve the order of elements in the file very well.
Increasing the threshold to the same limit PackWriter uses when
doing delta compression (50 MiB) permits a default configured
JGit to decompress these XML file objects using the faster
random-access arrays, rather than re-seeking through an inflate
stream, significantly reducing checkout time after a clone.
Since this new limit may be dangerously close to the JVM maximum
heap size, every allocation attempt is now wrapped in a try/catch
so that JGit can degrade by switching to the large object stream
mode when the allocation is refused. It will run slower, but the
operation will still complete.
The large stream mode will run very well for big objects that aren't
delta compressed, and is acceptable for delta compressed objects that
are using only forward referencing copy instructions. Copies using
prior offsets are still going to be horrible, and there is nothing
we can do about it except increase core.streamFileThreshold.
We might in the future want to consider changing the way the delta
generators work in JGit and native C Git to avoid prior offsets once
an object reaches a certain size, even if that causes the delta
instruction stream to be slightly larger. Unfortunately native
C Git won't want to do that until its also able to stream objects
rather than malloc them as contiguous blocks.
Change-Id: Ief7a3896afce15073e80d3691bed90c6a3897307
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
Natively support the HTTP basic and digest authentication methods
by setting the Authorization header without going through the JREs
java.net.Authenticator API. The Authenticator API is difficult to
work with in a multi-threaded server environment, where its using
a singleton for the entire JVM. Instead compute the Authorization
header from the URIish user and pass, if available.
Change-Id: Ibf83fea57cfb17964020d6aeb3363982be944f87
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
As findbugs pointed out, there was a small risk for creating multiple instances of
translation bundles. If that happens, drop the second instance.
Change-Id: I3aacda86251d511f6bbc2ed7481d561449ce3b6c
Signed-off-by: Robin Rosenberg <robin.rosenberg@dewire.com>
HistogramDiff is an alternative implementation of patience diff,
performing a search over all matching locations and picking the
longest common subsequence that has the lowest occurrence count.
If there are unique common elements, its behavior is identical to
that of patience diff.
Actual performance on real-world source files usually beats
MyersDiff, sometimes by a factor of 3, especially for complex
comparators that ignore whitespace.
Change-Id: I1806cd708087e36d144fb824a0e5ab7cdd579d73
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Pass through the addAll request to our underlying ArrayList.
This way the underlying ArrayList grows no more than once during the
call, which may be important if the list was originally allocated
at the default size of 16, but 64 Edits are being added.
Change-Id: I31c3261e895766f82c3c832b251a09f6e37e8860
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
FetchCommand was missing the ability to set dry run and thin
preferences on the transport operation.
Change-Id: I0bef388a9b8f2e3a01ecc9e7782aaed7f9ac82ce
Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
Optional inCore parameter to Resolver/Strategy will
instruct it to perform all the operations in memory
and avoid modifying working folder even if there is one.
Change-Id: I5b873dead3682f79110f58d7806e43f50bcc5045
The hash code returned by RawTextComparator (or that is used
by the SimilarityIndex) play an important role in the speed of
any algorithm that is based upon them. The lower the number of
collisions produced by the hash function, the shorter the hash
chains within hash tables will be, and the less likely we are to
fall into O(N^2) runtime behaviors for algorithms like PatienceDiff.
Our prior hash function was absolutely horrid, so replace it with
the proper definition of the DJB hash that was originally published
by Professor Daniel J. Bernstein.
To support this assertion, below is a table listing the maximum
number of collisions that result when hashing the unique lines in
each source code file of 3 randomly chosen projects:
test_jgit: 931 files; 122 avg. unique lines/file
Algorithm | Collisions
-------------+-----------
prior_hash 418
djb 5
sha1 6
string_hash31 11
test_linux26: 30198 files; 258 avg. unique lines/file
Algorithm | Collisions
-------------+-----------
prior_hash 8675
djb 32
sha1 8
string_hash31 32
test_frameworks_base: 8381 files; 184 avg. unique lines/file
Algorithm | Collisions
-------------+-----------
prior_hash 4615
djb 10
sha1 6
string_hash31 13
We can clearly see that prior_hash performed very poorly, resulting
in 8,675 collisions (elements in the same hash bucket) for at least
one file in the Linux kernel repository. This leads to some very
bad O(N) style insertion and lookup performance, even though the
hash table was sized to be the next power-of-2 larger than the
total number of unique lines in the file.
The djb hash we are replacing prior_hash with performs closer to
SHA-1 in terms of having very few collisions. This indicates it
provides a reasonably distributed output for this type of input,
despite being a much simpler algorithm (and therefore will be much
faster to execute).
The string_hash31 function is provided just to compare results with,
it is the algorithm commonly used by java.lang.String hashCode().
However, life isn't quite this simple.
djb produces a 32 bit hash code, but our hash tables are always
smaller than 2^32 buckets. Mashing the 32 bit code into an array
index used to be done by simply taking the lower bits of the hash
code by a bitwise and operator. This unfortuntely still produces
many collisions, e.g. 32 on the linux-2.6 repository files.
From [1] we can apply a final "cleanup" step to the hash code to
mix the bits together a little better, and give priority to the
higher order bits as they include data from more bytes of input:
test_jgit: 931 files; 122 avg. unique lines/file
Algorithm | Collisions
-------------+-----------
prior_hash 418
djb 5
djb + cleanup 6
test_linux26: 30198 files; 258 avg. unique lines/file
Algorithm | Collisions
-------------+-----------
prior_hash 8675
djb 32
djb + cleanup 7
test_frameworks_base: 8381 files; 184 avg. unique lines/file
Algorithm | Collisions
-------------+-----------
prior_hash 4615
djb 10
djb + cleanup 7
This is a massive improvement, as the number of collisions for
common inputs drops to acceptable levels, and we haven't really
made the hash functions any more complex than they were before.
[1] http://lkml.org/lkml/2009/10/27/404
Change-Id: Ia753b695de9526a157ddba265824240bd05dead1
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
As it turns out, every single diff algorithm we might try to
implement can benfit from using the SequenceComparator's native
concept of the simple reduceCommonStartEnd() step. For most inputs,
there can be a significant number of elements that can be removed
from the space the DiffAlgorithm needs to consider, which will
reduce the overall running time for the final solution.
Pool this logic inside of DiffAlgorithm itself as a default, but
permit a specific algorithm to override it when necessary.
Convert MyersDiff to use this reduction to reduce the space it
needs to search, making it perform slightly better on common inputs.
Change-Id: I14004d771117e4a4ab2a02cace8deaeda9814bc1
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PatienceDiff always uses a HashedSequence, which promises to provide
constant time access for hash codes during the equals method and
aborts fast if the hash codes don't match. Therefore we don't need
to cache the hash codes inside of the index, saving us memory.
Change-Id: I80bf1e95094b7670e6c0acc26546364a1012d60e
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Most diff implementations really want to use cached hash codes for
elements, rather than element equality, as they need to perform many
compares and unique hash codes for elements can really speed that
process up.
To make it easier to define element hash functions, move the caching
of hash codes into a wrapper sequence type, so that individual
sequence types like RawText don't need to do this themselves. This
has a nice property of also allowing the sequence to no longer care
about the specific SequenceComparator that is going to be used, and
permits the caching to only examine the middle region that isn't
common to the two inputs.
Change-Id: If8623556da9419117b07c5073e8bce39de02570e
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This is a faster exact match based form that tries to improve
performance for the common case of the header and trailer of
a text file not changing at all. After this fast path we use
the slower path based on the super class' using equals() to
allow for whitespace ignore modes to still work.
Some simple performance testing showed a major improvement over the
older implementation for a common edit we see in JGit. The test
compared blob 29a89bc and 372a978, which is the ObjectDirectory.java
file difference in commit 41dd9ed1c0.
The two text files are approximately 22 KiB in size.
DEFAULT old 203900 ns
DEFAULT new 100400 ns
This new version is 2x faster for the DEFAULT comparator, which does
not treat space specially. This is because we can now examine a
larger swath of text with fewer instructions per byte compared. The
older algorithm had to stop at each line break and recompute how to
examine the next line, while the new algorithm only stops when the
first difference is found.
WS_IGNORE_ALL old 298500 ns
WS_IGNORE_ALL new 63300 ns
Its 4.7x faster for the whitespace ignore comparator, as the common
header and footer do not have a whitespace difference. Avoiding the
special case handling for whitespace on each byte considered saves a
lot of time.
Since most edits to source code (and other text like files) appears in
the interior of the file, fast elimination of common header/footer
means faster diff throughput. In the less common case of an actual
header or footer edit, the common header/footer elimination is stopped
rather quickly either way, so there is very little downside to the
optimiation applied here.
Change-Id: I1d501b4c3ff80ed086b20bf12faf51ae62167db7
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
DiffAlgorithm implementations may find it useful to construct an Edit
and use that to later subsequence the two base sequences, so define
two new utility methods a() and b() to construct the A and B ranges.
Once a subsequence has had Edits created for it the indexes are
within the space of the subsequence. These must be shifted back to
the original base sequence's indexes. Define toBase() as a utility
method to perform that shifting work in-place, so DiffAlgorithm
implementations have an efficient way to convert back to the caller's
original space.
Change-Id: I8d788e4d158b9f466fa9cb4a40865fb806376aee
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Adds API for performing git fetch operations.
Change-Id: Idd95664fd4e3bca03211e4ffda3e354849f92a35
Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
If a thin pack has a large delta we need to be able to open
its cached copy from the loose object directory through the
CachedObjectDatabase handle. Unfortunately that did not support the
openObject2 method, which the LargePackedDeltaObject used directly
to bypass looking at the pack files.
Bug: 324868
Change-Id: I1d5886a6c3254c6dea2852d50b8614c31a93e615
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
When running IndexPack we use a CachedObjectDirectory, which
knows what objects are loose and tries to avoid stat(2) calls for
objects that do not exist in the repository, as stat(2) on Win32
is very slow.
However large delta objects found in a pack file are expanded into
a loose object, in order to avoid costly delta chain processing
when that object is used as a base for another delta.
If this expand occurs while working with the CachedObjectDirectory,
we need to update the cached directory data to include this new
object, otherwise it won't be available when we try to open it
during the object verify phase.
Bug: 324868
Change-Id: Idf0c76d4849d69aa415ead32e46a435622395d68
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
When creating a new FileRepository, probe the capability of the
local filesystem and set core.filemode based on how it reacts.
We can't just rely on FS.supportsExecute() because a POSIX system
(which usually does support execute) might be storing the repository
on a partition that doesn't have execute support (e.g. plain FAT-32).
Creating a temporary file, setting both states, checking we get
the desired results will let us set the variable correctly on
all systems.
Change-Id: I551488ea8d352d2179c7b244f474d2e3d02567a2
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
In PlotCommitList.enter() commits are positioned on lanes for visual
presentation. This implementation was buggy: commits without
children (often the starting points for the RevWalk) are not positioned
on separate lanes.
The problem was that when handling commits with multiple children
(that's where branches fork out) it was not handled that some of the
children may not have been positioned on a lane yet. I fixed that and
added a number of tests which specifically test the layout of commits
on lanes.
Bug: 300282
Bug: 320263
Change-Id: I267b97ecccb5251cec54cec90207e075ab50503e
Signed-off-by: Christian Halstrick <christian.halstrick@sap.com>
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
A diff algorithm may find this type useful if it wants to delegate a
particular range of elements to another algorithm, without changing
the underlying sequence types.
Change-Id: I4544467781233e21ac8b35081304b2bad7db00f6
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This makes it easier to parametrize DiffFormatter with a different
implementation, as we later plan to add PatienceDiff to JGit.
Change-Id: Id35ef478d5fa20fe10a1ba297f9436fd7adde9ce
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
git allows remotes to be relative paths, but the regex
validating urls wouldn't accept anything starting with "..".
Other functionality works fine with these paths.
Bug: 311300
Change-Id: Ib74de0450a1c602b22884e19d994ce2f52634c77
Instead of spooling large delta bases into temporary files and then
immediately deleting them afterwards, spool the large delta out to
a normal loose object. Later any requests for that large delta can
be answered by reading from the loose object, which is much easier
to stream efficiently for readers.
Since the object is now duplicated, once in the pack as a delta and
again as a loose object, any future prune-packed will automatically
delete the loose object variant, releasing the wasted disk space.
As prune-packed is run automatically during either repack or gc, and
gc --auto triggers automatically based on the number of loose objects,
we get automatic cache management for free. Large objects that were
unpacked will be periodically cleared out, and will simply be restored
later if they are needed again.
After a short offline discussion with Junio Hamano today, we may want
to propose a change to prune-packed to hold onto larger loose objects
which also exist in pack files as deltas, if the loose object was
recently accessed or modified in the last 2 days.
Change-Id: I3668a3967c807010f48cd69f994dcbaaf582337c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Recently created objects are usually what branches point to, and
are usually written out as loose objects. But due to the high cost
of asking the operating system if a file exists, these are the last
thing that ObjectDirectory examines when looking for an object by
its ObjectId.
Caching recently seen loose objects permits the opening code to
jump directly to the loose object, accelerating lookup for branch
heads that are accessed often.
To avoid exploding the cache its limited to approximately 2048
entries. When more ids are added, the table is simply cleared
and reset in size.
Change-Id: I18f483217412b102f754ffd496c87061d592e535
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This class is used only to cache the unpacked form of an object that
was used as a base for another object. The theory goes that if an
object is used as a delta base for A, it will probably also be a
delta base for B, C, D, E, etc. and therefore having an unpacked copy
of it on hand will make delta resolution for the others very fast.
However since objects are usually only accessed once, we don't want
to cache everything we unpack, just things that we are likely to
need again. The only things we need again are the delta bases.
Hence, its a delta base cache.
This gets us the class name UnpackedObjectCache back, so we can
use it to actually create a cache of unpacked object information.
Change-Id: I121f356cf4eca7b80126497264eac22bd5825a1d
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>