Added support for converting DiffEntrys to FileHeaders. FileHeaders
are DiffEntrys with a buffer containing the diff output as well as
a list of HunkHeaders. The HunkHeaders contain EditLists. The
createFileHeader(DiffEntry) method in DiffFormatter performs a Myers
Diff on the files refered to by the DiffEntry, then puts the returned
EditList into a single HunkHeader, which is then put into the
FileHeader to be returned. It also generates the appropriate diff
header an puts it into the FileHeader's buffer. The rest of the diff
output, which would normally be parsed to generate the HunkHeaders,
is not generated. In fact, the purpose of this method is to avoid
the costly diff output generation and parsing normally required to
create a FileHeader.
Change-Id: I7d8b18c0f6c85e3d02ad58995d3d231e69af5887
ReadTreeTest was hardcoded to test WorkDirCheckout. Since we want
alternative checkout implementations (especially DirCacheCheckout)
this class has been refactored so that the tests can be reused
to test other implementations
The following changes have been done:
- abstract methods for checkout and prescanTwoTrees have been
introduced. Parameters are only the two trees. As index we
will implicitly use the current index of the repo.
- whenever tests needed a manipulated index before checkout
and prescanTwoTrees it was ensured that the correct index was
persisted (before we could use not-persisted instantiations of GitIndex
passed as parameters to checkout, prescanTwoTrees
- abstract methods for getting updated, conflicting, removed entries
resulting from the last checkout, prescanTwoTrees have been introduced
- an implementation for all these abstract methods using WorkDirCheckout
has been added
- method to assert a certain state of the index and the working tree has
been added
Signed-off-by: Christian Halstrick <christian.halstrick@sap.com>
Change-Id: Icf177cf8043487169a32ddd72b6f8f9246a433f7
We want to get rid of these APIs, because they don't perform as well
as DirCache/TreeWalk, or don't offer nearly as many features.
Bug: 319145
Change-Id: I2b28f9cddc36482e1ad42d53e86e9d6461ba3bfc
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
We need to validate the stream state after the InflaterInputStream
thinks the stream is done. Git expects a higher level of service from
the Inflater than the InflaterInputStream usually gives, we need to
ensure the embedded CRC is valid, and that there isn't trailing
garbage at the end of the file.
Change-Id: I1c9642a82dbd76b69e607dceccf8b85dc869a3c1
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Passing around the OutputStream and the Repository is crazy. Instead
put the stream in the constructor, since this formatter exists only to
output to the stream, and put the repository as a member variable that
can be optionally set.
Change-Id: I2bad012fee7f40dc1346700ebd19f1e048982878
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Content similarity based rename detection is performed only after
a linear time detection is performed using exact content match on
the ObjectIds. Any names which were paired up during that exact
match phase are excluded from the inexact similarity based rename,
which reduces the space that must be considered.
During rename detection two entries cannot be marked as a rename
if they are different types of files. This prevents a symlink from
being renamed to a regular file, even if their blob content appears
to be similar, or is identical.
Efficiently comparing two files is performed by building up two
hash indexes and hashing lines or short blocks from each file,
counting the number of bytes that each line or block represents.
Instead of using a standard java.util.HashMap, we use a custom
open hashing scheme similiar to what we use in ObjecIdSubclassMap.
This permits us to have a very light-weight hash, with very little
memory overhead per cell stored.
As we only need two ints per record in the map (line/block key and
number of bytes), we collapse them into a single long inside of
a long array, making very efficient use of available memory when
we create the index table. We only need object headers for the
index structure itself, and the index table, but not per-cell.
This offers a massive space savings over using java.util.HashMap.
The score calculation is done by approximating how many bytes are
the same between the two inputs (which for a delta would be how much
is copied from the base into the result). The score is derived by
dividing the approximate number of bytes in common into the length
of the larger of the two input files.
Right now the SimilarityIndex table should average about 1/2 full,
which means we waste about 50% of our memory on empty entries
after we are done indexing a file and sort the table's contents.
If memory becomes an issue we could discard the table and copy all
records over to a new array that is properly sized.
Building the index requires O(M + N log N) time, where M is the
size of the input file in bytes, and N is the number of unique
lines/blocks in the file. The N log N time constraint comes
from the sort of the index table that is necessary to perform
linear time matching against another SimilarityIndex created for
a different file.
To actually perform the rename detection, a SxD matrix is created,
placing the sources (aka deletions) along one dimension and the
destinations (aka additions) along the other. A simple O(S x D)
loop examines every cell in this matrix.
A SimilarityIndex is built along the row and reused for each
column compare along that row, avoiding the costly index rebuild
at the row level. A future improvement would be to load a smaller
square matrix into SimilarityIndexes and process everything in that
sub-matrix before discarding the column dimension and moving down
to the next sub-matrix block along that same grid of rows.
An optional ProgressMonitor is permitted to be passed in, allowing
applications to see the progress of the detector as it works through
the matrix cells. This provides some indication of current status
for very long running renames.
The default line/block hash function used by the SimilarityIndex
may not be optimal, and may produce too many collisions. It is
borrowed from RawText's hash, which is used to quickly skip out of
a longer equality test if two lines have different hash functions.
We may need to refine this hash in the future, in order to minimize
the number of collisions we get on common source files.
Based on a handful of test commits in JGit (especially my own
recent rename repository refactoring series), this rename detector
produces output that is very close to C Git. The content similarity
scores are sometimes off by 1%, which is most probably caused by
our SimilarityIndex type using a different hash function than C
Git uses when it computes the delta size between any two objects
in the rename matrix.
Bug: 318504
Change-Id: I11dff969e8a2e4cf252636d857d2113053bdd9dc
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
During code review, Alex raised a few comments about commit
532421d989 ("Refactor repository construction to builder class").
Due to the size of the related series we aren't going to go back
and rebase in something this minor, so resolve them as a follow-up
commit instead.
Change-Id: Ied52f7a8f7252743353c58d20bfc3ec498933e00
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
We default this to 1 MiB for now, but we allow users to modify
it through the Repository's configuration file to be a different
value. A new repository listener is used to identify when the
setting has been updated and trigger a reconfiguration of any
active ObjectReaders.
To prevent a horrible explosion we cap core.streamFileThreshold
at no more than 1/4 of the maximum JVM heap size. We do this
because we need at least 2 byte arrays equal in size to the
stream threshold for the worst case delta inflation scenario,
and our host application probably also needs some amount of the
heap for their working set size.
Change-Id: I103b3a541dc970bbf1a6d92917a12c5a1ee34d6c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Very large delta instruction streams, or deltas which use very large
base objects, are now streamed through as large objects rather than
being inflated into a byte array.
This isn't the most efficient way to access delta encoded content, as
we may need to rewind and reprocess the base object when there was a
block moved within the file, but it will at least prevent the JVM from
having its heap explode.
When streaming a delta we have an inflater open for each level in the
delta chain, to inflate the instruction set of the delta, as well as
an inflater for the base level object. The base object is buffered,
as is the top level delta requested by the application, but we do not
buffer the intermediate delta streams. This keeps memory usage lower,
so its closer to 1024 bytes per level in the chain, without having an
adverse impact on raw throughput as the top-level buffer gets pushed
down to the lowest stream that has the next region.
Delta instructions transparently collapse here, if the top level does
not copy a region from its base, the base won't materialize that part
from its own base, etc. This allows us to avoid copying around a lot
of segments which have been deleted from the final version.
Change-Id: I724d45245cebb4bad2deeae7b896fc55b2dd49b3
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Similar to the loose object support, whole packed objects can
now be streamed back to the caller. The streaming is less
efficient as we copy the data from the cached window array
into the InflaterInputStream's internal buffer, then inflate
it there before returning to the application.
Like with unpacked objects, there is plenty of room for some
optimization, especially for the copyTo method, where we don't
necessarily need so much buffering to exist.
Change-Id: Ie23be81289e37e24b91d17b0891e47b9da988008
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The class is identical, but ObjectLoader.SmallObject is part of our
public API for storage implementations to build on top of.
Change-Id: I381a3953b14870b6d3d74a9c295769ace78869dc
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Big loose objects can now be streamed if they are over the large
object size threshold. This prevents the JVM heap from exploding
with a very large byte array to hold the slurped file, and then
again with its uncompressed copy.
We may have slightly slowed down the simple case for small
loose objects, as the loader no longer slurps the entire thing
and decompresses in memory. To try and keep good performance
for the very common small objects that are below 8 KiB in size,
buffers are set to 8 KiB, causing the reader to slurp most of the
file anyway. However the data has to be copied at least once,
from the BufferedInputStream into the InflaterInputStream.
New unit tests are supplied to get nearly 100% code coverage on the
unpacked code paths, for both standard and pack style loose objects.
We tested a fair chunk of the code elsewhere, but these new tests
are better isolated to the specific branches in the code path.
Change-Id: I87b764ab1b84225e9b5619a2a55fd8eaa640e1fe
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
JGit does not currently do rename detection during diffs. I added
a class that, given a TreeWalk to iterate over, can output a list
of DiffEntry's for that TreeWalk, taking into account renames. This
class only detects renames by SHA1's. More complex rename detection,
along the lines of what C Git does will be added later.
Change-Id: I93606ce15da70df6660651ec322ea50718dd7c04
Instead of creating the DirCache from a static factory method, use
an instance method on Repository, permitting the implementation to
override the method with a completely different type of DirCache
reading and writing. This would better support a repository in the
cloud strategy, or even just an in-memory unit test environment.
Change-Id: I6399894b12d6480c4b3ac84d10775dfd1b8d13e7
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Using a custom exception type makes it easire for an application
developer to understand why an exception was thrown out of a method
we declare. To remain compatiable with existing callers, we still
extend off IllegalStateException.
Change-Id: Ideeef2399b11ca460a2dbb3cd80eb76aa0a025ba
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
We don't actually need a Repository object here, just an ObjectReader
that can load content for us. So change the API to depend on that.
However, this breaks the asCommit and asTag legacy translation methods
on RevCommit and RevTag, so we still have to keep the Repository
inside of RevWalk for those two types. Hopefully we can drop those in
the future, and then drop the Repository off the RevWalk.
Change-Id: Iba983e48b663790061c43ae9ffbb77dfe6f4818e
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Added code to support ignoring leading, trailing, and changed
whitespace when performing a diff operation. I also added command
line options to Diff to enable the various whitespace ignoring
methods. These match the flags for git diff.
Change-Id: Ie56301aafad59ee3f0fe5de62719f5023cd702c8
We drop the "Object" suffix, because its pretty clear here that
we want to open an object, given that we pass in AnyObjectId as
the main parameter. We also fix the calling convention to throw
a MissingObjectException or IncorrectObjectTypeException, so that
callers don't have to do this error checking themselves.
Change-Id: I72c43353cea8372278b032f5086d52082c1eee39
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Rather than taking the ProgressMonitor objects in our constructor and
carrying them around as instance fields, take them as arguments to the
actual time consuming operations we need to run.
Change-Id: I2b230d07e277de029b1061c807e67de5428cc1c4
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
JGit did not have support for skipping whitespace when comparing
lines in RawText objects. I added a subclass of RawText that skips
whitespace in its equals and hashCode methods. I used a subclass
rather than adding functionality into RawText so that performance
would not be impacted by extra logic.
This class only supports ignoring all whitespace. Others will follow
that allow other forms of whitespace ignoring.
Change-Id: Ic2f79e85215e48d3fd53ec1b4ad13373dd183a4a
The ObjectReader API demands that we release the reader when we are
done with it. PackWriter contains a reader, which it uses for the
entire packing session. Expose the release of the reader through
a release method on the writer.
This still doesn't address the RevWalk and TreeWalk users, who
don't correctly release their reader. But its a small step in the
right direction.
Change-Id: I5cb0b5c1b432434a799fceb21b86479e09b84a0a
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Similar to what we did with the file code, move the pack writer
into its own package so the related classes and their package
private methods are hidden from the rest of the library.
Change-Id: Ic1b5c7c8c8d266e90c910d8d68dfc8e93586854f
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
We no longer need an ObjectLoader to be lazy and try to delay
the materialization of the object content. That was done only
to support PackWriter searching for a good reuse candidate.
Instead, simplify the code base by doing the materialization
immediately when the loader asks for it, because any caller
asking for the loader is going to need the content.
Change-Id: Id867b1004529744f234ab8f9cfab3d2c52ca3bd0
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
These were only used by PackWriter to help it filter object
representations. Their only user disappeared when we rewrote the
object selection code path to use the new representation type.
Change-Id: I9ed676bfe4f87fcf94aa21e53bda43115912e145
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This move isolates all of the local file specific implementation code
into a single package, where their package-private methods and support
classes are properly hidden away from the rest of the core library.
Because of the sheer number of files impacted, I have limited this
change to only the renames and the updated imports.
Change-Id: Icca4884e1a418f83f8b617d0c4c78b73d8a4bd17
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Only the ObjectDirectory type of database knows where to find the
objects directory on the local filesystem, so defer to it whenever
we need to know where the objects reside. Since this is the type
returned by FileRepository's getObjectDatabase() method, we mostly
don't have to do much other than use a slightly longer invocation.
Change-Id: Ie5f58132a6411b56c3acad73646ad169d78a0654
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This better matches with the name used in the environment
(GIT_WORK_TREE), in the configuration file (core.worktree),
and in our builder object.
Since we are already breaking a good chunk of other code
related to repository access, and this fairly easy to fix
in an application's code base, I'm not going to offer the
wrapper getWorkDir() method.
Change-Id: Ib698ba4bbc213c48114f342378cecfe377e37bb7
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The new FileRepositoryBuilder class helps applications to construct
a properly configured FileRepository, with properties assumed based
upon the standard Git rules for the local filesystem.
To better support simple command line applications, environment
variable handling and repository searching was moved into this
builder class.
The change gets rid of the ever-growing FileRepository constructor
variants, and the multitude of java.io.File typed parameters,
by using simple named setter methods.
Change-Id: I17e8e0392ad1dbf6a90a7eb49a6d809388d27e4c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Not every type of Repository will be able to map an ObjectId into
a local file system path that stores that object's file contents.
Heck, its not even true for the FileRepository, as an object can
be stored in a pack file and not in its loose format.
Remove this from our public API, it was a mistake to publish it.
Change-Id: I20d1b8c39104023936e6d46a5b0d7ef39ff118e8
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The WindowCache is an implementation detail of PackFile and how its
used by ObjectDirectory. Lets start to hide it and replace the public
API with a more generic concept, ObjectReader.
Because PackedObjectLoader is also considered a private detail of
PackFile, we have to make PackWriter temporarily dependent upon the
WindowCursor and thus FileRepository and ObjectDirectory in order to
just start the refactoring. In later changes we will clean up the
APIs more, exposing sufficient support to PackWriter without needing
the file specific implementation details.
Change-Id: I676be12b57f3534f1285854ee5de1aa483895398
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Some newer style APIs are updated to use the newer ObjectInserter
interface instead of the now deprecated ObjectWriter. In many of
the unit tests we don't bother to release the inserter, these are
typically using the file backend which doesn't need a release,
but in the future should use an in-memory HashMap based store,
which really wouldn't need it either.
Change-Id: I91a15e1dc42da68e6715397814e30fbd87fa2e73
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The ObjectInserter API permits ObjectDatabase implementations to
control their own object insertion behavior, rather than forcing
it to always be a new loose file created in the local filesystem.
Inserted objects can also be queued and written asynchronously to
the main application, such as by appending into a pack file that
is later closed and added to the repository.
This change also starts to open the door to non-file based object
storage, such as an in-memory HashMap for unit testing, or a more
complex system built on top of a distributed hash table.
To help existing application code port to the newer interface we
are keeping ObjectWriter as a delegation wrapper to the new API.
Each ObjectWriter instances holds a reference to an ObjectInserter
for the Repository's top-level ObjectDatabase, and it flushes and
releases that instance on each object processed.
Change-Id: I413224fb95563e7330c82748deb0aada4e0d6ace
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
When the surrounding code is already heavily based upon the
assumption that we have a FileRepository (e.g. because it
created that type of repository) keep the type around and
use it directly. This permits us to continue to do things
like save the configuration file.
Change-Id: Ib783f0f6a11acd6aa305c16d61ccc368b46beecc
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
To support other storage models other than just the local filesystem,
we split the Repository class into a nearly abstract interface and
then create a concrete subclass called FileRepository with the file
based IO implementation.
We are using an abstract class for Repository rather than the much
more generic interface, as implementers will want to inherit a large
array of utility functions, such as resolve(String). Having these in
a base class makes it easy to inherit them.
This isn't the final home for lib.FileRepository. Future changes
will rename it into storage.file.FileRepository, but to do that we
need to also move a number of other related class, which we aren't
quite ready to do.
Change-Id: I1bd54ea0500337799a8e792874c272eb14d555f7
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Rather than relying on the helpers in RepositoryConfig to get
these objects, obtain them directly through the Config API.
Its only slightly more verbose, but permits us to work with the
base Config class, which is more flexible than the highly file
specific RepositoryConfig.
This is what I really meant to do when I added the section parser
and caching support to Config, we just failed to finish updating
all of the call sites.
Change-Id: I481cb365aa00bfa8c21e5ad0cd367ddd9c6c0edd
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
We don't have to assume/depend on RepositoryConfig here, these
two tests can use higher level versions of the class and still
come up with the same test. That frees us up to do some changes
to the RepositoryConfig API.
Change-Id: Ia7b263c8c5efa3fae1054416d39c546867288132
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Long ago we stopped supporting the core.legacyHeaders variable,
as JGit (like C Git) stopped creating the new pack-style loose
objects, rendering this variable pointless. The test is still
valid, it proves we write the standard loose object format for
a commit, but the variable assignment has no impact on the test
so drop it from the code.
Change-Id: I051336ada23033c05e86bbff73ae5d78a37b1640
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This stream was used only to determine how many bytes had been
written thus far. Except we're always dumping it into a simple
ByteArrayOutputStream, which also knows that. Drop the dependency
on the pack stream and use ByteArrayOutputStream directly.
This lets us later move this test into the new storage.file
package without dragging along the pack stream that is an internal
implementation detail of PackWriter, which is more general than
just the file storage layer.
Change-Id: I291689c0b1ed799270c213ee73b710b2637fb238
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Setting this value is pointless, because its automatically set
by the refs.newUpdate call that created the update operation.
The API is protected by default, because application level code,
including this test, should not be calling it.
Change-Id: I8867a4e8007892e2bd44a05d7dec619081081943
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The mapCommit API is being deprecated because it doesn't run very
fast. Leaving tests around to test how fast it is relative to C Git
isn't instructive. Remove them, which should help aid the transition
away from the mapCommit API.
Change-Id: I27e1c844610d7da5b2c44b33a00602706973c9cc
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Some sources had dos line endings. Also configure all projects to use
unix line endings and UTF-8 text encoding.
Change-Id: I8fc9a1dbb219ffa91d1b3011b3b11b7e48e74ca7
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
If a repository is "bare", it currently still returns a working directory.
This conflicts with the specification of "bare"-ness.
Bug: 311902
Change-Id: Ib54b31ddc80b9032e6e7bf013948bb83e12cfd88
Signed-off-by: Mathias Kinzler <mathias.kinzler@sap.com>
Currently, there is no way to read the content
of the Git Configuration in a way that would
allow to list all configured values generically.
This change extends the Config class in such a
way as to being able to get a list of sections and
to get a list of names for any given section or
subsection.
This is required in able to implement proper
configuration handling in EGit (show all the
content of a given configuration similar to
"git config -l").
Change-Id: Idd4bc47be18ed0e36b11be8c23c9c707159dc830
Signed-off-by: Mathias Kinzler <mathias.kinzler@sap.com>
Created wrong tags for 0.8.3 hence creating another version.
Change-Id: I4e00bbcffe1cf872e2d7e3f3d88d068701fb5330
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
On Windows, FS_Win32_Cygwin has been used if a Cygwin Git installation
is present in the PATH. Assuming that the user works with the Cygwin
Git installation may result in unnecessary overhead if he actually
does not.
Applications built on top of jgit may have more knowledge on the
actually used Git client (Cygwin or not) and hence should be able to
configure which FS to use accordingly.
Change-Id: Ifc4278078b298781d55cf5421e9647a21fa5db24
A Change-Id helps tools like Gerrit Code Review to keeps different
versions of a patch together. The Change-Id is computed as a SHA-1
hash of some of the same basic information as a commit id on the first
commit intended to solve a particular problem and then reused for
updated solutions.
Change-Id: I04334f84e76e83a4185283cb72ea0308b1cb4182
Signed-off-by: Robin Rosenberg <robin.rosenberg@dewire.com>
ReadTreeTest contains a lot of useful tests for "checkout"
implementations. But ReadTreeTest was hardcoded to test only
WorkDirCheckout. This change doesn't add/modify any tests semantically
but refactors ReadTreeTest so that a different implementations of
checkout can be tested. This was done to allow DirCacheCheckout to be
tested without rewriting all these tests.
Change-Id: I36e34264482b855ed22c9dde98824f573cf8ae22
Signed-off-by: Christian Halstrick <christian.halstrick@sap.com>
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>