The FollowFilter can be installed on a RevWalk to cause the path
to be updated through rename detection when the affected file is
found to be added to the project.
The filter works reasonably well, for example we can follow the
history of the fsck command in git-core:
$ jgit log --name-status --follow builtin/fsck.c | grep ^R
R100 builtin-fsck.c builtin/fsck.c
R099 fsck.c builtin-fsck.c
R099 fsck-objects.c fsck.c
R099 fsck-cache.c fsck-objects.c
Change-Id: I4017bcfd150126aa342fdd423a688493ca660a1f
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This way we don't have to reparse for the rename limit every time
we create a new rename detector for a repository.
Change-Id: I669d031690b85ef4da5e39189be7173fb773fc56
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Similar to what we did with diff, implement whitespace ignore options
for log too. This requires us to define some means of creating any
RawText object type at will inside of DiffFormatter, so we define a
new factory interface to construct RawText instances on demand.
Unfortunately we have to copy the entire block of common options.
args4j only processes the options/arguments on the one command class
and Java doesn't support multiple inheritance.
Change-Id: Ia16cd3a11b850fffae9fbe7b721d7e43f1d0e8a5
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Instead of crashing, output a submodule link with the simple
"Subproject commit $fullid\n" syntax used by C Git.
Change-Id: Iae8646941683fb19b73fb038217d2e3bf5f77fa9
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Passing around the OutputStream and the Repository is crazy. Instead
put the stream in the constructor, since this formatter exists only to
output to the stream, and put the repository as a member variable that
can be optionally set.
Change-Id: I2bad012fee7f40dc1346700ebd19f1e048982878
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Implement rename detection in the command line diff and log commands.
Also support --name-status, -p and -U flags, as these can be quite
useful to view more detail.
All of the Git patch file formatting code is now moved over to the
DiffFormatter class. This permits us to reuse it in any context,
including inside of IDEs.
Change-Id: I687ccba34e18105a07e0a439d2181c323209d96c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Content similarity based rename detection is performed only after
a linear time detection is performed using exact content match on
the ObjectIds. Any names which were paired up during that exact
match phase are excluded from the inexact similarity based rename,
which reduces the space that must be considered.
During rename detection two entries cannot be marked as a rename
if they are different types of files. This prevents a symlink from
being renamed to a regular file, even if their blob content appears
to be similar, or is identical.
Efficiently comparing two files is performed by building up two
hash indexes and hashing lines or short blocks from each file,
counting the number of bytes that each line or block represents.
Instead of using a standard java.util.HashMap, we use a custom
open hashing scheme similiar to what we use in ObjecIdSubclassMap.
This permits us to have a very light-weight hash, with very little
memory overhead per cell stored.
As we only need two ints per record in the map (line/block key and
number of bytes), we collapse them into a single long inside of
a long array, making very efficient use of available memory when
we create the index table. We only need object headers for the
index structure itself, and the index table, but not per-cell.
This offers a massive space savings over using java.util.HashMap.
The score calculation is done by approximating how many bytes are
the same between the two inputs (which for a delta would be how much
is copied from the base into the result). The score is derived by
dividing the approximate number of bytes in common into the length
of the larger of the two input files.
Right now the SimilarityIndex table should average about 1/2 full,
which means we waste about 50% of our memory on empty entries
after we are done indexing a file and sort the table's contents.
If memory becomes an issue we could discard the table and copy all
records over to a new array that is properly sized.
Building the index requires O(M + N log N) time, where M is the
size of the input file in bytes, and N is the number of unique
lines/blocks in the file. The N log N time constraint comes
from the sort of the index table that is necessary to perform
linear time matching against another SimilarityIndex created for
a different file.
To actually perform the rename detection, a SxD matrix is created,
placing the sources (aka deletions) along one dimension and the
destinations (aka additions) along the other. A simple O(S x D)
loop examines every cell in this matrix.
A SimilarityIndex is built along the row and reused for each
column compare along that row, avoiding the costly index rebuild
at the row level. A future improvement would be to load a smaller
square matrix into SimilarityIndexes and process everything in that
sub-matrix before discarding the column dimension and moving down
to the next sub-matrix block along that same grid of rows.
An optional ProgressMonitor is permitted to be passed in, allowing
applications to see the progress of the detector as it works through
the matrix cells. This provides some indication of current status
for very long running renames.
The default line/block hash function used by the SimilarityIndex
may not be optimal, and may produce too many collisions. It is
borrowed from RawText's hash, which is used to quickly skip out of
a longer equality test if two lines have different hash functions.
We may need to refine this hash in the future, in order to minimize
the number of collisions we get on common source files.
Based on a handful of test commits in JGit (especially my own
recent rename repository refactoring series), this rename detector
produces output that is very close to C Git. The content similarity
scores are sometimes off by 1%, which is most probably caused by
our SimilarityIndex type using a different hash function than C
Git uses when it computes the delta size between any two objects
in the rename matrix.
Bug: 318504
Change-Id: I11dff969e8a2e4cf252636d857d2113053bdd9dc
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
JGit does not currently do rename detection during diffs. I added
a class that, given a TreeWalk to iterate over, can output a list
of DiffEntry's for that TreeWalk, taking into account renames. This
class only detects renames by SHA1's. More complex rename detection,
along the lines of what C Git does will be added later.
Change-Id: I93606ce15da70df6660651ec322ea50718dd7c04
Refactored a superclass out of FileHeader called DiffEntry that holds
the more general data from FileHeader that is useful in rename
detection (old/new Ids, modes, names, as well as changeType and
score). FileHeader is now a DiffEntry that adds Hunks, parsing
abilities, etc.
Change-Id: I8398728cd218f8c6e98f7a4a7f2f342391d865e4
It is possible that StreamCopyThread will not flush everything
from it's src to it's dst. In most cases StreamCopyThread works
like this:
in loop:
n = src.read(buf);
dst.write(buf, 0, n);
and when we want to flush, we interrupt() StreamCopyThread and it
flushes everything it wrote to dst.
The problem is that our interrupt() could interrupt reading. In this
case we will flush everything we wrote to dst, but not everything
we wrote to src.
Change-Id: Ifaf4d8be87535c7364dd59b217dfc631460018ff
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Added a check in Diff to ensure that files that are most likely
not text are not line-by-line diffed. Files are determined to be
binary by checking the first 8000 bytes for a null character. This
is a similar heuristic to what C Git uses.
Change-Id: I2b6f05674c88d89b3f549a5db483f850f7f46c26
Added code to support ignoring leading, trailing, and changed
whitespace when performing a diff operation. I also added command
line options to Diff to enable the various whitespace ignoring
methods. These match the flags for git diff.
Change-Id: Ie56301aafad59ee3f0fe5de62719f5023cd702c8
JGit did not have support for skipping whitespace when comparing
lines in RawText objects. I added a subclass of RawText that skips
whitespace in its equals and hashCode methods. I used a subclass
rather than adding functionality into RawText so that performance
would not be impacted by extra logic.
This class only supports ignoring all whitespace. Others will follow
that allow other forms of whitespace ignoring.
Change-Id: Ic2f79e85215e48d3fd53ec1b4ad13373dd183a4a
Under smart HTTP the biDirectionalPipe flag is false, and we return
back immediately at this point in the negotiation process. There is
no need to flush the stream to the client, the request is over and
it will be automatically flushed out by the higher level servlet
that invoked us. Avoiding flush here allows us to only use flush
after a progress message is sent during pack generation.
Change-Id: Id0c8b7e95e3be6ca4c1b479e096bed6b0283b828
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This simplifies the PackIndex code, which is trying to quickly copy
an existing ObjectId into a MutableObjectId. Rather than having
the PackIndex violate the ObjectId's internals, expose a copy from
function similar to the other ones for copying from raw byte arrays
or hex formatted strings.
Change-Id: I142635cbece54af2ab83c58477961ce925dc8255
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Storage systems can use these implementations to compare a passed
AnyObjectId with a stored representation of an ObjectId in the
canonical network byte order format. This can be useful to do a
binary search, or just linear scan, over an encoded storage file.
Change-Id: I8c72993c4f4c6e98d599ac2c9867453752f25fd2
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
An implementation might prefer to use the RefList type here, and
RefList is part of our public API. Expose the constructor so callers
who have a RefList can take advantage of the existing sorting.
Change-Id: I545867f85aa2c479d2d610024ebbe318144709c8
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
When we finally move RefDirectory to the new storage.file package,
its associated RefDirectoryUpdate will need visiblity to this
constructor in order to initialize itself. This is true of any
other repository implementation, so make it protected rather than
package level visible.
Change-Id: If838aec9baeb80ee2f12dcbca717657c725a9242
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Repository implementations outside of .lib need to be able to
create these events and deliver them to listening application code.
Expose and document the constructors so that they are visible when
we move FileRepository into storage.file.FileRepository.
Change-Id: I7fb6e8f4f5fdab683c5ebb5267673aa6d5b560bb
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
A Git reference name must never end with ".lock", as it would
confuse any existing C client that tries to obtain a clone of the
repository over the network. Even if the repository isn't on a
local filesystem, it still should ban that suffix.
Because I plan to move LockFile to storage.file and make it a private
implementation detail of the local file system storage model,
we can't rely on its package level SUFFIX field here. Making it
public probably won't work long-term either, as I also plan to
pull storage.file into its own separate project that depends on
the core library.
So, just inline the constant here. Its as foribidden as ":" is.
Change-Id: If85076861baeacc183b82696375a13e935ba8836
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This stream was used only to determine how many bytes had been
written thus far. Except we're always dumping it into a simple
ByteArrayOutputStream, which also knows that. Drop the dependency
on the pack stream and use ByteArrayOutputStream directly.
This lets us later move this test into the new storage.file
package without dragging along the pack stream that is an internal
implementation detail of PackWriter, which is more general than
just the file storage layer.
Change-Id: I291689c0b1ed799270c213ee73b710b2637fb238
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Setting this value is pointless, because its automatically set
by the refs.newUpdate call that created the update operation.
The API is protected by default, because application level code,
including this test, should not be calling it.
Change-Id: I8867a4e8007892e2bd44a05d7dec619081081943
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The mapCommit API is being deprecated because it doesn't run very
fast. Leaving tests around to test how fast it is relative to C Git
isn't instructive. Remove them, which should help aid the transition
away from the mapCommit API.
Change-Id: I27e1c844610d7da5b2c44b33a00602706973c9cc
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Some sources had dos line endings. Also configure all projects to use
unix line endings and UTF-8 text encoding.
Change-Id: I8fc9a1dbb219ffa91d1b3011b3b11b7e48e74ca7
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
Modifications to various classes in order to allow serialization
for use of JGit in Hudson's git plugin.
Change-Id: If088717d3da7483538c00a927e433a74085ae9e6
If a repository is "bare", it currently still returns a working directory.
This conflicts with the specification of "bare"-ness.
Bug: 311902
Change-Id: Ib54b31ddc80b9032e6e7bf013948bb83e12cfd88
Signed-off-by: Mathias Kinzler <mathias.kinzler@sap.com>
Currently, there is no way to read the content
of the Git Configuration in a way that would
allow to list all configured values generically.
This change extends the Config class in such a
way as to being able to get a list of sections and
to get a list of names for any given section or
subsection.
This is required in able to implement proper
configuration handling in EGit (show all the
content of a given configuration similar to
"git config -l").
Change-Id: Idd4bc47be18ed0e36b11be8c23c9c707159dc830
Signed-off-by: Mathias Kinzler <mathias.kinzler@sap.com>
* changes:
git-servlet: Fix comparing uploadFactory with the wrong DISABLED instance
Prefer static inner classes
Override equals for SwingLane since super class PlotLane defines it
Make sure a Stream is closed upon errors in IpLogGenerator
Make constant static in RebuildCommitGraph
Make inner classes static in http code
Cache filemode in GitIndex
Remove unused parent field in PlotLane
Removed unused repo field in WorkDirCheckout
Extend DiffFormatter API to simplify styling
Windows doesn't permit us to edit a file in-place with Perl.
So create backup files when we perform the edit, and remove them
when we are done. This is a tad slower on POSIX systems, but is
much more portable.
Change-Id: I429c7d698924cb32e709363f5da82f7232bbdab2
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* stable-0.8:
Qualify post-0.8.4 builds
JGit 0.8.4
JGit 0.8.3
Include about.html in org.eclipse.jgit artifact
Fix build.properties of the JGit feature
Added the standard SULA for JGit
Add "resources/" as a source folder
Change-Id: I4ecb0af41184ef84d104345fd1adcc4a240a38f6
Created wrong tags for 0.8.3 hence creating another version.
Change-Id: I4e00bbcffe1cf872e2d7e3f3d88d068701fb5330
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
Static classes are preferrable to keep unwanted dependencies away,
and they have one less member field.
Signed-off-by: Robin Rosenberg <robin.rosenberg@dewire.com>