diff notes.txt @ 0:9117c6561b0b

Add back links from file revisions to changeset revisions Add simple transaction support Add hg verify Improve caching in revlog Fix a bunch of bugs Self-hosting now that the metadata is close to finalized
author mpm@selenic.com
date Tue, 03 May 2005 13:16:10 -0800
parents
children 45cc818f2445
line wrap: on
line diff
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/notes.txt	Tue May 03 13:16:10 2005 -0800
@@ -0,0 +1,159 @@
+Some notes about Mercurial's design
+
+Revlogs:
+
+The fundamental storage type in Mercurial is a "revlog". A revlog is
+the set of all revisions to a file. Each revision is either stored
+compressed in its entirety or as a compressed binary delta against the
+previous version. The decision of when to store a full version is made
+based on how much data would be needed to reconstruct the file. This
+lets us ensure that we never need to read huge amounts of data to
+reconstruct a file, regardless of how many revisions of it we store.
+
+In fact, we should always be able to do it with a single read,
+provided we know when and where to read. This is where the index comes
+in. Each revlog has an index containing a special hash (nodeid) of the
+text, hashes for its parents, and where and how much of the revlog
+data we need to read to reconstruct it. Thus, with one read of the
+index and one read of the data, we can reconstruct any version in time
+proportional to the file size.
+
+Similarly, revlogs and their indices are append-only. This means that
+adding a new version is also O(1) seeks.
+
+Generally revlogs are used to represent revisions of files, but they
+also are used to represent manifests and changesets.
+
+Manifests:
+
+A manifest is simply a list of all files in a given revision of a
+project along with the nodeids of the corresponding file revisions. So
+grabbing a given version of the project means simply looking up its
+manifest and reconstruction all the file revisions pointed to by it.
+
+Changesets:
+
+A changeset is a list of all files changed in a check-in along with a
+change description and some metadata like user and date. It also
+contains a nodeid to the relevent revision of the manifest. Changesets
+and manifests are one-to-one, but contain different data for
+convenience.
+
+Nodeids:
+
+Nodeids are unique ids that are used to represent the contents of a
+file AND its position in the project history. That is, if you change a
+file and then change it back, the result will have a different nodeid
+because it has different history. This is accomplished by including
+the parents of a given revision's nodeids with the revision's text
+when calculating the hash.
+
+Graph merging:
+
+Nodeids are implemented as they are to simplify merging. Merging a
+pair of directed acyclic graphs (aka "the family tree" of the file
+history) requires some method of determining if nodes in different
+graphs correspond. Simply comparing the contents of the node (by
+comparing text of given revisions or their hashes) can get confused by
+identical revisions in the tree.
+
+The nodeid approach makes it trivial - the hash uniquely describes a
+revision's contents and its graph position relative to the root, so
+merge is simply checking whether each nodeid in graph A is in the hash
+table of graph B. If not, we pull them in, adding them sequentially to
+the revlog.
+
+Graph resolving:
+
+Mercurial does branching by copying (or COWing) a repository and thus
+keeps everything nice and linear within a repository. However, when a
+merge of repositories (a "pull") is done, we may often have two head
+revisions in a given graph. To keep things simple, Mercurial forces
+the head revisions to be merged.
+
+It first finds the closest common ancestor of the two heads. If one is
+a child of the other, it becomes the new head. Otherwise, we call out
+to a user-specified 3-way merge tool.
+
+Merging files, manifests, and changesets:
+
+We begin by comparing changeset DAGs, pulling all nodes we don't have
+in our DAG from the other repository. As we do so, we collect a list
+of changed files to merge.
+
+Then for each file, we perform a graph merge and resolve as above.
+It's important to merge files using per-file DAGs rather than just
+changeset level DAGs as this diagram illustrates: 
+
+M   M1   M2
+
+AB
+ |`-------v     M2 clones M
+aB       AB     file A is change in mainline
+ |`---v  AB'    file B is changed in M2
+ |   aB / |     M1 clones M
+ |   ab/  |     M1 changes B
+ |   ab'  |     M1 merges from M2, changes to B conflict
+ |    |  A'B'   M2 changes A
+  `---+--.|
+      |  a'B'   M2 merges from mainline, changes to A conflict
+      `--.|
+         ???    depending on which ancestor we choose, we will have
+	        to redo A hand-merge, B hand-merge, or both
+                but if we look at the files independently, everything
+		is fine
+
+After we've merged files, we merge the manifest log DAG and resolve
+additions and deletions. Then we are ready to resolve the changeset
+DAG - if our merge required any changes (the new head is not a
+decendent of our tip), we must create a new changeset describing all
+of the changes needed to merge it into the tip.
+
+Merge performance:
+
+The I/O operations for performing a merge are O(changed files), not
+O(total changes) and in many cases, we needn't even unpack the deltas
+to add them to our repository (though this optimization isn't
+necessary).
+
+Rollback:
+
+Rollback is not yet implemented, but will be easy to add. When
+performing a commit or a merge, we order things so that the changeset
+entry gets added last. We keep a transaction log of the name of each
+file and its length prior to the transaction. On abort, we simply
+truncate each file to its prior length. This is one of the nice
+properties of the append-only structure of the revlogs.
+
+Remote access:
+
+Mercurial currently supports pulling from "serverless" repositories.
+Simply making the repo directory accessibly via the web and pointing
+hg at it can accomplish a pull. This is relatively bandwidth efficient
+but no effort has been spent on pipelining, so it won't work
+especially well over LAN yet.
+
+It's also quite amenable to rsync, if you don't mind keeping an intact
+copy of the master around locally.
+
+Also note the append-only and ordering properties of the commit
+guarantee that readers will always see a repository in a consistent
+state and no special locking is necessary. As there is generally only
+one writer to an hg repository, there is in fact no exclusion
+implemented yet. 
+
+
+Some comparisons to git:
+
+Most notably, Mercurial uses delta compression and repositories
+created with it will grow much more slowly over time. This also allows
+it to be much more bandwidth efficient. I expect repos sizes and sync
+speeds to be similar to or better than BK, given the use of binary diffs.
+
+Mercurial is roughly the same performance as git and is faster in
+others as it keeps around more metadata. One example is listing and
+retrieving past versions of a file, which it can do without reading
+all the changesets. This metadata will also allow it to perform better
+merges as described above.
+
+