Hm, I was hoping there would be higher-level differences than that. Like maybe the notion of having an index and staging area would be fundamentally different (or absent), or the notion of committing locally and pushing remotely would be fundamentally different, or something like that. Which is not to say command names aren't also confusing in git ('git add' to remove a file is my favorite), but they're not really where I imagined the substantial differences would lie?
> the notion of having an index and staging area would be fundamentally different (or absent)
Mercurial indeed lacks a staging area.
> the notion of committing locally and pushing remotely would be fundamentally different
If you enable phases, it does. By default, all local commits are "draft" commits, whose history can be edited. But once you push a commit to (or pull it from) a public repository, it becomes a "public" commit, and Mercurial will refuse to edit it. There's also a "secret" phase; Mercurial will refuse to push secret commits to a public repository. You can also configure whether remote repositories count as public.
The other fun feature is changeset evolution: when you rebase a changeset, Mercurial will keep the original changeset around with a link between the old and new versions of it. If you push the rebase remotely to a repository that previously had the original, Mercurial will tell people that changesets based on the original also need to be rebased when they pull it, and the history information will let them use the regular rebase flow to do it. Too bad BitBucket is dropping hg repos...
I'm sure someone will dispute that it is the same as the staging area, but mercurial does have the ability to only include some things in commits, and I use it daily. I admit I have no idea how it works on the command line, but in tortoisehg I can select individual files and hunks thereof to add to a commit (screenshot: https://imgur.com/ErPEjSN).
It's my impression that loads of people use this feature and somehow don't notice it exists, and will openly agree that mercurial lacks a staging area. My opinion is that defaulting to including changes from all files previously committed is such a sensible default that it makes the feature so seamless that people forget it exists even whilst using it.
Kinda like how people from the US think you can't turn left on red in Australia (equivalent of right on red in the US). You can, but the existence of slip lanes everywhere it's allowed means you don't actually face the red light when doing so. This hides it so effectively that I've seen several people complaining that they can't turn on red, not noticing that they do it every day.
Mercurial lets you choose which files you want to commit on the command line, either by simply specifying all the relevant globs, or by using the -I (include) and -X (exclude) flags to compose a more complex pattern. You can also use interactive commit mode to open up a TUI to select which changes you want to include. Alternatively, you can also add stuff to the top commit with `hg commit --amend`. git has basically all of these features.
What Mercurial doesn't have is a separate repository concept of an incomplete commit that has awkward, and often inconsistent, interactions with several common repository commands. Git has this, and this is the staging area.
The UI feature you use has nothing to do with the staging area, and is in fact closer to the `hg commit -i` interactive commit mode.
This makes for one difference which sometimes bites people even if they use the two tools in the same way: if you change a file "while committing it", e.g. changing it in one window while writing the commit message in another, Mercurial will commit the changed version, while Git of course commits the file as it was when you staged it, or when you ran commit -a.
I find this quite handy in Mercurial, as I sometimes realise while writing the commit message that e.g. I forgot to remove a line of temporary debug output, and I can still go and do it before I save the commit message. But that's very bad form that reflects badly on me. The way Git does it feels more correct.
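To make the difference concrete, here is a toy Python model of the two behaviours as described above. It takes that description at face value and says nothing about either tool's internals; the function names and the `while_writing_message` callback are made up for illustration.

```python
from typing import Callable, Dict

Files = Dict[str, str]  # path -> file content


def commit_with_staging(worktree: Files,
                        while_writing_message: Callable[[Files], None]) -> Files:
    """Git-like: content is captured when staged, before the message is written."""
    staged = dict(worktree)          # snapshot taken at "git add" / "commit -a" time
    while_writing_message(worktree)  # user edits a file in another window
    return staged                    # the late edit is not part of the commit


def commit_from_worktree(worktree: Files,
                         while_writing_message: Callable[[Files], None]) -> Files:
    """Mercurial-like, per the comment above: files are read once the message is saved."""
    while_writing_message(worktree)
    return dict(worktree)


def remove_debug_line(worktree: Files) -> None:
    # The "oops, I left debug output in" edit made while the editor is open.
    worktree["app.py"] = worktree["app.py"].replace("print('debug')\n", "")


source = "print('debug')\nrun()\n"
git_like = commit_with_staging({"app.py": source}, remove_debug_line)
hg_like = commit_from_worktree({"app.py": source}, remove_debug_line)

print("debug" in git_like["app.py"])  # True: the staged copy still has the debug line
print("debug" in hg_like["app.py"])   # False: the late edit made it into the commit
```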
Yes, hg's interactive commit allows this (and if you look at my screenshot, you can see tortoisehg presents checkboxes next to hunks in the diff, this is the way I use it)
Thinking back about my Git usage, I've indeed only ever used it as a sort of poor man's "hg commit --interactive", so I haven't missed it a bit when using Mercurial. Indeed I find Mercurial's model (there's only the repository with the real changesets and the working directory, but not this sort of weird inbetween thing called the staging area) much easier to reason about...
> Mercurial does have the ability to only include some things in commits
So do p4 and svn, through "changelists". The difference is that only in git is the staging area "persistent" from one command to another, and hidden from you when you run the default "what are the differences between my working directory and the head" command.
Mercurial does have a staging area, it is just turned off by default (and this is a good thing).
Mercurial has a very good extension system, and ships several useful extensions in the default installation. The 'record' extension adds the 'commit individual hunks' functionality of the git staging area, and needs a one-line "record=" in your config file to enable it. See: https://www.mercurial-scm.org/wiki/RecordExtension
By having some of these functionalities in extensions, the core interface is kept simpler and easier to learn, but the more powerful features are still there for those who want them.
I don't believe hg has an equivalent of the git index, although there are other ways to get similar functionality if you want it. The repository formats are different, of course, but it's easy enough to convert from one to the other. Git branches are mutable pointers that don't get recorded in the commit history, while in hg each commit records the branch it was committed to. hg has "bookmarks" that work like git's branches; I don't think git has any equivalent of hg-style branches.
For the most part, you can do anything with either one that you could do with the other, and the practical differences are mostly just in the UI, as well as a cultural component (e.g. hg more strongly discouraging history rewriting is part UI, part culture). Personally, I prefer the git UI because it has a more direct mapping onto manipulating a DAG of commits, but obviously not everyone feels that way, and I don't expect them to. When I first started with DVCS, I started with hg since it seemed easier to learn.
(Although really, my DVCS UI of choice is Magit, which I use for 99.9% of git stuff, to the point where I almost never need to actually run git myself on the command line.)
>> more direct mapping onto manipulating a DAG of commits
That would be a leaky abstraction. The fact that it is using a DAG should be irrelevant. If they find a better way of doing the work, that shouldn't really affect the UX.
I'm not sure I follow. A git history is literally a DAG of commits, and if you understand what each command is doing in terms of that DAG, there's very little abstraction at all, so there's nothing to leak.
The general consensus seems to be that git is an excellent semantic model wrapped in a ... mostly acceptable command-line UI. Even most big git fans aren't going to deny that the UI has some big warts (I'm a fan of git myself, there's _definitely_ things I'd change in a perfect world).
- [Major] I feel (but could be potentially convinced otherwise?) that there is one very deep fundamental flaw in the semantic model, and that is the fact that the identity of a commit depends on its history. I simply do not understand why this has to be the case. If I later discover a ZIP backup of the tree that I forgot to commit, and I want to insert it into the history, it shouldn't suddenly completely break the entire repo. Of course it seems fine to have a hash that depends on the history, and it's very likely useful for many purposes, but that shouldn't be the primary mechanism for identifying commits. By default, I think the identity of a commit should be defined by a hash of its contents only, but independent of its history. This would (among other things) let you re-write the history structure without rewriting the commits themselves and causing other people to have to reset their repos, which seems insanely useful to me.
I think the reason for why the commit hash has to change is that a commit represents the entire state of a repository, not just the change made in the commit. Being able to take a sequence of commits and insert them into a repository just is not a thing that makes sense in git's model.
If you just hashed diffs, you would not get whole-repo integrity guarantees.
It is possible to go the other way with patch theory (see Darcs) but it's far from trivial to implement performantly.
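A small Python sketch of why inserting a commit mid-history changes every later commit ID. It assumes git's SHA-1 object naming (a `<type> <size>\0` header followed by the body) and the plain-text commit layout; the author line is a made-up fixed value so the IDs are reproducible, and `commit_id` and `build_chain` are hypothetical helpers, not git APIs. Every commit here points at the same (empty) tree, so the snapshots are identical and only the parent links and messages differ.

```python
import hashlib
from typing import List, Optional


def git_object_id(kind: str, body: bytes) -> str:
    # git names an object by SHA-1 over "<kind> <size>\0" followed by the body.
    header = f"{kind} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


def commit_id(tree: str, parents: List[str], message: str) -> str:
    # Fixed, invented identity and timestamp (assumption for reproducibility).
    who = "Example Author <author@example.com> 0 +0000"
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {who}", f"committer {who}", "", message, ""]
    return git_object_id("commit", "\n".join(lines).encode())


EMPTY_TREE = git_object_id("tree", b"")


def build_chain(messages: List[str]) -> List[str]:
    """Commit IDs for a linear history in which every commit has the same tree."""
    ids: List[str] = []
    parent: Optional[str] = None
    for message in messages:
        parent = commit_id(EMPTY_TREE, [parent] if parent else [], message)
        ids.append(parent)
    return ids


original = build_chain(["first", "second", "third"])
rewritten = build_chain(["first", "forgotten backup", "second", "third"])

print(original[0] == rewritten[0])  # True: nothing before the insertion changed
print(original[1] == rewritten[2])  # False: "second" now has a different parent
print(original[2] == rewritten[3])  # False: and so the change cascades to "third"
```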
> I think the reason for why the commit hash has to change is that a commit represents the entire state of a repository, not just the change made in the commit.
Yes, of course I realize that's the reason. My entire point was that a commit shouldn't represent the entire state of a repository.
> Being able to take a sequence of commits and insert them into a repository just is not a thing that makes sense in git's model.
Yes, and this is exactly why I declared this to be a fundamental flaw in git's model.
> If you just hashed diffs
Diffs are an implementation concern, which I don't care about. I'm only talking about the logical semantics.
> you would not get whole-repo integrity guarantees.
As I explained, I wasn't suggesting you must get rid of that hash entirely: "Of course it seems fine to have a hash that depends on the history, and it's very likely useful for many purposes, but that shouldn't be the primary mechanism for identifying commits."
> It is possible to go the other way with patch theory (see Darcs) but it's far from trivial to implement performantly.
Again, I didn't say you have to get rid of the current hashes. I was just saying we need something else to use for identifying commits.
------
If an example helps: consider what happens when you (say) sign off on a commit. Are you genuinely signing off on the history? Can you even claim with a straight face that you even know everything in the history behind every commit you sign off on? The reality is, you don't, and you don't need to, because you're only concerned about the commit itself. There's no reason a change in history should invalidate your signature. (Of course, the point here is not just signatures. They're just one example to illustrate what I'm saying. You can think of other scenarios.)
No, you are signing off on the current state of the repository. Otherwise it would be possible (not trivial, but possible) to take a signed commit and apply it on a different history, which could create a security loophole.
Your view on commit is a logical set of changes. Git's view is state of the repository. The set of changes between revisions, which is more useful for a developer to see than the whole state, is computed on the fly.
>I was just saying we need something else to use for identifying commits.
> Your view on commit is a logical set of changes. Git's view is state of the repository.
No, my view of a commit is not a logical set of changes. It's everything that would be in my worktree if I checked out the commit. Which is neither merely the changes from the previous commit(s), nor the entire history leading to the current commit.
But git already has this object, it’s called a tree and each commit has a unique tree associated with it. The commits are the object that carries history and metadata on top of the trees. Is your objection that the commit metadata is associated to the commit and not the tree?
I used to think that, but life gets complicated. How do you transmit a commit with its history? It used to be you just sent a single hash, now you would have to send all the commits of the whole history. Also how would you merge repositories with different histories?
I spent some time thinking about it and couldn't come up with anything sensible that wouldn't lead to history being effectively broken.
I'm not sure I understand what the problem is. I'm only talking about the logical objects, not the physical representation. You can still store diffs and you can still have history hashes if that helps you with storage, processing, etc. -- that's perfectly fine. The storage optimizations should be independent of the logical structure. I'm just saying the logical identity of a commit shouldn't depend on its history. For example, if someone removes a commit from the history, that shouldn't have to trash anyone's repo and be such a massively destructive operation. It should only cause a client (in the worst case) to resync its history hashes from that point onward the next time it pulls -- which is quite a cheap, fast, and non-intrusive operation. (Say, 100k commits with 20B SHA-1 hashes would just be ~2MB.)
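The size estimate in parentheses is easy to check. A quick Python calculation, taking the 100k-commit figure from the comment as given:

```python
import hashlib

commits = 100_000
bytes_per_hash = hashlib.sha1().digest_size  # 20 bytes for a raw SHA-1 digest
total_bytes = commits * bytes_per_hash

print(total_bytes)              # 2000000
print(total_bytes / 1_000_000)  # 2.0, i.e. about 2 MB
```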
I'm not sure I understand. How could removing a commit from the history not be a destructive operation? It would necessarily affect every commit after it, hashes or no hashes, because for each commit following, the state of the tree would change, and thus so would the commit.
To my mind it would be akin to walking across the room and then somehow changing things such that you took one step fewer than required.
I'm not sure what kind of implementation you're envisioning that could work the way you seem to describe. Or do you mean that git should save the entire state of the repository as independent blobs every time you commit something? I don't think you could do that with any hope of reasonable performance.
If you instead just allowed "removing" commits logically without actually physically altering the datastructure on disk, there's no point in providing the functionality in the first place.
> If you instead just allowed "removing" commits logically without actually physically altering the datastructure on disk
Yes
> there's no point in providing the functionality in the first place.
Why so dismissive? Wouldn't it make sense to give me the benefit of the doubt here and ask me what the point of something like this might be, instead of just shutting it down as pointless? Unless you think I'm just dumb, or otherwise trying to troll here by asking for something pointless?
I am not being dismissive. If you provide functionality that allows the user to delete something without actually deleting it, what's the point of pretending that you can delete things? Usually when people want to delete commits, it's because they committed something like a secret, and really do want to delete it.
Git doesn't try to hide the fact that the committed data is immutable, and to accomplish "deletion" the only option is to rewrite the entire affected part of the datastructure and garbage-collect anything that's unreferenced. You can not modify a commit. You can only create new commits and manipulate references to them.
This is fundamentally what enables git to function in a distributed manner, since the only state between repositories that needs special logic are the references; the actual data could be blindly synced with rsync or something, because it practically speaking can't ever conflict.
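A minimal Python sketch of that point, using a plain dict as a stand-in for the object store. It hashes the raw bytes with SHA-1 rather than reproducing git's actual object format, and `put` and `blind_copy` are hypothetical names. Because every key is derived from the content, copying objects in either direction can never clash, while the references still disagree and need real logic.

```python
import hashlib
from typing import Dict

ObjectStore = Dict[str, bytes]


def put(store: ObjectStore, data: bytes) -> str:
    """Store data under the hash of its content and return that key."""
    key = hashlib.sha1(data).hexdigest()
    store[key] = data
    return key


def blind_copy(dst: ObjectStore, src: ObjectStore) -> None:
    """Copy every object across; equal keys imply equal bytes, so nothing clashes."""
    for key, data in src.items():
        dst.setdefault(key, data)


alice: ObjectStore = {}
bob: ObjectStore = {}
put(alice, b"common ancestor")
put(bob, b"common ancestor")
alice_tip = put(alice, b"alice's work")
bob_tip = put(bob, b"bob's work")

# Sync the immutable data in both directions without any conflict handling.
blind_copy(alice, bob)
blind_copy(bob, alice)
print(alice == bob)  # True

# The mutable references are the only state that still disagrees.
alice_refs = {"main": alice_tip}
bob_refs = {"main": bob_tip}
print(alice_refs == bob_refs)  # False
```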
In order to have useful global non-hash commit identifiers, you would need a separate data structure of references that somehow decides which commits are identical, and is capable of reconciling conflicts globally across all clones of a git repository. I'm pretty sure that this isn't even in theory possible for the general case.
As for signoffs, a change in history might make a change you signed off broken or completely irrelevant, so yes, I do think that a change in history can invalidate a signoff on a commit.
What you ask for already exists: it's called the tree hash (which can be obtained by doing `git log -n1 --pretty="%T"`). The tree hash is unaffected by history, so if you for instance revert a commit, the tree hash will also revert. IIRC Julia uses tree hashes rather than commit hashes to track its packages.
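For illustration, a Python sketch that computes tree IDs by hand, assuming git's SHA-1 object naming and a flat directory of regular files with mode 100644 (no subdirectories, symlinks, or executables). `tree_id` is a hypothetical helper, not a git API. Since the ID is built only from file names and contents, going back to earlier contents gives back the earlier tree ID, regardless of how many commits happened in between.

```python
import hashlib
from typing import Dict


def git_object_id(kind: str, body: bytes) -> str:
    # git names an object by SHA-1 over "<kind> <size>\0" followed by the body.
    header = f"{kind} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


def tree_id(files: Dict[str, bytes]) -> str:
    """ID of a flat directory of regular, non-executable files (mode 100644)."""
    body = b""
    for name in sorted(files, key=str.encode):  # entries are ordered by name
        blob = git_object_id("blob", files[name])
        # Each entry is "<mode> <name>\0" followed by the raw 20-byte blob ID.
        body += b"100644 " + name.encode() + b"\0" + bytes.fromhex(blob)
    return git_object_id("tree", body)


before = tree_id({"README": b"hello world\n"})
changed = tree_id({"README": b"hello world\n", "notes.txt": b"temporary\n"})
reverted = tree_id({"README": b"hello world\n"})

print(before == changed)   # False: the snapshot differs
print(before == reverted)  # True: same contents, same tree ID
```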
I'm most definitely not suggesting we should be using patches instead of commits though. I don't want anything to be logically composed of patches at all. (Physical-storage-wise, they can go wild; I don't care.)
Ah, I misunderstood you. I thought you were asking for the identity of a commit to be the identity of the contents of that commit, i.e. the changes - but it seems you're talking about the contents of the working tree at the moment of the commit, with no dependency on prior history.
The thing is, the contents of a commit aren't patches. They are snapshots of the worktree. Your mental model is wrong (sorry), that's why you misunderstood. :-)
This is a common misconception that is corrected in many blog posts and tutorials; it's also explained clearly in the documentation. See the section (aptly) titled "Snapshots, Not Differences", where it says [1]:
"The major difference between Git and any other VCS (Subversion and friends included) is the way Git thinks about its data. Conceptually, most other systems store information as a list of file-based changes. These systems (CVS, Subversion, Perforce, Bazaar, and so on) think of the information they keep as a set of files and the changes made to each file over time. [...] Git doesn’t think of or store its data this way. Instead, Git thinks of its data more like a set of snapshots of a mini filesystem. Every time you commit, or save the state of your project in Git, it basically takes a picture of what all your files look like at that moment and stores a reference to that snapshot."
Now of course as an implementation detail it only stores diffs based on existing blobs, but except for the obvious speed difference, this fact is completely irrelevant to you as a user. You neither know nor care how it is actually storing its commits. And the thing is, even if you looked underneath, you would have absolutely no guarantee that the blobs are physically stored as diffs against the parent commits. They might be stored as diffs against other random blobs in the repo for efficiency, and the user would be none the wiser.
What's the difference, though? How are patches different from commuting commits? By commuting I mean commits that do not depend on their position in history.
> - [Major] I feel (but could be potentially convinced otherwise?) that there is one very deep fundamental flaw in the semantic model, and that is the fact that the identity of a commit depends on its history.
Commits whose identity does not depend on their position in history are commits that are commutative (with respect to their position in history). So you very much said so, but we obviously appear to be talking about different things. I'm at a loss as to where these things differ.
What? This is like saying you and your younger brother are commutative. It makes no sense. Commits are snapshots, not diffs. i.e. they're variables, not operations. i.e. they're nouns, not verbs. They're as commutative as you and I are.
Oh, I see what you're saying now, I think. You're arguing for commits to completely lose any relationship with one another by default while remaining simple snapshots. I didn't realize this at first since I fail to see the immediate utility of this.
I agree the concept of a standalone snapshot is useful, but I don't think snapshots are the right abstraction when thinking about the evolution of a codebase from a human perspective and consider changes the more important concept.
I mean, the idea that commits are diffs is a (common) misconception about git, likely carried over from another VCS. The snapshot model is the current abstraction; I haven't added any idea of my own here. It's right there in the documentation: https://git-scm.com/book/en/v2/Getting-Started-What-is-Git%3...
But I never said changes aren't important and should be neglected. And I'm also not saying there shouldn't be any record of commits' relationships to each other. You certainly can and should record and utilize that information as well. It just shouldn't be part of that commit. (Except maybe if it's a merge commit, in which case the contents of the previous file system snapshot are relevant. But even there, that shouldn't include the hash, which represents all the ancestors.) Information about commits' relationships should be external information, whose manipulation won't suddenly alter the commits or their identities themselves.
This isn't a radical proposal or something. For starters, git's own documentation (which I already linked here) literally says "Git thinks of its data more like a series of snapshots of a miniature filesystem". Well, the snapshot of the file system doesn't include the history of how it came into creation, so doesn't that mean that shouldn't be part of your commit? That's already a contradiction with its own principles right there. And going beyond that, most things we do with commits already revolve around the file system snapshots, not the history. e.g. when you sign a commit, you sign the snapshot, not the history. Or when you say this guy is the "committer", you're just talking about the snapshot, not the history. And when a commit gets inserted into the middle of the history, that logically doesn't affect you, and in practice, you don't want it to trash the commit you're on. The identity of your commit is still the same after all -- it's the same snapshot.
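A rough Python sketch of the model being proposed here, purely hypothetical and not how git works: each commit is identified by a hash of its snapshot alone, and history is a separate, external list of those IDs. The `snapshot_id` function and the JSON canonicalisation are assumptions made for illustration.

```python
import hashlib
import json
from typing import Dict, List


def snapshot_id(files: Dict[str, str]) -> str:
    """Hypothetical identity: a hash of the snapshot's contents and nothing else."""
    canonical = json.dumps(files, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()


v1 = snapshot_id({"main.c": "int main(void) { return 0; }\n"})
v2 = snapshot_id({"main.c": "int main(void) { return 1; }\n"})
backup = snapshot_id({"main.c": "int main(void) { }\n"})

# History is kept outside the commits, as an ordered list of snapshot IDs.
history: List[str] = [v1, v2]
rewritten: List[str] = [v1, backup, v2]  # insert the forgotten backup in the middle

print(rewritten[-1] == history[-1])  # True: the tip keeps its identity
print(snapshot_id({"main.c": "int main(void) { return 0; }\n"}) == v1)  # True
```

The second check shows one property of this sketch: two commits with identical contents, such as a commit and a later revert back to it, share an ID, so anything that distinguishes them (author, date, message) would have to live in the external record as well.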