The identity of a Git commit explained

A Git commit is a snapshot of the repo at a particular point in time. Commits are used to save and track changes in a codebase and, at a high level, they are the building blocks of Git. At a more fundamental level, everything in Git is stored as an object. Learning how commits work is key to taking advantage of Git's full potential.

What is in a Git commit?

A Git commit object contains all the metadata of the commit. This includes the date, author, committer, the commit message, as well as the directory tree object hash, and parent commit hash(es).

To see the contents of the most recent commit and its metadata on the current working branch, run git show --pretty=raw. It'll look something like this:

commit 4a07d916c1802d606a41e00b52e2068f6510ded6
tree 7ed44fc18fc76c66631ac3d4169107b1a48f9c58
parent a08bc7c445e17faad93000089aef41c71f129ade
author Jane Smith <jane@alembic.com.au> 1636455211 +1100
committer Jane Smith <jane@alembic.com.au> 1636455211 +1100

example commit

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..e69de29

<...the rest of the diff...>

You can see all the metadata we mentioned above. The author and commit message are obvious, and the date is the Unix timestamp and timezone offset at the end of the author and committer lines. The commit, tree and parent hashes are worth explaining in more detail.
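The raw format stores each date as a Unix timestamp followed by a UTC offset. As a rough sketch, this is one way to turn the author line from the example into a readable date in Python. The helper name parse_git_date is ours for illustration and is not part of Git.

```python
from datetime import datetime, timedelta, timezone


def parse_git_date(timestamp: str, offset: str) -> datetime:
    """Convert a raw Git date such as ("1636455211", "+1100") to a datetime."""
    sign = -1 if offset.startswith("-") else 1
    hours, minutes = int(offset[1:3]), int(offset[3:5])
    tz = timezone(sign * timedelta(hours=hours, minutes=minutes))
    return datetime.fromtimestamp(int(timestamp), tz=tz)


# Values copied from the author line in the example output above.
author_date = parse_git_date("1636455211", "+1100")
print(author_date.isoformat())
```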

The hash of a commit's parent commit acts as a pointer, linking all the commits together. A commit has more than one parent when it is the result of merging two or more branches.

The tree object referenced in the above commit contains the directory listing for the commit. It is made up of other Tree objects (sub-directories) and Blob objects (binary large objects, each holding the contents of a file). You can read more about Tree and Blob objects here.
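To make the object model a little more concrete: Git derives an object's identifier by hashing a short header (the object type and the content length, then a NUL byte) followed by the content itself. The sketch below reproduces that for a blob in Python. The function name git_object_id is chosen for illustration, and the sketch assumes the default SHA-1 object format.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Hash content the way Git does: type, space, size, NUL byte, content."""
    header = f"{obj_type} {len(content)}".encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# An empty file, like the README.md added in the example diff above.
empty_blob = git_object_id("blob", b"")
print(empty_blob)
```

The first seven characters of the result should match the e69de29 on the index line of the example diff, because the README.md in that commit was added as an empty file.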

What if I amend a commit?

Git commits are immutable. If you amend a commit and make changes, Git will create a completely new commit with a new commit identifier. It doesn't matter what metadata you change; the identifier will change. The old commit with its original identifier still exists, and although it may not be visible, we'll show you how to find it later on in this article.

Commits don't reference any of their children, so any changes to a commit's children won't alter the parent. Any change to a parent commit, however, will result in a new commit identifier for each of its descendants, because every child stores the hash of its parent. This will be important to remember when we introduce Git rebase.
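Here is a minimal sketch of why that happens, assuming the commit layout shown in the raw output earlier. The helper commit_id is hypothetical. It builds a simplified commit body from values copied from the example, and real commits can carry extra headers, so we make no claim that it reproduces the 4a07d91... identifier. The "child commit" message is also invented for the example.

```python
import hashlib


def commit_id(tree: str, parents: list, person: str, message: str) -> str:
    """Build a simplified commit object and return its SHA-1 identifier."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {person}", f"committer {person}", "", message, ""]
    body = "\n".join(lines).encode()
    header = f"commit {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


TREE = "7ed44fc18fc76c66631ac3d4169107b1a48f9c58"
PARENT = "a08bc7c445e17faad93000089aef41c71f129ade"
JANE = "Jane Smith <jane@alembic.com.au> 1636455211 +1100"

original = commit_id(TREE, [PARENT], JANE, "example commit")
amended = commit_id(TREE, [PARENT], JANE, "example commit (amended)")

# A child stores its parent's ID, so a new parent ID forces a new child ID.
child_of_original = commit_id(TREE, [original], JANE, "child commit")
child_of_amended = commit_id(TREE, [amended], JANE, "child commit")

print(original != amended)                    # True
print(child_of_original != child_of_amended)  # True
```

Only the message changes between the two parents here, yet both the parent and its otherwise identical child end up with new identifiers.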

How do hash functions work?

A hash function is an algorithm that takes in data of any size and returns a fixed-size value that is, for practical purposes, unique to that data. The most important features of a hash function are:

A hash function will always produce the same output for a given input
Any small change in the input data will produce a completely different output
The hash function is not reversible.

The unique 40-character identifiers of a Git commit are SHA-1 hashes. For more details, check out how hashing works and specifically SHA-1.

Small changes in input change the output a lot

Let's test this in our terminal:

$ echo -n "abcd" | openssl sha1
81fe8bfe87576c3ecb22426f8e57847382917acf

$ echo -n "abcd" | openssl sha1
81fe8bfe87576c3ecb22426f8e57847382917acf

The input "abcd" will return a hash of 81fe8bfe87576c3ecb22426f8e57847382917acf regardless of how many times it is run, or what system it is run on. If you change the text, even slightly, the returned hash will be wildly different. In the following cases we change only a single character of the input "abcd" (a difference of just one or two bits). Notice how different the resulting hashes are:

$ echo -n "abcd" | openssl sha1
81fe8bfe87576c3ecb22426f8e57847382917acf

$ echo -n "bbcd" | openssl sha1
3c1238dc8e11a8de5af8e80e3d19f9c4b53b629e

$ echo -n "abce" | openssl sha1
0a431a7631cabf6b11b984a943127b5e0aa9d687

This means that if one change is made to a file, directory or the commit's metadata, no matter how small that change is, you will receive an entirely new (and very different) hash identifier.
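The same experiment can be sketched in Python, this time counting how many bits actually differ between the inputs and between the resulting hashes. The helper bit_difference is our own name, and hashlib.sha1 stands in for the openssl command used above.

```python
import hashlib


def bit_difference(a: bytes, b: bytes) -> int:
    """Count the bits that differ between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


base = b"abcd"
for other in (b"bbcd", b"abce"):
    input_bits = bit_difference(base, other)
    output_bits = bit_difference(
        hashlib.sha1(base).digest(), hashlib.sha1(other).digest()
    )
    print(f"{other.decode()}: {input_bits} input bit(s) differ, "
          f"{output_bits} of 160 output bits differ")
```

For a well-behaved hash function you would expect roughly half of the 160 output bits to flip, even though only one or two input bits changed.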

Hash functions are irreversible

Hash functions are irreversible, so there is no practical way to recover the contents that produced a hash identifier. If this were possible, it would break the cryptographic security of Git. It is very unlikely, but possible, for a SHA-1 collision to occur, where different contents produce the same identifier.

Two objects colliding accidentally is exceedingly unlikely. If you had five million programmers each generating one commit per second, your chances of generating a single accidental collision before the Sun turns into a red giant and engulfs the Earth is about 50%.
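As a back-of-the-envelope check on that kind of figure, the sketch below uses the standard birthday approximation to estimate how many random 160-bit hashes are needed for a 50% chance of a collision, and how long that would take at the quoted rate. It treats SHA-1 output as uniformly random, which is an idealisation, and the function names are ours.

```python
import math

HASH_BITS = 160                 # SHA-1 produces 160-bit hashes
COMMITS_PER_SECOND = 5_000_000  # five million programmers, one commit each per second
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def collision_probability(n: float, bits: int = HASH_BITS) -> float:
    """Birthday approximation: p is roughly 1 - exp(-n^2 / (2 * 2^bits))."""
    return -math.expm1(-(n * n) / (2.0 * 2.0 ** bits))


def objects_for_probability(p: float, bits: int = HASH_BITS) -> float:
    """Invert the approximation: how many random hashes give probability p."""
    return math.sqrt(2.0 * 2.0 ** bits * -math.log1p(-p))


n_half = objects_for_probability(0.5)
years = n_half / COMMITS_PER_SECOND / SECONDS_PER_YEAR
print(f"{n_half:.2e} commits for a 50% chance, about {years:.1e} years at that rate")
```

Under these assumptions the answer comes out on the order of billions of years, which is the same ballpark as the time frame in the quote.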

A collision can also be constructed deliberately as part of a collision attack. In 2017 the first practical SHA-1 collision was published, and GitHub's write-up on detecting such collisions is an interesting read. TL;DR: Git is not likely to be exploited in this way anytime soon, for reasons explained in the article.

Everything you commit to Git is safe

Remember that Git commits are immutable. Git never really deletes anything that's been committed unless you explicitly delete it, even when you perform operations that modify commits, such as rebasing or amending. When we change a commit in this way, the old commit is not actually deleted: a new commit is made, and the old commit continues to be stored by Git even though it may not be visible through ordinary commands.

When a commit has nothing referencing it (a branch, for example), it becomes invisible to commands like git log. Git hides these commits from you because otherwise it would be very hard to navigate around the mess of old commits. However, the old commits still exist, at least until Git's garbage collection eventually prunes unreferenced commits whose reflog entries have expired.

It's important to remember that you can always reference any commit by its SHA-1 hash. Remembering 40-character hashes is probably beyond the capabilities of most mortals, so one very handy way to find these old, invisible commits is the reflog.

Finding invisible commits with git reflog

References (or "refs") can be thought of as named pointers to commits. Reference logs, or "reflogs", record the history of the commits that each ref in your local repository has pointed to. Any time HEAD or the tip of a branch changes, it's recorded in the reflog. This could be through switching branches, merges, or even a rebase or amendment. This means that if you accidentally "lose" some data, there's a very good chance it is in the reflog, and you can retrieve it from there.

git reflog shows the reflog for the repository you are currently in. By default, it outputs the reflog of the HEAD ref. You can get a complete reflog of all refs by executing git reflog show --all.
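To give a feel for what the reflog actually records, here is a small sketch that parses one line of a reflog file. It assumes Git's default file-based ref storage, where the log for HEAD lives in .git/logs/HEAD and each line holds the old ID, the new ID, who made the change and when, then a tab and a message. The ReflogEntry type, the parser and the sample line (including its made-up new ID) are hypothetical.

```python
import re
from typing import NamedTuple


class ReflogEntry(NamedTuple):
    old_id: str
    new_id: str
    who: str
    message: str


LINE = re.compile(r"^([0-9a-f]{40}) ([0-9a-f]{40}) (.*?) (\d+) ([+-]\d{4})\t(.*)$")


def parse_reflog_line(line: str) -> ReflogEntry:
    """Parse one line of a reflog file such as .git/logs/HEAD."""
    match = LINE.match(line.rstrip("\n"))
    if match is None:
        raise ValueError(f"not a reflog line: {line!r}")
    old_id, new_id, who, _timestamp, _tz, message = match.groups()
    return ReflogEntry(old_id, new_id, who, message)


# Hypothetical entry written when the example commit is amended;
# the new ID is a made-up placeholder.
sample = (
    "4a07d916c1802d606a41e00b52e2068f6510ded6 "
    "0123456789abcdef0123456789abcdef01234567 "
    "Jane Smith <jane@alembic.com.au> 1636455300 +1100\t"
    "commit (amend): example commit"
)
entry = parse_reflog_line(sample)
print(f"{entry.old_id[:7]} -> {entry.new_id[:7]}: {entry.message}")
```

The old ID in an entry like this is the identifier of the commit that was replaced, which is what you would use to get back to it.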

Summary

A Git commit contains all of the commit metadata and points to a snapshot of the repo at that point in time. The unique 40-character hash that identifies the commit allows Git to track your commits and changes to the repo over time. The hash cannot be reversed to reveal the contents it was made from.

When a commit is modified with git commit --amend or git rebase, it is not deleted, only replaced by a new commit with a new identity. We can always retrieve commits that are no longer visible via the reflog. Knowing the hash of a commit makes it easy to undo almost any Git command with git reset --hard.

Understanding Git fundamentals and what's happening behind the curtain of Git will quickly increase your confidence and abilities, allowing you to take full advantage of Git and its tools.

Stay tuned as we'll be publishing a series of articles about understanding Git fundamentals and improving your Git Fu.