hrithik gautham TG

Reputation: 470

Is Git distributed or decentralized?

I know Git is a version control system used to track files. It is also distributed, meaning more than one computer stores the relevant files. But my doubt is whether Git is distributed or decentralized. If it is decentralized, then why do we need GitHub or GitLab? Using GitHub or GitLab makes it distributed (one master, multiple slave nodes), right? After all, we have a master (like GitHub) on which the clients (collaborators) depend. But Git takes advantage of blockchain (of sorts) technology, which makes me think that Git is decentralized, since blockchain applications like Bitcoin and Ethereum are decentralized. Unlike Bitcoin, however, there is no peer-to-peer communication between the nodes in Git, which contradicts the decentralized nature of blockchain: we need GitHub to communicate with the other nodes or to collaborate with others. Could someone please tell me whether Git is distributed or decentralized?

Upvotes: 16

Views: 12213

Answers (2)

Z4-tier

Reputation: 7988

Git is both (and it is neither).

It is distributed...

...in the sense that anyone with a clone of a particular repository is theoretically "equal" to any other developer with a clone of the same repo. One of the main reasons this approach is used is to allow any developer to continue their work without the need to always be connected to a centralized master server. If you have your own complete copy, and it's "equal" to any other, you can develop against it and sync up later.

It is decentralized....

...mainly for the same reason given above. One of the core concepts is that there is no "main" server. The problem with that is, in many situations (such as software engineering at a large company), there really is a need to have a centralized master. It's not that Git isn't meant for this type of workflow (clone --> develop --> commit --> push to central repo), but rather that it doesn't force it upon you. Since that has been such a ubiquitous way of working, it has become the norm to use GitHub on top of Git to provide the structure this type of development cycle needs.

It is neither?

Because it doesn't force you to use any specific workflow model, it is perhaps also reasonable to conclude that Git is neither distributed nor decentralized: it largely transcends these implementation details, allowing users to use it however they wish. It includes functionality that is abstract and flexible to such an extent that it can fit into almost any workflow, but how that works is left for the users to decide. This is also one of the main reasons why Git is so difficult for newcomers to learn.

So just remember that Git and GitHub are not the same. Git is a version control tool, and GitHub is a collaboration tool that happens to use Git, and provides a framework for a specific type of development cycle that is very well established and familiar to many people.

Also, Git can communicate with any host; it is in no way dependent on GitHub to provide centralization, even though we often treat it as if that were the case. Git can use SSH, HTTP(S), and even its own native protocol (git://) to push and fetch data from a repo on any other system, provided the user has access to that host.

What about Blockchain?

Git does use the same underlying data structure, a hash tree (or Merkle tree), as many common blockchain implementations (e.g. Bitcoin, Ethereum). What is more, Git and blockchains have some very similar requirements: both seek to be decentralized and distributed. But how those features fit into the overall purpose of the two technologies is quite different.
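To make the hash-tree idea concrete, here is a minimal Ruby sketch of how Git derives object ids from content. The blob and tree layouts follow Git's object format, but the helper names (`git_object_id`, `blob_id`, `tree_id`) are made up for illustration, and the tree helper only handles a flat directory of regular files with mode 100644.

```ruby
require "digest/sha1"

# An object id is the SHA-1 of "<type> <byte size>\0" followed by the raw body.
def git_object_id(type, body)
  body = body.b # treat everything as raw bytes
  Digest::SHA1.hexdigest("#{type} #{body.bytesize}\0".b + body)
end

# A blob is just the file content.
def blob_id(content)
  git_object_id("blob", content)
end

# Simplified tree: a flat set of regular files (assumption: no subdirectories,
# no executables or symlinks). Each entry is "100644 <name>\0" followed by the
# 20 raw bytes of the blob id, with entries sorted by name.
def tree_id(files)
  body = files.sort_by { |name, _| name }.map do |name, content|
    "100644 #{name}\0".b + [blob_id(content)].pack("H*")
  end.join
  git_object_id("tree", body)
end

puts blob_id("hello world\n") # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
puts tree_id("a.txt" => "one\n", "b.txt" => "two\n")
puts tree_id("a.txt" => "one\n", "b.txt" => "TWO\n") # differs from the line above
```

Because the tree id is computed over the ids of its children, changing a single file changes the id of every tree above it. That property is the one Git shares with blockchain implementations.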

With blockchain, the notion of decentralization is heavily focused on the need to maintain consensus: it is of fundamental importance to the integrity of the blockchain that the majority of the nodes agree on the content of the ledger that they are building. That is because each entry is predicated on the correctness of the previous one. Without consensus, the overall usefulness of a blockchain is unclear.

Compare that to Git, and while some might argue that consensus is also important in maintaining the integrity of a repository, it is not so intrinsic to the general usefulness of Git as a tool. Two clones of the same repo can become massively out of sync without diminishing my ability to use either (or both) of them for version control. It also doesn't preclude my ability to utilize parts of both, as long as I don't mind doing some manual merging. Git even allows for some very extensive "tree surgery" wherein I can freely rewrite history, picking pieces from different sources (even sources without a common ancestor) and stitching them together, ex post facto, to create a chain of events that is pure fiction.
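A toy Ruby model can show why rewritten history is still a perfectly valid history. This is only a sketch: the `Commit` struct and the id formula (a hash of the parent id and the message) are simplifications for illustration, whereas real Git commits also hash the tree, author, and timestamps.

```ruby
require "digest/sha1"

Commit = Struct.new(:id, :parent, :message)

# Simplified commit id: hash of the parent's id plus the message.
def commit(parent, message)
  id = Digest::SHA1.hexdigest("#{parent&.id}\n#{message}")
  Commit.new(id, parent, message)
end

# Build a linear history from a list of messages and return its head.
def build_history(messages)
  messages.reduce(nil) { |parent, message| commit(parent, message) }
end

# Walk from the head back to the root and return ids oldest-first.
def ids(head)
  out = []
  while head
    out << head.id
    head = head.parent
  end
  out.reverse
end

original  = build_history(["init", "add parser", "fix bug"])
rewritten = build_history(["initial import", "add parser", "fix bug"])

# Rewording the very first commit gives every descendant a new id,
# yet both chains are internally consistent and usable on their own.
ids(original).zip(ids(rewritten)).each do |before, after|
  puts "#{before[0, 7]} -> #{after[0, 7]}"
end
```

Nothing in this model requires anyone else to agree on which chain is the "real" one, which is exactly where Git parts ways with a consensus-driven ledger.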

So while these two technologies have some superficial similarities (and some that are a bit deeper, too), they serve different purposes and have their own unique design requirements, and as such they are not directly comparable to one another.

Upvotes: 23

telamon

Reputation: 425

Remembering that I've already spent a year researching the same question, I find it hard to walk away without at least leaving a note. It is after all an excellent question.

Given that "distributed" in the question refers to a system with a central node, Git is brilliantly agnostic to infrastructure politics.

By itself it is neither centralized nor decentralized; it is a fully functional chain of blocks, and it is OFFLINE.

While offline it has the potential to be both distributed and decentralized, but it is neither until the user pushes to or pulls from a remote. Git also supports multiple remotes, so using Git in a centralized manner does not limit its decentralized capabilities.

The reason we use Git with a central hub is that a decentralized alternative offering cost-effectiveness and conveniences similar to those of the cloud platforms does not yet exist.

There are however valid distributed remotes:

hypergit creates a git-remote that points to a one-to-many (single author) p2p-swarm, making commits originating from the central node serverlessly distributed.

If you and a couple of friends each create your own hypergit endpoint and agree to always attempt to fetch from every other person's endpoint before pushing, then you have a fully decentralized solution among yourselves. However, you'll quickly notice that this model scales awkwardly: every participant must fetch from every other participant before each push, so the synchronization effort grows rapidly with the number of participants added to your group.

To clarify the problem: in the model above we introduced a naive global time lock to reduce the risk of merge conflicts. Since Git does not have an "Automatic Conflict Resolution Policy", the default behavior is to raise the alarm and let the user manually correct any merge conflicts. But what happens when both you and your friend unknowingly resolve the same merge conflict and maybe even manage to produce different results?

In a centralized system this is a somewhat unfair but familiar race - he who first manages to push a non-conflicting commit to origin/master gets to go home first for the day. But what do you do when there are multiple remote origins?

Or as a junior in a git-swarm containing conflicting merge-conflict-resolutions, how do I know from which peer to pull? I might stand up and ask:

"I see conflicts everywhere, who of you has the latest non-conflicting state?"

After a moment of discussion, a few fingers should be pointed at an individual remote, meaning the team has arrived at a consensus on whose master branch to use.

In a fully decentralized system, the time it takes to interrupt your neighbors and reach a consensus is plenty of time for new commits to wind up on conflicting branches, generating a completely new set of conflicts in need of resolution.

So to solve that issue we apply a bit of swarm intelligence and equip each peer with an "Automatic Conflict Resolution Policy".

Let's say:

The branch that contains the most recent commits by seniority should be considered canon.

(Disregarding the fact that no single clock shows the same time) we can aggregate the output of git log to produce a comparable vector clock using this dirty one-liner:

ruby -e 'puts `git log --full-history --reverse "--format=format:%at;%an--%ae"`.split("\n").reduce({min: {}, d: {}}) {|out, line| t, a = line.split(";"); out[:min][a] = [t.to_i, out[:min][a] || t.to_i].min; out[:d][a] = t.to_i - out[:min][a];out}[:d].values.sort{|a,b| b <=> a}.join(":")'

This would allow each peer to always know which HEAD to pick in case of a conflict without having to interrupt its neighbors.
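For readers who find the one-liner hard to parse, here is an unrolled sketch of the same aggregation. It assumes the log text has already been captured in the `%at;%an--%ae` format used above, and the name `seniority_clock` and the sample authors are made up for illustration.

```ruby
# Input: one line per commit, oldest first, formatted "<unix time>;<name>--<email>".
# Output: per-author activity spans in seconds, sorted descending, joined by ":".
def seniority_clock(log_text)
  first_seen = {} # author => earliest timestamp seen so far
  span = {}       # author => latest processed timestamp minus earliest

  log_text.each_line(chomp: true) do |line|
    next if line.empty?
    stamp, author = line.split(";", 2)
    t = stamp.to_i
    first_seen[author] = [t, first_seen.fetch(author, t)].min
    span[author] = t - first_seen[author]
  end

  span.values.sort.reverse.join(":")
end

# Hypothetical sample data standing in for real `git log` output.
sample = <<~LOG
  1000;alice--alice@example.com
  1600;bob--bob@example.com
  4600;alice--alice@example.com
  2800;bob--bob@example.com
LOG

puts seniority_clock(sample) # 3600:1200
```

The only behavioral difference from the one-liner is that this version splits on the first `;` only, so an author name containing a semicolon is kept whole.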

With autonomous conflict resolution we have theoretically solved the previous scaling issue and can now discard all the individual swarm-endpoints in favor of one many-to-many sparsely connected swarm where commits are forwarded, merged and discarded according to policy in a decentralized manner.
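How a peer would compare two such clocks is not spelled out here, so the following Ruby sketch rests on assumptions: clocks are compared element by element (largest span first), and ties fall back to the commit id so that every peer makes the same choice. `Candidate`, `parse_clock`, and `pick_canon` are hypothetical names.

```ruby
# A competing branch head together with the clock string computed for it.
Candidate = Struct.new(:head, :clock)

# "3600:1200" -> [3600, 1200]
def parse_clock(text)
  text.split(":").map(&:to_i)
end

# Assumed policy: the lexicographically greatest clock wins; on a tie the
# greater commit id wins, purely to keep the decision deterministic.
def pick_canon(candidates)
  candidates.max_by { |c| [parse_clock(c.clock), c.head] }
end

mine   = Candidate.new("a1b2c3", "3600:1200")
theirs = Candidate.new("d4e5f6", "3600:900")

puts pick_canon([mine, theirs]).head # a1b2c3
```

Since the result depends only on data every peer can compute locally, two peers looking at the same pair of heads reach the same answer without talking to each other.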

Git is now a Blockchain(TM)

...

I am currently studying "Offline First" software designs. Having written a nano-sized, consensus-free offline blockchain, I am horribly stuck trying to write a report on the subject.

Describing something as: "... is decentralized in the same manner as git" just doesn't sit right.

So I reached this question by searching for "Is Git considered decentralized?"

Well, until somebody corrects me, I am left with no choice but to proclaim myself the expert in this context, and I say:

TL;DR;

Git is inherently neither decentralized nor distributed. It is offline and, just like a real-life git, it doesn't care.


If I may add one more paragraph to the subject.

The following two projects illustrate that the Git "chain" can be used to host arbitrary functionality. Both tap directly into and enrich Git's potential for distributed and decentralized use.

git-dit

sit

Upvotes: 10
