Problem description:

Dozens of questions and answers on SO and elsewhere emphasize that Git can't handle large files or large repos. A handful of workarounds are suggested such as git-fat and git-annex, but ideally Git would handle large files/repos natively.

If this limitation has been around for years, is there a reason it has not yet been removed? I assume that there's some technical or design challenge baked into Git that makes large file and large repo support extremely difficult.

Lots of related questions, but none seem to explain why this is such a big hurdle:

  • git with large files
  • What are the file limits in Git (number and size)?
  • Git - repository and file size limits
  • Versioning large text files in git
  • How to handle a large git repository?
  • Managing large binary files with git
  • What is the practical maximum size of a Git repository full of text-based data? [Quora]

Answer:

Basically, it comes down to tradeoffs.

One of your questions has an example from Linus himself:

[...] CVS, ie it really ends up being pretty much oriented to a "one file at a time" model.

Which is nice in that you can have a million files, and then only check out a few of them - you'll never even see the impact of the other 999,995 files.

Git fundamentally never really looks at less than the whole repo. Even if you limit things a bit (ie check out just a portion, or have the history go back just a bit), git ends up still always caring about the whole thing, and carrying the knowledge around.

So git scales really badly if you force it to look at everything as one huge repository. I don't think that part is really fixable, although we can probably improve on it.

And yes, then there's the "big file" issues. I really don't know what to do about huge files. We suck at them, I know.

Just as you won't find a data structure with O(1) index access and insertion, you won't find a content tracker that does everything fantastically.

Git has deliberately chosen to be better at some things, to the detriment of others.


Disk usage

Since Git is a DVCS (distributed version control system), everyone has a copy of the entire repo (unless you use the relatively recent shallow clone).
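
For example, a shallow clone fetches only the most recent history instead of every commit. A minimal sketch of kicking one off from Python (the repository URL is just a placeholder):

    import subprocess

    # --depth 1 downloads only the most recent commit rather than the full history,
    # which keeps both the transfer and the local object store much smaller.
    subprocess.run(
        ["git", "clone", "--depth", "1", "https://example.com/some/huge-repo.git"],
        check=True,
    )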

This has some really nice advantages, which is why DVCSs like Git have become insanely popular.

However, a 4 TB repo on a central server is manageable with SVN or CVS, whereas with Git, nobody will be thrilled about carrying that around.

Git has nifty mechanisms for minimizing the size of your repo by creating delta chains ("diffs") across files. Git isn't constrained by paths or commit order when creating these, and they really work quite well... kind of like gzipping the entire repo.

Git puts all these little diffs into packfiles. Delta chains and packfiles make retrieving objects take a little longer, but they are very effective at minimizing disk usage. (There are those tradeoffs again.)
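
As a rough sketch of the delta idea (my own illustration in Python, not Git's actual binary delta format), storing a second version as a diff against the first costs far less than storing it in full:

    import difflib

    v1 = "\n".join("config value %d = %d" % (i, i * 7) for i in range(2000))
    v2 = v1.replace("value 500 = 3500", "value 500 = 9999")  # one small edit

    # The "delta" is just the few changed lines plus context, not a second full copy.
    delta = "\n".join(difflib.unified_diff(v1.splitlines(), v2.splitlines(), lineterm=""))
    print(len(v2), "bytes to store the new version in full")
    print(len(delta), "bytes to store it as a delta against the old version")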

That mechanism doesn't work as well for binary files, as they tend to differ quite a bit, even after a "small" change.
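
Here is a self-contained illustration of why (the zlib blob below stands in for any already-compressed format such as an image or archive, and the exact numbers will vary): a one-word edit barely touches the raw text, but it changes most of the bytes of the compressed form, leaving little for a delta to reuse:

    import zlib

    text_v1 = b"".join(b"line %05d of the report\n" % i for i in range(5000))
    text_v2 = text_v1.replace(b"line 00042", b"LINE 00042")  # one-line edit

    blob_v1 = zlib.compress(text_v1)  # stand-in for a compressed binary asset
    blob_v2 = zlib.compress(text_v2)

    changed_text = sum(a != b for a, b in zip(text_v1, text_v2))
    changed_blob = sum(a != b for a, b in zip(blob_v1, blob_v2))
    print("raw text bytes changed:  ", changed_text, "of", len(text_v1))
    print("compressed bytes changed:", changed_blob, "of", min(len(blob_v1), len(blob_v2)))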


History

When you check in a file, you have it forever and ever. Your grandchildren's grandchildren's grandchildren will download your cat gif every time they clone your repo.

This of course isn't unique to Git, but being a DVCS makes the consequences more significant.

And while it is possible to remove files, Git's content-addressed design (each object id is a SHA of its content) makes removing those files difficult, invasive, and destructive to history. In contrast, I can delete a crufty binary from an artifact repo or an S3 bucket without affecting the rest of my content.
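
To make "each object id is a SHA of its content" concrete, here is a small Python sketch that reproduces how Git names a blob (the same id `git hash-object` prints). Because tree and commit ids are in turn hashes over these ids, removing one old blob means rewriting every commit that can reach it:

    import hashlib

    def git_blob_id(content: bytes) -> str:
        # Git hashes the header "blob <size>\0" followed by the raw bytes.
        header = b"blob %d\0" % len(content)
        return hashlib.sha1(header + content).hexdigest()

    # Matches: echo 'hello world' | git hash-object --stdin
    print(git_blob_id(b"hello world\n"))  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad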


Difficulty

Working with really large files requires a lot of careful work to make sure you minimize your operations and never load the whole thing into memory. This is extremely difficult to do reliably when creating a program with as complex a feature set as Git.
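
For a sense of what "never load the whole thing into memory" looks like (a generic Python sketch, not Git's code), every operation on a big file has to be written in this streaming style, and a tool with as many code paths as Git has a lot of such operations to get right:

    import hashlib

    def sha1_of_file(path: str, chunk_size: int = 1 << 20) -> str:
        # Read 1 MiB at a time so memory use stays flat no matter how big the file is.
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()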


Conclusion

Ultimately, developers who say "don't put large files in Git" are a bit like those who say "don't put large files in databases". They don't like it, but any alternatives have disadvantages (Git integration in the one case, ACID compliance and foreign keys in the other). In reality, it usually works okay, especially if you have enough memory.

It just doesn't work as well as it does with what it was designed for.

Answer:

It's not true that git "can't handle" large files. It's just that you probably don't want to use git to manage a repository of large binary files, because a git repo contains the complete history of every file, and delta compression is much less effective on most kinds of binary files than it is on text files. The result is a very large repo that takes a long time to clone, uses a lot of disk space, and might be unacceptably slow for other operations because of the sheer amount of data it has to go through.

Alternatives and add-ons like git-annex store the revisions of large binary files separately, in a way that breaks git's usual assumption of having every previous state of the repository available offline at any time, but avoids having to ship such large amounts of data.
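
The general shape of those add-ons looks something like this (a loose sketch of the idea only, not git-annex's actual on-disk format; the store path and pointer text are made up): the real bytes go into a separate content-addressed store, and the repository versions only a tiny pointer:

    import hashlib, os, shutil

    STORE = os.path.expanduser("~/.big-file-store")  # hypothetical external store

    def stash_large_file(path: str) -> str:
        """Copy the real bytes into the external store; return a tiny pointer to commit."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        digest = h.hexdigest()
        os.makedirs(STORE, exist_ok=True)
        shutil.copy(path, os.path.join(STORE, digest))  # content-addressed by its hash
        return "external-object %s\n" % digest          # Git versions only this line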

Answer:

It's because every checkout holds every version of every file.

Now, there are ways git mitigates this issue, such as binary diffs and sparse clones, but certainly every client will have at least two copies (one in the work tree, one in the repository) of every file. Whether this is an issue for you depends on your circumstances.
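
One quick way to see those two copies for yourself (a small sketch; point it at any local clone, and note it ignores details like symlinks): compare the size of the working tree with the size of the .git directory that holds the object database:

    import os

    def dir_size(root: str) -> int:
        return sum(
            os.path.getsize(os.path.join(dirpath, name))
            for dirpath, _, names in os.walk(root)
            for name in names
        )

    repo = "."                                       # path to any local clone
    git_dir = os.path.join(repo, ".git")
    work_tree = dir_size(repo) - dir_size(git_dir)   # everything outside .git
    print("working tree:       ", work_tree, "bytes")
    print(".git (full history):", dir_size(git_dir), "bytes")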
