git dir2mod: subdir to submodule
published on Tuesday, June 13, 2017
Want to publish a git repository, but need to reduce the history size of the main repository by changing one of its folders into a submodule, including all branches and tags? This is a write-up of the steps I took; at the end I link a script that can do the whole job.
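There are three basic steps: split the subfolder out into a new repository, build an index that maps each state of the subdirectory to a commit in that new repository, and rewrite the history of the original repository so that it references those commits.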
Our first step, splitting a subfolder out into a new repository, is a common task. The standard method works as follows (don't skip the clone, or you will lose data!):
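A minimal sketch of this step, assuming a local repository. The function name `split_subdir` and its three arguments are hypothetical names chosen for illustration, and `FILTER_BRANCH_SQUELCH_WARNING` only silences the notice that newer git versions print before running filter-branch:

```shell
# Sketch: split SUBDIR of ORIGINAL into a new bare repository SUBMODULE.
# split_subdir and its argument order are illustrative, not part of git.
split_subdir() {
    original=$1 submodule=$2 subdir=$3
    # Work on a mirror clone so the original repository stays untouched.
    git clone --quiet --mirror "$original" "$submodule" || return 1
    (
        cd "$submodule" || exit 1
        # Make the subdirectory the new project root on every branch and tag,
        # dropping commits that become empty and keeping tag names as they are.
        FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch \
            --prune-empty --tag-name-filter cat \
            --subdirectory-filter "$subdir" -- --all
    )
}
```

It could be called as `split_subdir ./original ./submodule.git path/to/subdir`, where all three paths are placeholders.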
And boom, you're done.
(Use --mirror to copy all your branches and tags and make a bare repository!)
Remove untouched branches (optional)
Assuming your branches and tags formed a connected graph before the rewrite, you can remove the ones that never contained the subdirectory in question as follows:
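One possible way to do this, sketched under the assumption stated above: a ref that was rewritten shares history with any other rewritten ref, while an untouched ref still points into the old, now unrelated history. The function name `prune_untouched_refs` is hypothetical, and its argument must be a branch that you know was rewritten:

```shell
# Sketch: delete branches and tags that share no history with a rewritten ref.
# Run inside the new submodule repository.
prune_untouched_refs() {
    keep=$1
    # Refuse to run if the reference branch does not resolve to a commit.
    git rev-parse --verify -q "$keep^{commit}" >/dev/null || return 1
    for ref in $(git for-each-ref --format='%(refname)' refs/heads refs/tags); do
        git merge-base "$keep" "$ref" >/dev/null 2>&1
        # merge-base exits with status 1 when there is no common ancestor.
        if [ $? -eq 1 ]; then
            echo "deleting $ref"
            git update-ref -d "$ref"
        fi
    done
}
```

For example, `prune_untouched_refs master` keeps everything connected to `master`. It is worth reviewing the printed list on a clone before relying on it.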
Compress the new submodule (optional)
Remove unused leftovers from your new repository:
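A sketch of the usual cleanup after filter-branch, to be run inside the new repository. The function name `compress_repo` is hypothetical; the git commands themselves are standard:

```shell
# Sketch: drop everything that still references the old history, then repack.
compress_repo() {
    # filter-branch keeps backups of the old refs under refs/original/.
    for ref in $(git for-each-ref --format='%(refname)' refs/original); do
        git update-ref -d "$ref"
    done
    # Expire reflog entries so the old commits become unreachable.
    git reflog expire --expire=now --all
    # Repack and delete unreachable objects immediately.
    git gc --quiet --aggressive --prune=now
}
```

This permanently deletes the pre-rewrite objects from the repository it runs in, so it should only be used on a clone.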
Next, we create an index that maps the SHA1 of the subdirectory tree to the SHA1 of the associated commit in the submodule.
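A minimal sketch of how such an index could be built, run inside the new submodule repository. The function name `build_treemap` is hypothetical. It relies on the fact that after the split the root tree of each submodule commit is exactly the former subdirectory tree:

```shell
# Sketch: write one file per root tree, containing a commit that has that tree.
build_treemap() {
    treemap=$1
    mkdir -p "$treemap"
    # %T is the root tree id and %H the commit id of each commit.
    git log --all --format='%T %H' |
    while read -r tree commit; do
        # git log lists newest commits first, so if several commits share
        # a tree, the oldest one listed ends up in the file.
        echo "$commit" > "$treemap/$tree"
    done
}
```

Calling `build_treemap treemap` creates the folder in the current directory.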
You should now have a folder called treemap with one file for each distinct state of the subdirectory. The file is named after the SHA1 hash of the tree and contains the SHA1 hash of the associated commit.
We are now done with the submodule. Let's go back to the folder where both the original repository and the submodule are located:
First off, clone your original repository! You don't want to lose data if something goes wrong:
Now, for simplicity, export a few paths for later use:
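As a sketch, the exports might look like this. The variable names `SUBDIR`, `TREEMAP` and `GITMOD` as well as the directory names `submodule.git` and `rewrite.git` are assumptions made for illustration. Absolute paths matter here because filter-branch runs its filters from a temporary working directory:

```shell
# Hypothetical layout: submodule.git/ and rewrite.git/ sit side by side
# in the current directory.
export SUBDIR="path/to/subdir"                  # folder that becomes the submodule
export TREEMAP="$PWD/submodule.git/treemap"     # index of tree id -> submodule commit id
export GITMOD="$PWD/rewrite.git/gitmod"         # future content of .gitmodules
```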
Then create a file named gitmod in the directory of the clone, containing what should end up in the .gitmodules file, e.g.:
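For illustration, the file could be written like this. The path `path/to/subdir` and the URL are placeholders that need to be replaced with your own values:

```shell
# Sketch: write the future .gitmodules content into a file called gitmod.
# Path and URL below are placeholders, not real locations.
cat > gitmod <<'EOF'
[submodule "path/to/subdir"]
	path = path/to/subdir
	url = https://example.com/submodule.git
EOF
```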
(Note: the code below assumes that this file is located in the git directory, so if you did not clone into a bare/mirror repo, you will have to move it to .git/ or adjust the paths accordingly.)
Finally, run filter-branch:
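A minimal sketch of the invocation, assuming the per-commit work is done by a shell script that only modifies the index. The function name `rewrite_parent` and the variable `HELPER` are hypothetical; the script path is passed through the environment because filter-branch evaluates the filter in its own shell:

```shell
# Sketch: run a per-commit helper over every branch and tag of the clone.
# Pass the absolute path of the helper script as the only argument.
rewrite_parent() {
    HELPER=$1
    export HELPER
    # --index-filter avoids checking out files, --tag-name-filter cat keeps
    # tag names and moves the tags to the rewritten commits.
    FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch \
        --index-filter 'sh "$HELPER"' \
        --tag-name-filter cat -- --all
}
```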
With this itchy helper script in the git directory:
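A sketch of what such a helper could look like, not necessarily the original one. It assumes that `SUBDIR`, `TREEMAP` and `GITMOD` are exported as absolute paths (hypothetical variable names), that `SUBDIR` has no trailing slash, and that the script is called once per commit from `git filter-branch --index-filter`, which provides `GIT_COMMIT`. The file name `dir2mod-helper.sh` is also just an example:

```shell
cat > dir2mod-helper.sh <<'EOF'
#!/bin/sh
# Runs once per commit as a filter-branch --index-filter.
# Assumes SUBDIR, TREEMAP and GITMOD are exported (hypothetical names).
set -e
# Tree id of the subdirectory in this commit; nothing to do if it is absent.
tree=$(git rev-parse --verify -q "$GIT_COMMIT:$SUBDIR") || exit 0
# Fail loudly on a state that the submodule does not know about.
if [ ! -f "$TREEMAP/$tree" ]; then
    echo "no submodule commit for tree $tree" >&2
    exit 1
fi
commit=$(cat "$TREEMAP/$tree")
# Swap the directory for a gitlink entry (mode 160000) pointing at that commit.
git rm -r -q --cached --ignore-unmatch -- "$SUBDIR"
git update-index --add --cacheinfo "160000,$commit,$SUBDIR"
# Store the prepared file as .gitmodules in the same commit.
blob=$(git hash-object -w "$GITMOD")
git update-index --add --cacheinfo "100644,$blob,.gitmodules"
EOF
```

Aborting on an unknown tree is a design choice in this sketch: a silent skip would leave some commits with the plain directory and others with the submodule.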
Okay, this may look a bit monstrous, but all it does is look up the correct commit ID for the tree that's currently at the subfolder's location and replace the subfolder and the .gitmodules file accordingly.
For large repositories, this might be quite slow. If you don't want to wait for hours, keep on reading:
Speed this up
As mentioned in "git unpack: efficient tree filter", tree filters can be made a lot faster by parallelizing the tree rewrites and caching subtrees that have already been computed.
Instead of the single filter-branch command, we now proceed in two phases. First, use the Python module to rewrite the trees (parallelized):
This creates an index of COMMIT → TREE that associates every existing commit with its rewritten root tree.
And second, rewrite the commits (sequential):
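A sketch of this second phase. The on-disk format of the index is an assumption here: a directory with one file per original commit ID, each containing the ID of the rewritten root tree. The function name `rewrite_commits` and the variable `COMMIT_TREES` are hypothetical as well:

```shell
# Sketch: recreate every commit with a precomputed root tree.
# COMMIT_TREES must be an absolute path to the assumed index directory.
rewrite_commits() {
    COMMIT_TREES=$1
    export COMMIT_TREES
    # filter-branch calls the commit filter with the tree id as first
    # argument, followed by "-p <parent>" pairs. The tree id is swapped
    # for the one from the index; parents and message are passed through.
    FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch \
        --commit-filter 'shift; git commit-tree "$(cat "$COMMIT_TREES/$GIT_COMMIT")" "$@"' \
        --tag-name-filter cat -- --all
}
```

Since the expensive tree rewriting has already happened, each commit here costs little more than one `git commit-tree` call.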
And a multi-hour job can now be done in a few minutes, though there is still room for performance improvements here. Feel free to submit questions and pull requests with your own adaptations on GitHub.
Compress the new parent repository (optional)
Be sure to do this only if you have cloned the original repository. Otherwise you can lose data!
I have assembled a script that performs all of these steps for you. Use it as follows:
With the following parameters: