git dir2mod: subdir to submodule
published on Tuesday, June 13, 2017
Want to publish a git repository, but need to reduce the history size of the main repository by changing one of its folders into a submodule, including all branches and tags? This is a write-up of the steps I took; at the end I link a script that can do the whole job.
Step by Step
There are three basic steps:
Extract submodule
Our first step, splitting a subfolder out into a new repository, is a common task. The standard method works as follows (don't skip the clone or you will lose data!):
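A minimal sketch of this step, assuming a mirror clone followed by `git filter-branch --subdirectory-filter`. The function name `extract_subdir` and its three arguments are placeholders chosen for illustration, not names from the original post:

```shell
# Sketch: split SUBDIR of ORIG into a new bare repository DEST.
# extract_subdir and its argument names are hypothetical.
extract_subdir() {
    orig=$1 subdir=$2 dest=$3
    # Work on a mirror clone so the original repository is never rewritten.
    git clone --quiet --mirror "$orig" "$dest" || return 1
    # Rewrite all branches and tags so that SUBDIR becomes the repository root.
    # FILTER_BRANCH_SQUELCH_WARNING only silences the notice of newer git versions.
    FILTER_BRANCH_SQUELCH_WARNING=1 git -C "$dest" filter-branch \
        --prune-empty --subdirectory-filter "$subdir" \
        --tag-name-filter cat -- --all
}
```

It could be called as `extract_subdir original sub sub.git`. Refs whose history never touched the subdirectory are left pointing at their old commits.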
And boom, you're done.
(Use --mirror to copy all your branches and tags and make a bare repository!)
Remove untouched branches (optional)
Assuming your branches and tags formed a connected graph before the rewrite, you can remove the ones that did not contain the subdirectory in question as follows:
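One possible way to do this, which may differ from the approach originally used here: `git filter-branch` keeps a backup under `refs/original/` only for refs it actually rewrote, so any branch or tag without such a backup was left untouched. The function name `prune_untouched_refs` is a placeholder:

```shell
# Sketch: delete branches and tags that filter-branch did not rewrite.
# Assumes the refs/original/ backups from the rewrite are still present.
prune_untouched_refs() {
    repo=$1
    git -C "$repo" for-each-ref --format='%(refname)' refs/heads refs/tags |
    while read -r ref; do
        # No backup ref means the ref was unchanged by the rewrite.
        if ! git -C "$repo" show-ref --verify --quiet "refs/original/$ref"; then
            git -C "$repo" update-ref -d "$ref"
        fi
    done
}
```

Run it before any cleanup that removes `refs/original/`, otherwise every ref would look untouched.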
Compress the new submodule (optional)
Remove unused leftovers from your new repository:
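A sketch of the usual cleanup after a history rewrite, assuming you no longer need the `refs/original/` backups. The function name `compress_repo` is a placeholder:

```shell
# Sketch: drop filter-branch backups and repack the repository.
compress_repo() {
    repo=$1
    # Remove the backup refs that filter-branch left behind.
    git -C "$repo" for-each-ref --format='%(refname)' refs/original |
    while read -r ref; do
        git -C "$repo" update-ref -d "$ref"
    done
    # Expire reflog entries so the old commits become unreachable.
    git -C "$repo" reflog expire --expire=now --all
    # Delete unreachable objects and repack.
    git -C "$repo" gc --quiet --prune=now --aggressive
}
```

This permanently discards the pre-rewrite objects in that repository, so only run it on a clone.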
Create index of submodule commits
Next, we create an index that maps the SHA1 of the subdirectory tree to the SHA1 of the associated commit in the submodule.
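A sketch of how such an index could be produced. After the subdirectory rewrite, the root tree of each submodule commit is the old subdirectory tree, so printing root tree and commit hash per commit gives the mapping. The function name `build_treemap` is a placeholder:

```shell
# Sketch: write "<tree-sha1> <commit-sha1>" lines for all branches and tags.
build_treemap() {
    submodule=$1 out=$2
    # %T is the root tree of a commit, %H is the commit hash.
    git -C "$submodule" log --branches --tags --format='%T %H' > "$out"
}
```

For example, `build_treemap sub.git treemap` writes the index to a file named treemap. Limiting the walk to branches and tags keeps any `refs/original/` backups out of the index.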
You should now have a file called treemap with the hashes of the subdirectory tree and corresponding submodule commit.
Note that this approach is only sensible if the same tree never occurs twice, because a tree hash that belongs to several submodule commits cannot be mapped back unambiguously.
We are now done with the submodule, let's go back to the folder where both the original repository and submodule are located:
Rewrite the main repository
First off, clone your original repository! You don't want to lose data if something goes wrong:
Now, for simplicity, export a few paths for later use:
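The clone and the path export could be sketched together like this. The function name `prepare_clone` and the variable names `ORIG_REPO` and `CLONE_REPO` are assumptions made for this example, not the names used in the original post:

```shell
# Sketch: mirror-clone the original and remember absolute paths to both.
prepare_clone() {
    git clone --quiet --mirror "$1" "$2" || return 1
    # Resolve to absolute paths so they stay valid after changing directories.
    ORIG_REPO=$(cd "$1" && pwd)
    CLONE_REPO=$(cd "$2" && pwd)
    export ORIG_REPO CLONE_REPO
}
```

All further rewriting then happens in `$CLONE_REPO`, leaving `$ORIG_REPO` untouched.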
And create a file with the name gitmod in the directory of the clone with the content that should be put in the .gitmodules file, e.g.:
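As an illustration, the following sketch writes a gitmod file in the standard `.gitmodules` format. The function name `write_gitmod` and the example path and URL are placeholders:

```shell
# Sketch: create the gitmod file that later becomes .gitmodules.
write_gitmod() {
    gitdir=$1 path=$2 url=$3
    # One [submodule] section with the path inside the parent and the URL.
    printf '[submodule "%s"]\n\tpath = %s\n\turl = %s\n' \
        "$path" "$path" "$url" > "$gitdir/gitmod"
}
```

For a bare clone, `write_gitmod clone.git sub ../sub.git` would place the file directly in the git directory.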
(Note: the code below assumes that this file is located in the git directory, so if you did not clone into a bare/mirror repo, you will have to move it to .git/ or adjust the paths accordingly.)
Before proceeding, we will also extract the treemap file into a directory treemap.dir that will be more convenient to access from a shell script:
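A sketch of this extraction, assuming the treemap file holds one "tree commit" pair per line. Each tree hash becomes a file name and the file content is the submodule commit hash. The function name `explode_treemap` is a placeholder:

```shell
# Sketch: turn the treemap file into one small file per tree hash.
explode_treemap() {
    treemap=$1 dir=$2
    mkdir -p "$dir"
    while read -r tree commit; do
        # If a tree hash occurs twice, the later line overwrites the earlier one.
        printf '%s\n' "$commit" > "$dir/$tree"
    done < "$treemap"
}
```

A lookup inside a filter script is then a plain `cat treemap.dir/<tree>` instead of a search through the whole file.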
Finally, run filter-branch:
With this itchy helper script in the git directory:
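The original command and helper script are not reproduced here. As a rough sketch of the same idea, the following uses an inline `--index-filter` instead of a separate helper file. The function name `rewrite_main` and the variables `SUBDIR`, `TREEMAP_DIR` and `GITMOD` are assumptions for this example, and the last two must be absolute paths:

```shell
# Sketch: replace SUBDIR by a submodule link in every commit of REPO.
rewrite_main() {
    repo=$1
    # Exported so the filter, which runs in a separate shell, can read them.
    export SUBDIR=$2 TREEMAP_DIR=$3 GITMOD=$4
    FILTER_BRANCH_SQUELCH_WARNING=1 git -C "$repo" filter-branch --index-filter '
        # Tree currently stored at the subfolder location, if any.
        tree=$(git rev-parse --verify --quiet "$GIT_COMMIT:$SUBDIR") || tree=
        if test -n "$tree" && test -f "$TREEMAP_DIR/$tree"; then
            commit=$(cat "$TREEMAP_DIR/$tree")
            # Drop the folder from the index and add a gitlink entry instead.
            git rm -r --quiet --cached --ignore-unmatch -- "$SUBDIR"
            git update-index --add --cacheinfo "160000,$commit,$SUBDIR"
            # Store the prepared gitmod file as .gitmodules.
            blob=$(git hash-object -w "$GITMOD")
            git update-index --add --cacheinfo "100644,$blob,.gitmodules"
        fi
    ' --tag-name-filter cat -- --all
}
```

Commits in which the subfolder does not exist, or whose subfolder tree is missing from the index, are passed through unchanged in this sketch.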
Okay, this may look a bit monstrous, but all it does is look up the correct commit ID for the tree that's currently at the subfolder's location and replace the subfolder and the .gitmodules file accordingly.
For large repositories, this might be quite slow. If you don't want to wait for hours, keep on reading:
Speed up the third step
As mentioned in "git unpack: efficient tree filter", tree filters can be made a lot faster by parallelizing the tree rewrites and caching subtrees that have already been computed.
Instead of the single filter-branch command, we now proceed in two phases. First, use the Python module to rewrite the trees (parallelized):
This creates an index of OLD_TREE → NEW_TREE that maps the root tree of every existing commit to its rewritten root tree. We will extract this index into an easier-to-access directory structure:
And second, rewrite the commits (sequential):
A multi-hour job can now be done in a few minutes, and there is still room for performance improvements here. Feel free to submit questions and pull requests with your own adaptations on GitHub.
Compress the new parent repository (optional)
Be sure to do this only if you have cloned the original repository. Otherwise you can lose data!
TL;DR: I want this done quickly
I have assembled a script that performs all of these steps for you. Use it as follows:
With the following parameters: