Merge multiple Git repositories into one retaining history

Task description

A while ago, in a customer project, we needed to merge multiple (5) Git repositories into a single one while retaining their full history. The separate repositories had been under active development for the last two years and their combined size had grown to over 600MB. This operation is a bit more involved than basic use of Git, so I decided to write a short blog post about it.

It was also required that all the branches and tags from the separate repositories be merged into the new combined repository. Along the way we needed to find and erase the largest unnecessary files from the Git history to keep the new repository slim. The destination repository would contain each source repository under its own subdirectory.

Workflow

The implementation plan is shown in the picture below. The workflow proceeds as follows:

  1. Find the files that take up the most space in the source repositories’ Git history
  2. Clone source repository to workstation
  3. Clean up source repository on workstation
  4. Rewrite source repository’s history on workstation
  5. Merge master branch from source repository to destination repository
  6. Merge other branches from source repositories into destination repository
  7. Merge tags from source repositories into destination repository

[Figure: git_blogi, diagram of the workflow steps listed above]

Find the files that take up the most space in the source repositories’ Git history

The job starts by finding the files that take up the most space in the source repositories’ Git history. The analysis is performed with a script [1] published by Antony Stubbs. The script goes through the repository’s packfiles [2] and prints the largest objects they contain, sorted by size and nicely formatted. The repository has to be optimized with the git-gc [3] command first so that all the files are included in the scan. After the analysis is complete we have a list of files to be erased from the source repositories’ history in a later phase.
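
The script from [1] is not reproduced here. As a rough sketch of the same kind of analysis, the function below packs the repository with git gc, reads the object sizes from git verify-pack and maps the object ids back to paths with git rev-list. The function name largest_objects and its arguments are hypothetical, and the output format is an assumption, not that of the original script.

```shell
# Sketch only: list the N largest blobs in a repository's packfiles.
# largest_objects is a hypothetical helper, not the script from [1].
# Usage: largest_objects <repository path> [count]
largest_objects() (
    cd "$1" || exit 1
    count=${2:-10}
    # Pack loose objects first so that every object is part of the scan.
    git gc --quiet
    # verify-pack -v prints: <id> <type> <size> <size-in-pack> <offset> ...
    git verify-pack -v "$(git rev-parse --git-dir)"/objects/pack/pack-*.idx |
        awk '$2 == "blob" { print $3, $1 }' |
        sort -rn |
        head -n "$count" |
        while read -r size id; do
            # Map the object id back to a path recorded somewhere in history.
            path=$(git rev-list --objects --all |
                awk -v id="$id" '$1 == id { sub(/^[^ ]+ /, ""); print; exit }')
            printf '%s\t%s\t%s\n' "$size" "$id" "$path"
        done
)
```

Calling largest_objects /path/to/repo 20 would print one line per object with the uncompressed size in bytes, the object id and the path, largest first.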

Clone source repository to workstation

The actual migration starts by cloning the source repository to the workstation, including all the branches and all the tags. We perform a so-called deep clone [4], which is a prerequisite for the cleanup to succeed. After this the remotes of the cloned repositories are removed so that we don’t end up changing the original repositories by accident.
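
A minimal sketch of such a clone, assuming the remote is called origin (the default name git clone gives it): every remote-tracking branch gets a local counterpart, and the remote is then removed. The function name clone_all_branches is hypothetical.

```shell
# Sketch only: clone a repository, create a local branch for every remote
# branch, then drop the remote. clone_all_branches is a hypothetical name.
# Usage: clone_all_branches <url or path> <target directory>
clone_all_branches() (
    git clone --quiet "$1" "$2" || exit 1
    cd "$2" || exit 1
    git for-each-ref --format='%(refname)' refs/remotes/origin |
        while read -r ref; do
            branch=${ref#refs/remotes/origin/}
            # origin/HEAD is only a pointer to the default branch.
            [ "$branch" = HEAD ] && continue
            # Create the local branch unless the clone already made it.
            git show-ref --verify --quiet "refs/heads/$branch" ||
                git branch --quiet --no-track "$branch" "$ref"
        done
    # Without a remote, nothing can be pushed to the original by accident.
    git remote remove origin
)
```

Tags need no extra handling here because git clone fetches them along with the branches.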

Clean up source repository on workstation

At this stage we perform the cleanup tasks. The files selected for removal in the analysis phase are erased from the Git repository’s history. As an example, we remove a single file (bigfile.zip) and a single directory (bigfolder) from the repository, including all the branches and all the tags.

For this we use filter-branch [6], the most robust Git history rewriting tool [5]. The command is designed for rewriting large numbers of commits in a scriptable way. The most important parameters are --index-filter, an option of filter-branch, and --cached, an option of the git rm command that runs inside the filter. Together they make the rewrite operate on the index instead of on checked-out files (as --tree-filter does), which is significantly faster. Of the other parameters, --tag-name-filter cat makes the command process all the tags and --all all the branches.
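
The cleanup described above could look roughly like the sketch below. The filter-branch call uses only the parameters named in the text. The function name purge_big_files is hypothetical, FILTER_BRANCH_SQUELCH_WARNING only silences the notice newer Git versions print, and the final block (deleting the refs/original backups, expiring the reflog and running git gc) is an extra step assumed here so that the space is actually freed.

```shell
# Sketch only: erase bigfile.zip and bigfolder from every branch and tag.
# purge_big_files is a hypothetical name.
# Usage: purge_big_files <repository path>
purge_big_files() (
    cd "$1" || exit 1
    # Newer Git versions print a notice and pause unless this is set.
    export FILTER_BRANCH_SQUELCH_WARNING=1
    git filter-branch \
        --index-filter 'git rm -r --cached --ignore-unmatch --quiet -- bigfile.zip bigfolder' \
        --tag-name-filter cat -- --all || exit 1
    # Assumed extra step: drop the backups filter-branch keeps under
    # refs/original, then expire the reflog and repack to free the space.
    git for-each-ref --format='%(refname)' refs/original |
        while read -r ref; do
            git update-ref -d "$ref"
        done
    git reflog expire --expire=now --all
    git gc --quiet --prune=now
)
```

The --ignore-unmatch switch keeps git rm from failing on commits that do not contain the files.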

Rewrite source repository’s history on workstation

After cleaning up we rewrite the repository’s history to make it look as if the files had always resided in a subdirectory. This way the revision history of the files remains easily accessible with the git log [7] command, without the --follow switch that is normally required after moving files. In this example the subdirectory is named myfolder.
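
One way to do this rewrite is the index-based recipe from the filter-branch documentation [6], sketched below. The wrapper function move_into_subdir and the exported variables MERGE_SUBDIR and MERGE_TAB are hypothetical additions that make the subdirectory name a parameter. The recipe assumes that no commit has a completely empty tree and that the subdirectory name contains no | or & characters.

```shell
# Sketch only: rewrite all branches and tags so that every file appears to
# have always lived under a subdirectory. move_into_subdir is hypothetical.
# Usage: move_into_subdir <repository path> <subdirectory name>
move_into_subdir() (
    cd "$1" || exit 1
    MERGE_SUBDIR=$2
    MERGE_TAB=$(printf '\t')
    # Exported so that the filter, which runs in a child shell, sees them.
    export MERGE_SUBDIR MERGE_TAB FILTER_BRANCH_SQUELCH_WARNING=1
    # For each commit: list the index entries, prefix every path with the
    # subdirectory, write the result to a new index and swap it in.
    # --force overwrites refs/original backups left by an earlier rewrite.
    git filter-branch --force --index-filter '
        git ls-files -s |
            sed "s|$MERGE_TAB\"*|&$MERGE_SUBDIR/|" |
            GIT_INDEX_FILE=$GIT_INDEX_FILE.new git update-index --index-info &&
        mv "$GIT_INDEX_FILE.new" "$GIT_INDEX_FILE"
    ' --tag-name-filter cat -- --all
)
```

After move_into_subdir /path/to/repo myfolder, a plain git log -- myfolder/somefile shows the full history of that file.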

Merge master branch from source repository to destination repository

Finally we merge the processed repository into the newly created repository. When all the separate repositories have been merged, we push the result to the remote.
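
A sketch of one such merge, under a few assumptions: the destination already has at least one commit on the checked-out branch (for example an empty initial commit), the source's main branch is called master, and Git is version 2.9 or newer, where merging unrelated histories has to be allowed explicitly. The function name merge_source_master is hypothetical, and fetching with --no-tags is a choice made here so that tags can be handled separately.

```shell
# Sketch only: merge the master branch of one rewritten source repository
# into the branch currently checked out in the destination repository.
# merge_source_master is a hypothetical name.
# Usage: merge_source_master <destination path> <remote name> <source path or URL>
merge_source_master() (
    dest=$1
    name=$2
    src=$3
    cd "$dest" || exit 1
    git remote add "$name" "$src" || exit 1
    # Tags are left out here and dealt with in a later step.
    git fetch --quiet --no-tags "$name" || exit 1
    # The histories share no common ancestor, so Git 2.9+ needs this flag.
    git merge --quiet --allow-unrelated-histories \
        -m "Merge $name into the combined repository" "$name/master"
)
```

Because each source was rewritten into its own subdirectory beforehand, these merges should not produce conflicts.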

Merge other branches from source repositories into destination repository

At this stage the combined repository’s master branch is complete. Next we merge the other branches from the source repositories.
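
As a sketch, one branch of one source could be brought over as shown below. It assumes the source is already registered as a remote in the destination repository, and that a destination branch of the same name should start from master if it does not exist yet. The function name merge_source_branch and the optional base argument are hypothetical.

```shell
# Sketch only: merge one branch of a source remote into the branch of the
# same name in the destination. merge_source_branch is a hypothetical name.
# Usage: merge_source_branch <destination path> <remote name> <branch> [base]
merge_source_branch() (
    dest=$1
    name=$2
    branch=$3
    base=${4:-master}
    cd "$dest" || exit 1
    git fetch --quiet --no-tags "$name" || exit 1
    # Reuse the destination branch if another source already created it.
    if git show-ref --verify --quiet "refs/heads/$branch"; then
        git checkout --quiet "$branch"
    else
        git checkout --quiet -b "$branch" "$base"
    fi || exit 1
    git merge --quiet --allow-unrelated-histories \
        -m "Merge $name/$branch into $branch" "$name/$branch"
)
```

Running it once per source and branch name collects the same-named branches of all sources into one destination branch.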

Merge tags from source repositories into destination repository

Now we have merged all the branches we wanted into the new repository. As the last step, we need to merge the tags. We can accomplish this by creating a new branch from the tag we are processing and then merging the branches into the new repository.
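
The exact commands depend on how the tags should look in the combined repository, so the following is only one possible reading of the approach. A helper branch is created from the tag in the source, fetched into the destination and merged into a destination branch for that tag, and the tag is then set to the merge result. The function name merge_source_tag and the tag-<name> branch naming are hypothetical. The tag is recreated as a lightweight tag, so annotations are not preserved, and both paths should be absolute.

```shell
# Sketch only: carry one tag from a source repository into the destination
# via a helper branch. merge_source_tag is a hypothetical name.
# Usage: merge_source_tag <destination path> <source path> <tag>
merge_source_tag() (
    dest=$1
    src=$2
    tag=$3
    work="tag-$tag"
    # In the source: a helper branch that points at the tagged commit.
    git -C "$src" branch --quiet --force "$work" "refs/tags/$tag" || exit 1
    cd "$dest" || exit 1
    # Fetch only the helper branch; its tip ends up in FETCH_HEAD.
    git fetch --quiet --no-tags "$src" "refs/heads/$work" || exit 1
    if git show-ref --verify --quiet "refs/heads/$work"; then
        # Another source already contributed to this tag: merge into it.
        git checkout --quiet "$work" &&
            git merge --quiet --allow-unrelated-histories \
                -m "Merge tag $tag from $src" FETCH_HEAD
    else
        git checkout --quiet -b "$work" FETCH_HEAD
    fi || exit 1
    # Point the tag at the combined state (lightweight tag).
    git tag --force "$tag" >/dev/null
)
```

Once all sources have been processed for a tag, the helper branches can be deleted again.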

Conclusion

Overall, performing this migration took about three hours. Most of that time was spent in the repository cleanup stage, because the cleanup operations are slow on large repositories. An external tool such as BFG Repo-Cleaner [8] could probably shorten this compared to Git’s filter-branch. Even so, with basic analysis and cleanup we were able to cut the resulting repository’s size in half, to about 300MB.

Initially the task seemed difficult. But eventually with the tools Git has to offer, the migration turned out to be rather easy.

Sources:
[1] https://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/
[2] http://schacon.github.io/gitbook/7_how_git_stores_objects.html
[3] https://www.kernel.org/pub/software/scm/git/docs/git-gc.html
[4] http://stevelorek.com/how-to-shrink-a-git-repository.html
[5] http://git-scm.com/book/en/v2/Git-Tools-Rewriting-History
[6] http://git-scm.com/docs/git-filter-branch
[7] http://git-scm.com/docs/git-log
[8] https://rtyley.github.io/bfg-repo-cleaner/