Sane Media Files Management using Git-Annex

Posted on August 17, 2020

Problem statement

Building up a collection of videos and music feels good. But when you actually need to find THAT file, lying somewhere in your pile of files, you can only hope your memory doesn’t fail you.

So you do the following:

  • organize them into directories, by categories or whatever fancy scheme you come up with
  • try to encode as much info as possible into the filename so that a file search will hit

However, as your collection grows beyond terabytes, these methods start to fail. You run into more and more files for which you cannot decide which directory they belong in, and file search becomes increasingly slow. Neither method works at all if your files are spread across multiple drives, NASes or even some Blu-ray discs.

I had all of these headaches until I found git-annex. It solved all my media file management problems at once.

How git-annex solves it

(For implementation detail, see https://git-annex.branchable.com/how_it_works/)

Files across multiple storage locations

When you add a file with the git annex add command, only its meta information enters the repository history. The actual file is moved into .git (under .git/annex/objects) and a symlink is created at its original location. When you push or pull, only the meta information and the symlink are transferred.
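As a rough sketch, adding files could be scripted from Python like this. Only git annex add and git commit are real commands; the function names, the repo_dir argument and the injectable run parameter are made up for illustration.

```python
import subprocess


def annex_add_commands(paths, message):
    """Return the two commands that annex the files and record them in git."""
    return [
        ["git", "annex", "add", *paths],   # move content into .git/annex, leave symlinks
        ["git", "commit", "-m", message],  # commit the symlinks and annex bookkeeping
    ]


def add_to_annex(repo_dir, paths, message, run=subprocess.run):
    """Run the commands inside repo_dir; `run` can be swapped out for a dry run."""
    for cmd in annex_add_commands(paths, message):
        run(cmd, cwd=repo_dir, check=True)
```

Calling add_to_annex("/path/to/Annex", ["clip.mkv"], "add clip") would annex one file, assuming git-annex is installed and the directory is already an initialized git-annex repository.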

Now you can clone this repository wherever you need access to your files. Every clone sees the meta information (the symlink) of every file, while the actual content may live on a different remote site. When you need the content of a file, the git annex get command retrieves it for you automatically, as long as you are connected to a remote that has it. You can use git annex drop to remove the local copy; git-annex refuses to drop unless it can verify that at least one other site still holds the actual content. You can even configure git-annex to automatically keep some number of backup copies across different sites.
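The get/drop cycle and the copy-count setting could be wrapped like this. It is a minimal sketch: the subcommands get, drop and numcopies exist in git-annex, while the helper names (annex, require_copies, fetch, release) are hypothetical.

```python
import subprocess


def annex(args, repo_dir=".", run=subprocess.run):
    """Run `git annex <args>` inside repo_dir and return the command used."""
    cmd = ["git", "annex", *args]
    run(cmd, cwd=repo_dir, check=True)
    return cmd


def require_copies(n, **kw):
    # Ask git-annex to keep at least n copies of every file across repositories.
    return annex(["numcopies", str(n)], **kw)


def fetch(path, **kw):
    # Retrieve the content from a reachable remote that has it.
    return annex(["get", path], **kw)


def release(path, **kw):
    # Free local space; git-annex refuses if too few other copies can be verified.
    return annex(["drop", path], **kw)
```

A typical session would be fetch("clip.mkv"), watch the clip, then release("clip.mkv") to reclaim the disk space.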

Metadata management

Admit it. Filenames are just not enough. Sadly, filesystems today are directory-based rather than tag-based. NTFS supports limited metadata management for media files but Windows sucks. There are also external solutions that store metadata in databases or in proprietary formats, but I cannot depend on that software outliving my data.

With git-annex you can ditch them all. You can attach metadata and tags to any file within the repository. And the format is open and stable.
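Attaching fields and tags goes through the git annex metadata command, with -s field=value for a field and -t for a tag. Here is a small sketch that builds such a command; the helper names and the example field and tag values are invented for illustration.

```python
import subprocess


def metadata_command(path, fields=None, tags=()):
    """Build a `git annex metadata` call that sets fields and adds tags on one file."""
    cmd = ["git", "annex", "metadata"]
    for field, value in (fields or {}).items():
        cmd += ["-s", f"{field}={value}"]  # set a field to a value
    for tag in tags:
        cmd += ["-t", tag]                 # add a tag
    return cmd + [path]


def tag_file(path, fields=None, tags=(), repo_dir=".", run=subprocess.run):
    """Apply the metadata inside repo_dir (requires git-annex to be installed)."""
    run(metadata_command(path, fields, tags), cwd=repo_dir, check=True)
```

For example, tag_file("live.mp4", {"person": "Alice"}, ["concert"]) would record one field and one tag on that file.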

My git-annex setup

I created a single repository named “Annex” to manage all my media files. They are stored on two NAS home servers and on one HDD attached to my main PC. On the NAS I use NFS for file sharing, as Samba does not support symbolic links.

I also make cold backups using read-only Blu-ray discs. Check this page.

I am trying to attach a sane amount of metadata to my files. I tag the name of the person a video is related to, the song title of music videos, and so on. I also attach an “onair” field to store the date when the clip went public, as git-annex supports searching with comparison operators (<, <=, >, >=).
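Such a search can be expressed with git annex find and its --metadata matching option. The sketch below is only an illustration: the helper names are hypothetical, and since the git-annex documentation describes these comparisons as numeric, it assumes the date is stored as a plain number such as 20200817, which may not match the format actually used in this repository.

```python
import subprocess

OPERATORS = ("<=", ">=", "<", ">", "=")


def find_command(field, op, value):
    """Build a `git annex find` call that filters on one metadata field."""
    if op not in OPERATORS:
        raise ValueError(f"unsupported operator: {op}")
    # Passing a list (no shell) avoids '<' and '>' being treated as redirection.
    return ["git", "annex", "find", "--metadata", f"{field}{op}{value}"]


def find_files(field, op, value, repo_dir=".", run=subprocess.run):
    """Return the matching file paths, one per line of git-annex output."""
    result = run(find_command(field, op, value), cwd=repo_dir,
                 check=True, capture_output=True, text=True)
    return result.stdout.splitlines()
```

With the numeric date assumption, find_files("onair", ">=", 20200101) would list clips that went public in 2020 or later.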

During the migration, I wrote a Python script to extract as much info as possible from the filenames. However, there are many files that I did not name properly, so I have to tag them manually. But the earlier you bite the bullet, the less pain you will have in the long run.
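The original script is not shown here, so the following is only a sketch of the idea. It assumes a made-up naming convention of the form YYYYMMDD_person_title.ext; the pattern, the field names and the function name are all illustrative.

```python
import re
from pathlib import Path

# Hypothetical convention: <8-digit date>_<person>_<title>.<extension>
PATTERN = re.compile(r"^(?P<onair>\d{8})_(?P<person>[^_]+)_(?P<title>.+)$")


def metadata_from_filename(path):
    """Return a dict of metadata fields, or None if the name does not follow the convention."""
    match = PATTERN.match(Path(path).stem)
    return match.groupdict() if match else None
```

Files for which this returns None are the ones that still need manual tagging; the rest can be fed straight into git annex metadata.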

The next step

Sadly, there seems to be no file manager integration. Ideally, a file manager would automatically retrieve the content of a file when you open a symlink in a git-annex repository. For now, you have to do it manually on the command line. I am planning to write an Emacs package to enhance dired-mode (I am using Emacs as my file manager now) if I have some time. There are existing packages (git-annex.el), though they are not fully automatic and lack functionality for tagging and searching.
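Until such integration exists, the "fetch on open" behaviour can be approximated with a small wrapper. This sketch uses a simple heuristic, namely that a dangling symlink means the content is not present locally; the function names are hypothetical, and it does not check that the symlink really belongs to git-annex.

```python
import os
import subprocess


def content_missing(path):
    """True for a symlink whose target does not exist (content not present locally)."""
    return os.path.islink(path) and not os.path.exists(path)


def ensure_content(path, run=subprocess.run):
    """Run `git annex get` for the file if its content is missing, then return the path."""
    if content_missing(path):
        folder, name = os.path.split(os.path.abspath(path))
        run(["git", "annex", "get", name], cwd=folder, check=True)
    return path
```

A launcher script could call ensure_content(path) before handing the path to a media player.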