I want to preface this by saying that any sort of command line mucking with your email should begin with a full backup. Part of the reason I moved to mutt was to make my local mailbox the source of truth for my email, which makes me explicitly responsible for backing it up.
Since the mailbox is stored entirely as plain text files, and I only have about 6k emails, I just created a new git repo in my mailbox and added everything. Git is actually perfect for this because it’s designed to track changes to many small text files. After each step in this process I just made a new commit. A bonus is that you can see exactly what changes the various mbsync commands make.
You have been warned.
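For illustration, the commit-after-every-step habit could be wrapped in a small helper along these lines. This is only a sketch: the function name `snapshot_mailbox` and the `~/Mail` path in the usage note are made up for the example, and it assumes git is installed and has a user identity configured.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): record the current
# state of a mail directory as a git commit.
# Usage: snapshot_mailbox <maildir> <commit message>
snapshot_mailbox() {
    maildir=$1
    message=$2
    # Create the repository on first use only.
    [ -d "$maildir/.git" ] || git -C "$maildir" init -q
    # Stage additions, modifications and deletions.
    git -C "$maildir" add -A
    # Commit only when something is staged, so a no-op step does not fail.
    if ! git -C "$maildir" diff --cached --quiet; then
        git -C "$maildir" commit -q -m "$message"
    fi
}
```

Something like `snapshot_mailbox ~/Mail "after mbsync -a"` would then record each step, and `git -C ~/Mail show --stat` shows what the last step touched. Keep in mind that git does not track empty directories, so an empty new/ or tmp/ folder will not be part of the snapshot.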
So I’ve been using mutt as the primary email client on one of my systems for a few months now. It took a while to set up but I love the simplicity of it. I still use some other mail clients if I need to view an HTML email for some reason but plain text is great for most things.
Being a CLI app means it’s extremely fast, I can navigate via keyboard, and I can access it remotely with ssh. It’s also free from tracking pixels and HTML, which means when I get a phishing email the first thing I see is the phony URLs and not the legit-looking graphics. You can even set it up to let you edit the headers on outgoing messages by default, which is not super useful but it’s kind of a neat trick and means I can easily send messages from anything@mydomain.
Sync all the things
I started out syncing my local mailbox from Fastmail using mbsync but only the INBOX folder, not the Spam/Archive/Trash etc. Once I got a bit more confident I decided to modify my workflow so when I archive a message locally, it gets moved to the Archive folder and removed from the Inbox, and then these changes get pushed back to the IMAP server.
I changed my .mbsyncrc from Sync Pull to Sync All, held my breath, and ran mbsync -a.
It worked! Except somehow I ended up with some duplicate folders. Folders that only differ by capitalization, like sent/Sent, archive/Archive, etc. I think locally mutt created the lowercase ones and Fastmail created the capitalized ones and now I have messages in each.
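A quick way to spot folders like these before touching anything might look like the sketch below. The helper name `case_collisions` and the `~/Mail` path are assumptions for the example; it only reads names and never modifies the mailbox.

```shell
#!/bin/sh
# Hypothetical helper: read one name per line on stdin and print, in
# lowercase, every name that occurs more than once when case is ignored.
case_collisions() {
    tr '[:upper:]' '[:lower:]' | sort | uniq -d
}

# Example (assumed mailbox path):
#   ls -1 ~/Mail | case_collisions
```

Each line of output is a folder name that exists in more than one capitalization.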
My solution to this (pro tip: don’t do this) was to simply merge the folders.
mv sent/* Sent
rmdir sent
Problem solved (or so I thought)
My guess is that mutt assigns unique identifiers that are only unique per folder. I started seeing warnings when syncing:
Maildir error: duplicate UID 1.
A short Google search later, I found a blog post explaining that the UIDs are stored in the name of the message file (cytokine is/was the name of my laptop).
The part at the end with U=5065 is the UID assigned by mutt. To fix duplicate UIDs (said the blog post) simply rename the file and remove the UID, and mutt will regenerate it.
I did this manually for the first duplicate UID, and then the second, and then I realized that the sync would only report the first duplicate UID and then quit (it wouldn’t give me a list of all duplicate UIDs).
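Since only the first duplicate gets reported, a read-only listing of every repeated UID in a folder can help gauge the size of the problem. The sketch below assumes the UID appears in the filename as `,U=<number>` (as in the U=5065 example) and that messages live in the folder's cur/ and new/ subdirectories. The name `duplicate_uids` is made up for the example.

```shell
#!/bin/sh
# Hypothetical helper: print every UID that appears in more than one
# message filename inside a single Maildir folder.
# Usage: duplicate_uids <folder>
duplicate_uids() {
    folder=$1
    # List message files; a missing cur/ or new/ directory is ignored.
    find "$folder/cur" "$folder/new" -type f 2>/dev/null |
        # Keep only the number that follows ",U=" in each name.
        sed -n 's/.*,U=\([0-9][0-9]*\).*/\1/p' |
        # Print each UID that occurs at least twice, once.
        sort -n | uniq -d
}
```

An empty result means no UID is repeated in that folder. It says nothing about whether the messages themselves are duplicates.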
A large number of duplicate UIDs was easy enough to fix (I thought, stupidly) using perl-rename: I’ll just strip all the UIDs.
perl-rename 's/,U=1\:.*//' */cur/*
Unfortunately what I didn’t realize was that at some point I had actually created duplicates of some messages, and that the duplicate UIDs were a symptom, not the root cause.
If you are trying to fix duplicate messages in a mailbox, start here.
There’s a great tool called rmlint (hopefully available from the repos of your Linux distro of choice – I use Void currently) that will scan a folder for duplicate files (based on hashing) and remove duplicates, leaving one copy of each.
Unfortunately it didn’t work right away – the initial scan said every file had a unique hash. I knew this was not true – I picked a specific message that I’d found multiple copies of from an order I placed with a beer delivery company.
I hashed the two copies of the same message and sure enough they produced different hashes. A quick diff showed why: an email header called X-TUID.
Checking a few more messages confirmed that this header was the only thing preventing these files from hashing to the same value.
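The same check can be done without modifying any files by hashing each message with the X-TUID line filtered out on the fly. This is a sketch, assuming `sha256sum` is available (it ships with GNU coreutils). The name `hash_without_tuid` is made up for the example.

```shell
#!/bin/sh
# Hypothetical helper: print the SHA-256 of a message with any X-TUID
# header line removed. The file on disk is left untouched.
# Usage: hash_without_tuid <message file>
hash_without_tuid() {
    # Filter only the header block (everything up to the first blank
    # line) so a body line that mentions X-TUID is still hashed.
    sed '1,/^$/{/^X-TUID:/d;}' "$1" | sha256sum | cut -d' ' -f1
}
```

Two copies of the same message should then produce the same value from `hash_without_tuid`, even though a plain `sha256sum` of the files differs.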
I was able to remove the headers pretty easily with a quick sed command:
sed -i '/X-TUID/d' *
Now rmlint would behave correctly, but I wasn’t sure how mbsync would behave without this header. Would I just cause more confusion? Then I realized that rmlint doesn’t actually delete the files when you run the command – it generates a script to do it after the fact. This feature never made sense to me before but it was perfect for this use case.
What I did was:
- Make a copy of my mailbox
- Use sed on the copy to strip the X-TUID headers
- Run rmlint on the copy but don’t execute the script it produces yet
- Copy the script back to the original mailbox where the X-TUID headers were still intact and run it there
This worked well because the script just removes filenames and the filenames weren’t changed when I removed the X-TUID headers, so the script had no idea it wasn’t deleting the same files that were compared to generate it.
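For comparison, here is a rough sketch of the same idea without rmlint and without a second copy of the mailbox. To be clear, this is not the procedure described above and has not been run against a real mailbox. It hashes each message with the X-TUID header filtered out and prints the paths of the redundant copies, deleting nothing. It assumes `sha256sum` is available and that no filename contains a newline. The name `list_redundant_copies` is made up for the example.

```shell
#!/bin/sh
# Hypothetical helper: print the path of every file whose content (with
# X-TUID header lines ignored) matches an earlier file in sorted order.
# Nothing is deleted; the output is a list to review.
# Usage: list_redundant_copies <mailbox>
list_redundant_copies() {
    mailbox=$1
    # Walk all message files, skipping the git backup if there is one.
    find "$mailbox" -type f ! -path '*/.git/*' | sort |
    while IFS= read -r file; do
        hash=$(sed '1,/^$/{/^X-TUID:/d;}' "$file" | sha256sum | cut -d' ' -f1)
        printf '%s %s\n' "$hash" "$file"
    done |
    # For the second and later files with the same hash, drop the hash
    # column and print the path.
    awk 'seen[$1]++ { sub(/^[^ ]+ /, ""); print }'
}
```

Like the rmlint script, the output is meant to be read before anything is removed, and which copy counts as the "first" is simply whichever path sorts earliest.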
I verified that this had actually worked (also using git, which again, is perfect for this) and then set mbsync to sync changes in both directions.
The lessons I learned here include not deleting UIDs and regenerating them until you’re sure that messages are actually not duplicates, and also why rmlint defaults to not running the actual removal right away.
Also when experimenting with your mailbox, git is actually the perfect backup tool because it’s all plain text and you can see what changed in between revisions.
I still haven’t figured out what X-TUID is for but I wouldn’t be surprised if base64 decoding it yields the same number that’s encoded in the filename.
4 thoughts on “Deduping mutt with rmlint”
Interesting story. I now know what to watch for when doing stuff like this. Thanks for sharing this.
Ran into the exact same situation while syncing Gmail. This article helped! Thanks a lot for sharing this!
Also, per this gist, setting `CopyArrivalDate yes` for mbsync seems to have resolved this for the gist author: https://gist.github.com/lewisthompson/bb0e0399254c90cf36dba03956bd2ff0