Diffing and Patching Large Binary Files Part II

After reading my last post a second time, I realized that without further explanation the diff-binary.sh and patch-binary.sh scripts look like mere wrappers around a specific rsync call. But there is a little more to it. This post therefore explains the rationale behind these scripts and extends them a bit (with input handling and hashing).

But first, let’s see the script(s) again (with a slightly more sophisticated input handling). First diff-binary.sh:

#!/bin/bash

if [ $# -ne 3 ]; then
    echo "Usage: diff-binary.sh oldfile newfile patchfile"
    echo "  oldfile is the last binary file"
    echo "  newfile is the new binary file"
    echo "  patchfile is the resulting patch file :)"
    exit 1
fi

oldfile=$1
newfile=$2
patchfile=$3

if [ -d "$oldfile" ]; then
    echo "oldfile \"$oldfile\" is a directory."
    exit 1
fi

if [ ! -f "$oldfile" ]; then
    echo "oldfile \"$oldfile\" does not exist."
    exit 1
fi

if [ -d "$newfile" ]; then
    echo "newfile '$newfile' is a directory."
    exit 1
fi

if [ ! -f "$newfile" ]; then
    echo "newfile '$newfile' does not exist."
    exit 1
fi

if [ -d "$patchfile" ]; then
    echo "patchfile '$patchfile' is a directory."
    exit 1
fi

if [ -f "$patchfile" ]; then
    echo "patchfile '$patchfile' already exists."
    exit 1
fi

rsync -ar --only-write-batch="$patchfile" "$oldfile" "$newfile"
rm "${patchfile}.sh"

Second, patch-binary.sh:

#!/bin/bash

if [ $# -ne 3 ]; then
    echo "Usage: patch-binary.sh patchfile fullfile patchedfile"
    echo "These files correspond to the arguments of diff-binary.sh:"
    echo "  patchfile   ^= patchfile <- the result of diff-binary.sh"
    echo "  fullfile    ^= newfile"
    echo "  patchedfile ^= oldfile <- here patch-binary.sh will store its result :)"
    exit 1
fi

patchfile=$1
fullfile=$2
patchedfile=$3

if [ -d "$patchfile" ]; then
    echo "patchfile \"$patchfile\" is a directory."
    exit 1
fi

if [ ! -f "$patchfile" ]; then
    echo "patchfile \"$patchfile\" does not exist."
    exit 1
fi

if [ -d "$fullfile" ]; then
    echo "fullfile '$fullfile' is a directory."
    exit 1
fi

if [ ! -f "$fullfile" ]; then
    echo "fullfile '$fullfile' does not exist."
    exit 1
fi

if [ -d "$patchedfile" ]; then
    echo "patchedfile '$patchedfile' is a directory."
    exit 1
fi

if [ -f "$patchedfile" ]; then
    echo "patchedfile '$patchedfile' already exists."
    exit 1
fi

cp "$fullfile" "$patchedfile"
rsync -ar --read-batch="$patchfile" "$patchedfile"

So what happens here? The scripts check that the input files exist and are regular files rather than directories. For the files that the scripts create ($patchfile and $patchedfile, respectively), they ensure that no existing file would be overwritten. That helps to reduce sources of error, but it is no voodoo at all.
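Since the same two checks are repeated for every argument, they could be factored into small helper functions. The following is only a sketch: the names require_file and require_absent are made up for illustration and are not part of the original scripts.

```shell
#!/bin/bash

# Hypothetical helpers (not part of the original scripts) that bundle the
# repeated checks. Both print a message to stderr and return 1 on failure.

# require_file LABEL PATH: PATH must be an existing regular file.
require_file() {
    if [ -d "$2" ]; then
        echo "$1 '$2' is a directory." >&2
        return 1
    fi
    if [ ! -f "$2" ]; then
        echo "$1 '$2' does not exist." >&2
        return 1
    fi
}

# require_absent LABEL PATH: PATH must not exist yet, so nothing is overwritten.
require_absent() {
    if [ -d "$2" ]; then
        echo "$1 '$2' is a directory." >&2
        return 1
    fi
    if [ -e "$2" ]; then
        echo "$1 '$2' already exists." >&2
        return 1
    fi
}
```

With these in place, the argument handling of diff-binary.sh would shrink to lines such as require_file oldfile "$oldfile" || exit 1 and require_absent patchfile "$patchfile" || exit 1.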

So what is special about the use of rsync here? rsync is a wonderful tool for copying whole directory trees from one place to another, e.g., to create backups. It has several smart features that reduce the amount of traffic if, e.g., only a fraction of the data has changed since the last backup. It can even store the (binary) diff between a source and a destination file in a batch file, which can then be copied to and applied on multiple destinations that hold the same data. A blog post describing this approach therefore uses the command rsync --write-batch=diff current previous to create a patch. But we do not want a diff that is later applied to a previous version in order to create the current version. Instead, we want to recreate the previous version by applying a patch to the current version.

Why is that? As in event sourcing or in repository technologies like git, we want a snapshot of the current version for fast access. Most likely we will restore from the most recent backup, e.g., after a disk failure or something similar. So it is OK if older versions have to be recreated by applying incremental patches, since that is needed less often.

Therefore, we use the rsync command the other way around: it creates a patch that can be applied to the current version to recreate the previous version (and so on, if we need to recreate an even older version). Hence, we do not use the shell script that rsync generates alongside the batch file (it is deleted at the end of diff-binary.sh).
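To make the difference in direction concrete, here is a small sketch that puts both variants side by side. The function names are hypothetical; the rsync calls mirror the one in diff-binary.sh, and only the order of the two file arguments differs. With --only-write-batch, rsync writes the batch file without updating the destination.

```shell
#!/bin/bash

# Reverse patch (what diff-binary.sh does): the batch file describes how to
# turn the NEW file back into the OLD one.
# usage: make_reverse_patch oldfile newfile patchfile
make_reverse_patch() {
    rsync -a --only-write-batch="$3" "$1" "$2" && rm -f "${3}.sh"
}

# Forward patch (the direction we do NOT want here): the batch file describes
# how to turn the OLD file into the NEW one.
# usage: make_forward_patch oldfile newfile patchfile
make_forward_patch() {
    rsync -a --only-write-batch="$3" "$2" "$1" && rm -f "${3}.sh"
}
```

In both cases the first file argument is the version that applying the patch will produce, and the second is the version the patch has to be applied to.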

In patch-binary.sh we create a copy of $fullfile and name it $patchedfile. Then we apply our patch file to $patchedfile. As a result, $patchedfile represents the previous version. We do not apply the patch file to the current version directly, since that would overwrite it.
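Going back more than one version means repeating this step with each patch in turn, newest first. A minimal sketch of that loop follows, assuming every patch was created by diff-binary.sh against the version that directly follows it. The function name restore_older is hypothetical.

```shell
#!/bin/bash

# restore_older CURRENT OUTPUT PATCH...
# Hypothetical helper: copies CURRENT to OUTPUT, then applies the given reverse
# patches in order (newest first), stepping back one version per patch.
restore_older() {
    local current=$1 output=$2 patch
    shift 2

    if [ ! -f "$current" ]; then
        echo "current file '$current' does not exist." >&2
        return 1
    fi
    if [ -e "$output" ]; then
        echo "output '$output' already exists." >&2
        return 1
    fi

    # Work on a copy so the current version is never overwritten.
    cp "$current" "$output" || return 1

    for patch in "$@"; do
        # Same call as in patch-binary.sh: apply the batch to the working copy.
        rsync -a --read-batch="$patch" "$output" || return 1
    done
}
```

A call such as restore_older data.bin data.v1.bin patch-3-to-2 patch-2-to-1 (all file names made up) would first turn the copy into version 2 and then into version 1.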

This is all the magic. In order to make this script even more useful and safe, you can compress and decompress the files and use hashing to check whether applying the patch was successful. Let’s add hashing to our script (diff-binary.sh):

#!/bin/bash

# [...]

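# Record the checksum of the old file so that a later restore can be verified.
# Note: the checksum file stores the file name exactly as given in $oldfile.
md5sum "$oldfile" > "${oldfile}.md5"
rsync -ar --only-write-batch="$patchfile" "$oldfile" "$newfile"
rm "${patchfile}.sh"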

And this is how it can be checked after applying the patch (patch-binary.sh):

#!/bin/bash

# [...]

cp "$fullfile" "$patchedfile"
rsync -ar --read-batch="$patchfile" "$patchedfile"
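# md5sum -c checks the file name recorded in the .md5 file, so this only works
# if $patchedfile is the same path that diff-binary.sh received as oldfile.
md5sum -c "${patchedfile}.md5"
exit $?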
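Because md5sum -c looks up the file name stored inside the checksum file, a variant that compares only the digests may be more convenient when the restored file gets a different name than the original old file. The following is a sketch under that assumption; verify_md5 is a hypothetical helper, not part of the scripts above.

```shell
#!/bin/bash

# verify_md5 CHECKSUM_FILE FILE
# Hypothetical helper: succeeds if the digest stored in CHECKSUM_FILE equals
# the MD5 digest of FILE, regardless of the file name recorded next to it.
verify_md5() {
    local expected actual
    # The first whitespace-separated field of an md5sum line is the hex digest.
    expected=$(awk '{ print $1; exit }' "$1") || return 1
    # Reading from stdin keeps the file name out of md5sum's output.
    actual=$(md5sum < "$2" | awk '{ print $1 }') || return 1
    [ -n "$expected" ] && [ "$expected" = "$actual" ]
}
```

At the end of patch-binary.sh this could be called as verify_md5 "$checksumfile" "$patchedfile", where $checksumfile would be an additional argument pointing at the .md5 file written by diff-binary.sh.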

Exciting 🤓.
