snapdiff
CLI tool, written in Rust, to analyse the difference between two directory snapshots
10 min. read

I periodically create snapshots of large directory trees for backup purposes. The snapshotting procedure is basically cp -R, so each snapshot is a full copy of the directory tree in question.

I archive these snapshots either on my internal disk or on an external one. In any event, I end up with a whole bunch of snapshots – all representing the same directory tree, but at different points in time.

What I like about this backup process is that the copy mechanism itself is simple and reliable. The snapshots are functional and self-contained, and I can access the backup data independent of specific tooling (or even independent of the operating system).

However, one thing that slightly worries me is that I might accidentally lose or mess up data in between snapshot events. Since I cycle and eventually purge old snapshots to free up disk space for new ones, this could subtly lead to data loss in the long run, without me noticing.

My goal was to get some insight into how a directory tree has evolved over time.1 For this purpose, I created a CLI tool that can analyse the difference between two snapshots (at file level) and summarise the changes in a concise report.

In this blog post, I demonstrate usage of the CLI tool, outline the underlying diffing algorithm, and discuss performance characteristics of the implementation.

snapdiff

The tool is called snapdiff (short for “snapshot difference”), and is implemented in Rust. You can find its sources on GitHub (beware of rough edges, though!). snapdiff compares two directory trees, and then summarises how many files are identical, and how many have been modified, moved, added, or deleted.

In terms of scale, the tool is built to support snapshots containing some 100,000 files of any size. These are roughly the dimensions that I need for my own backups, which typically contain such a number of files and a few hundred GB of data overall.

As an example, let’s say we wanted to compare two snapshots of a particular directory tree, taken one month apart. With the snapshot folders being called 2023-09-01/ and 2023-10-01/2, the execution of snapdiff would look like this:

$ snapdiff 2023-09-01/ 2023-10-01/

                           FILES             BYTES
                                     G   M   K   B
TOTAL       Snap 1        87,243    98,407,188,994
            Snap 2        87,319    98,591,738,372
            
OF WHICH    Identical     87,134    97,551,550,976
            Moved             38       134,217,728
            Added             87       234,881,024
            Deleted           11        50,331,648
            Modified         147       671,088,644 (+282,172)

Via an optional --report flag, snapdiff can also write a more detailed summary into a file (enumerating all files that don’t fall into the “identical” bucket).

The categories are defined as follows: a file counts as identical if it appears at the same path with the same contents in both snapshots, as modified if the path is the same but the contents differ, as moved if the same contents appear under a different path, as added if it only exists in the second snapshot, and as deleted if it only exists in the first one.

(Figure: sample visualisation of two directory trees.)

Diffing Algorithm

snapdiff determines the snapshot diff without any historical information or file metadata, but solely on the basis of the file paths and contents as they are on disk at the moment of execution.3

It performs a total of two directory traversals (one per snapshot), so each file is processed once. The algorithm goes like this (a Rust sketch follows after the list):

  1. Traverse the first snapshot directory tree. For each file, compute the hash of its contents, and store it in a lookup table. (This initial lookup table allows lookup both by path and by contents hash.)
  2. Traverse the second snapshot directory tree. For each file, compute the hash of its contents, and try to look up the file by its path in the initial lookup table (the one from step 1).
    • If there is a file at the same path with the same hash, count the file as “identical”.
    • If there is a file at the same path with a different hash, count the file as “modified”.
    • In any of these two cases, remove the file from the initial lookup table.
    • Otherwise, store the file in a remainder list.
  3. For each file in the remainder list, try to look up the file by its contents hash in the initial lookup table.
    • If there is a file, count the file as “moved”. (Note: it’s impossible for the file to be “identical” at this point.)
    • Otherwise, count the file as “added”.
    • In any of these two cases, remove the file from the initial lookup table.
  4. For each file that is now still left in the initial lookup table, count it as “deleted”.
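
To make the steps concrete, here is a condensed sketch of the bookkeeping in Rust. The types and names are made up for this post and heavily simplified compared to snapdiff’s actual implementation (which, for example, also tracks byte counts and the per-file details for the report):

use std::collections::HashMap;
use std::path::PathBuf;

// Illustrative sketch only: a file is reduced to its path plus a hash of
// its contents.
type Hash = u64;

#[derive(Debug, Default)]
struct Counts {
    identical: usize,
    modified: usize,
    moved: usize,
    added: usize,
    deleted: usize,
}

fn diff(snap1: HashMap<PathBuf, Hash>, snap2: HashMap<PathBuf, Hash>) -> Counts {
    let mut counts = Counts::default();

    // Step 1: the initial lookup table, indexed by path and by contents hash.
    let mut by_path: HashMap<PathBuf, Hash> = snap1;
    let mut by_hash: HashMap<Hash, Vec<PathBuf>> = HashMap::new();
    for (path, hash) in &by_path {
        by_hash.entry(*hash).or_default().push(path.clone());
    }

    // Step 2: match snapshot-2 files by path first.
    let mut remainder: Vec<Hash> = Vec::new();
    for (path, hash) in snap2 {
        match by_path.remove(&path) {
            Some(old) if old == hash => counts.identical += 1,
            Some(_) => counts.modified += 1,
            None => remainder.push(hash),
        }
    }

    // Step 3: match the remaining files by contents hash ("moved").
    for hash in remainder {
        let candidate = by_hash.get_mut(&hash).and_then(|paths| {
            // Skip candidates that were already consumed in step 2.
            while let Some(p) = paths.pop() {
                if by_path.contains_key(&p) {
                    return Some(p);
                }
            }
            None
        });
        match candidate {
            Some(p) => {
                counts.moved += 1;
                by_path.remove(&p);
            }
            None => counts.added += 1,
        }
    }

    // Step 4: whatever is still left in the initial table was deleted.
    counts.deleted = by_path.len();
    counts
}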

By the way, considering the intended use-case of snapdiff, we can heuristically expect most files to be identical.

Edge-cases

There are a few potential pitfalls that the above algorithm manages to avoid. These are mainly related to duplicated file contents, and the order in which files are processed.

For example, say snapshot 1 contains a duplicate file (i.e., two files with the same contents) at locations /a and /b. In snapshot 2, however, /a has been renamed (moved) to /c, so snapshot 2 contains /b and /c. A naive comparison could incorrectly count two moved files instead of one identical and one moved file.

SNAP 1        SNAP 2
/a
/b            /b
              /c
Undesired:  a → b  and  b → c
Expected:   b = b  and  a → c

A similar thing could happen if snapshot 1 contains a file at location /b, which is then duplicated to /a in snapshot 2, so that snapshot 2 contains two files /a and /b with the same contents. In this case, the algorithm should report one file identical and one file added, instead of one file moved and one file added.

SNAP 1        SNAP 2
              /a
/b            /b
Undesired:  b → a  and  +b
Expected:   b = b  and  +a

Performance considerations

Preface: I carried out the following performance measurements on a machine with a 3.5 GHz 10-core ARM CPU and a built-in SSD with ~4 GB/s read speed. The sample snapshot for testing comprises 100,000 files with an overall size of 100 GB.

For simplicity, all of the measurements below are based on these reference numbers.

snapdiff has an overall throughput rate of around 2.5 GB/s. Processing the test sample therefore takes around 75 seconds in total – 35 seconds per 100 GB snapshot, plus a few seconds of additional overhead.

The implementation is designed to spread the work evenly across all available CPU cores, and each core operates at nearly full capacity (~95%) during runtime. The overall throughput rate scales almost linearly over the cores, so you can roughly say that each core contributes ~250 MB/s throughput.

Memory

Due to the expected volume of files and bytes, it’s not possible to keep all files in memory. Therefore, snapdiff computes hashes of the file contents as the basis for the comparison, and stores those hashes in memory.

However, for computing the hash, it’s not even feasible to load a single file into memory in its entirety, since an individual file can easily be bigger than the available memory. Therefore, file contents are read chunk-wise, which means that the hashes are computed incrementally.
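
As an illustration, an incrementally computed file hash might look like the following sketch. It assumes the crc crate and the CRC-64/XZ variant purely as an example; the concrete crate, CRC flavour, and error handling in snapdiff may well differ:

use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

use crc::{Crc, CRC_64_XZ};

// 10 MB chunks (see the memory discussion below).
const CHUNK_SIZE: usize = 10 * 1024 * 1024;
const CRC64: Crc<u64> = Crc::<u64>::new(&CRC_64_XZ);

fn hash_file(path: &Path) -> io::Result<u64> {
    let mut file = File::open(path)?;
    let mut buf = vec![0u8; CHUNK_SIZE];
    let mut digest = CRC64.digest();

    loop {
        // Read the next chunk; the file is never held in memory in full.
        let n = file.read(&mut buf)?;
        if n == 0 {
            break; // end of file
        }
        // Feed the chunk into the running digest (incremental hashing).
        digest.update(&buf[..n]);
    }

    Ok(digest.finalize())
}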

Memory consumption peaks at around 115 MB. The memory footprint consists of two aspects: the chunk size is 10 MB, so 100 MB of the process memory are due to 10 cores allocating a 10 MB chunk each. The rest of the memory consumption (~15 MB) is mainly for building up and holding the lookup tables.

Increasing the chunk size doesn’t have any measurable effect on performance.

Hashing

As far as the CPU is concerned, the number-one speed bottleneck is – by far – the hash computation. Hence, the choice (and implementation) of the hashing algorithm is pivotal. snapdiff uses a CRC algorithm,4 which is relatively fast, and also guarantees an even distribution of the hashes.

CRC is available with various hash lengths, and choosing the right length is a balancing act between computational speed and the risk of hash collisions. With a length of 32 bits, for example, there would be a 1% chance of a hash collision in a sample size of 10,000 files – which clearly isn’t acceptable for this application.

snapdiff uses a length of 64 bits, which results in a 1% chance of a collision in a sample size of 600,000,000 files. (For some extra safety margin against collisions, by the way, snapdiff also takes the file size into consideration when comparing files.)
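
These collision figures follow from the usual birthday-problem approximation for n files and a b-bit hash (which reproduces the rough numbers above):

p ≈ n^2 / (2 · 2^b)

For b = 32 and n = 10,000 this gives p ≈ 10^8 / 2^33 ≈ 1.2%; for b = 64 and n = 600,000,000 it gives p ≈ 3.6 · 10^17 / 2^65 ≈ 1%.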

It doesn’t make sense to use cryptographic hashing algorithms such as SHA or MD5, since these would be significantly slower. Cryptographic hashing would only be relevant if a malicious third party could deliberately fabricate a file with a colliding hash (e.g., to sneak in a file modification while making the respective file appear identical).

Since we read the files chunk-wise, it’s critical to compute the chunk hashes in sequential order, before combining them into the eventual file hash. This constraint implies that we cannot trivially parallelise the hashing work on chunk level, but only on file level.
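
Illustratively, such a file-level worker pool could be sketched like this, with the caveat that the actual queueing, error handling, and result collection in snapdiff look different:

use std::path::{Path, PathBuf};
use std::sync::Mutex;
use std::thread;

// Stand-in for the chunk-wise hasher sketched in the previous section.
fn hash_file(_path: &Path) -> u64 {
    0
}

// Hash whole files in parallel: each worker pulls the next file off a
// shared queue; the chunks of any single file are processed sequentially
// by whichever worker picked that file up.
fn hash_all(files: Vec<PathBuf>, workers: usize) -> Vec<(PathBuf, u64)> {
    let queue = Mutex::new(files);
    let results = Mutex::new(Vec::new());

    thread::scope(|scope| {
        for _ in 0..workers {
            scope.spawn(|| loop {
                // Take the next file off the shared queue, or stop if empty.
                let Some(path) = queue.lock().unwrap().pop() else { break };
                let hash = hash_file(&path);
                results.lock().unwrap().push((path, hash));
            });
        }
    });

    results.into_inner().unwrap()
}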

Disk I/O

The influence of disk I/O on performance varies widely: compared to the cost of hashing, disk latency is a rather minor worry for modern, built-in SSDs. However, read speed can be orders of magnitude lower on magnetic HDDs, in which case disk I/O becomes the main limiting factor.5

For example, when reading the snapshot data via USB 2 from an external HDD spinning at 7200 rpm, I can see the overall throughput rate dropping to ~50–150 MB/s. That way, the evaluation of one snapshot takes a whopping 30 minutes.

Moreover, and in contrast to SSDs, the throughput rate can also be quite erratic on HDDs: one issue is that performance can degrade if data isn’t read sequentially, because the disk’s head would have to jump around constantly.

In order to play somewhat nicely with HDDs, snapdiff allows you to configure the number of utilised CPU cores (via the --worker flag). For even more fine-grained control, it lets you specify the number of workers per snapshot side. So if one snapshot resides on an external HDD, whereas the other is on a built-in SSD, you can throttle snapdiff down to one core for the HDD, while going full steam on the SSD.

Heuristics

snapdiff employs a few heuristics to optimise performance. One corner case is when the snapshots contain both small files and very large ones. In this scenario, one of the large files might happen to be processed at the very end. Since we can only parallelise across files, not across file chunks, this creates a situation where one CPU core is left hashing the last (large) file, while all other cores idle (as the file queue is empty at this point).

There are various approaches to avoid or alleviate this problem. What snapdiff does is perform an initial scanning step, which allows it to prioritise the hashing work by file size in descending order. That way, the hashes of large files are computed at the very beginning, making sure there is enough work towards the end to “fill the gaps”.

The performance overhead of the initial scanning step is barely noticeable (it only retrieves file metadata).
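
A minimal version of such a scan-then-prioritise step might look like this (the function names are illustrative, and snapdiff’s real scanner gathers more metadata than just the size):

use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Recursively collect file paths together with their sizes.
// Only metadata is read here, never file contents.
fn scan(dir: &Path, out: &mut Vec<(PathBuf, u64)>) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let meta = entry.metadata()?;
        if meta.is_dir() {
            scan(&entry.path(), out)?;
        } else if meta.is_file() {
            out.push((entry.path(), meta.len()));
        }
    }
    Ok(())
}

fn scan_prioritised(root: &Path) -> io::Result<Vec<(PathBuf, u64)>> {
    let mut files = Vec::new();
    scan(root, &mut files)?;
    // Largest files first, so no single huge file is left for the very end
    // while the other workers have run out of work.
    files.sort_by(|a, b| b.1.cmp(&a.1));
    Ok(files)
}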


  1. In my case, I’m primarily interested in diffing the latest snapshot with the second-to-latest one. ↩︎

  2. When I create a snapshot, I usually name the snapshot folder (i.e., the copy of the original directory tree) like the date at which it was taken. ↩︎

  3. The “stateless” comparison has limitations, of course: e.g., if you modify a file and then also rename (or move) it, then snapdiff would report that as one file deleted and one added. ↩︎

  4. CRC means “cyclic redundancy check”, and it’s most widely known for error detection in network protocols and storage systems. ↩︎

  5. External SSDs are somewhere in between HDDs and built-in SSDs, and their read performance largely depends on the characteristics of the connector interface. ↩︎
