4 changes: 2 additions & 2 deletions Cargo.toml
@@ -3,12 +3,12 @@
members = [
"crates/*",
"crates/bpe/benchmarks",
"crates/bpe/tests",
"crates/bpe/tests"
]
resolver = "2"

[profile.bench]
debug = true

[profile.release]
debug = true
debug = true
23 changes: 23 additions & 0 deletions crates/hriblt/Cargo.toml
@@ -0,0 +1,23 @@
[package]
name = "hriblt"
version = "0.1.0"
edition = "2024"
description = "Algorithm for rateless set reconciliation"
repository = "https://github.com/github/rust-gems"
license = "MIT"
keywords = ["set-reconciliation", "sync", "algorithm", "probabilistic"]
categories = ["algorithms", "data-structures", "mathematics", "science"]

[dependencies]
Collaborator: we also want some benchmarks to show throughput

thiserror = "2"
clap = { version = "4", features = ["derive"], optional = true }
rand = { version = "0.9", optional = true }

[[bin]]
name = "hriblt-bench"
path = "evaluation/bench.rs"
required-features = ["bin"]

[features]
default = []
bin = ["dep:clap", "dep:rand"]
55 changes: 55 additions & 0 deletions crates/hriblt/README.md
@@ -0,0 +1,55 @@
# Hierarchical Rateless Bloom Lookup Tables

A novel algorithm for computing the symmetric difference between two sets, where the amount of data shared is proportional to the size of the difference between the sets rather than to their overall size.

## Usage

Add the library to your `Cargo.toml` file.

```toml
[dependencies]
hriblt = "0.1"
```

Create two encoding sessions: one containing Alice's data and another containing Bob's data. Bob's data might have been sent to you over a network, for example.

The following example attempts to reconcile the differences between two such sets of `u64` integers, written from the perspective of "Bob", who has received some symbols from "Alice".

```rust
use hriblt::{DecodingSession, EncodingSession, DefaultHashFunctions};
// On Alice's computer

// Alice creates an encoding session...
let mut alice_encoding_session = EncodingSession::<u64, DefaultHashFunctions>::new(DefaultHashFunctions, 0..128);

// And adds her data to that session, in this case the numbers from 0 to 10.
for i in 0..=10 {
alice_encoding_session.insert(i);
}

// On Bob's computer

// Bob creates his encoding session; note that the range **must** be the same as Alice's
let mut bob_encoding_session = EncodingSession::<u64, DefaultHashFunctions>::new(DefaultHashFunctions, 0..128);

// Bob adds his data, the numbers from 5 to 15.
for i in 5..=15 {
bob_encoding_session.insert(i);
}

// "Subtract" Bob's coded symbols from Alice's, the remaining symbols will be the symmetric
// difference between the two sets, iff we can decode them. This is a commutative function so you
// could also subtract Alice's symbols from Bob's and it would still work.
let merged_sessions = alice_encoding_session.merge(bob_encoding_session, true);

let decoding_session = DecodingSession::from_encoding(merged_sessions);

assert!(decoding_session.is_done());

let mut diff = decoding_session.into_decoded_iter().map(|v| v.into_value()).collect::<Vec<_>>();

diff.sort();

assert_eq!(diff, [0, 1, 2, 3, 4, 11, 12, 13, 14, 15]);
```

Collaborator: You want to show here that it ALSO tells you whether a value was Inserted or Removed.
Maybe use the sign of the value for demonstration purposes?
Collaborator: I think we also want an example showing the hierarchical aspect of this.
I.e. both sides compute an encoding session of size 1024 and convert it into a hierarchy. Then, one side transfers 128 blocks at a time until decoding succeeds.

21 changes: 21 additions & 0 deletions crates/hriblt/docs/hashing_functions.md
@@ -0,0 +1,21 @@
# Hash Functions

This library has a trait, `HashFunctions`, which is used to create the hashes required to place your symbol into the range of coded symbols.

The following documentation provides more details on this trait in particular. How and why this is done is explained in the `overview.md` documentation.
Copilot AI (Jan 16, 2026): Documentation reference to a non-existent file: the documentation mentions "How and why this is done is explained in the `overview.md` documentation" but no such file exists in the docs directory. This reference should either be removed or the file should be created.

Suggested change:
- The following documentation provides more details on this trait in particular. How and why this is done is explained in the `overview.md` documentation.
+ The following documentation provides more details on this trait in particular.


## Hash stability

When using HRIBLT in production systems, it is important to consider the stability of your hash functions.

We provide a `DefaultHashFunctions` type which is a wrapper around the `DefaultHasher` type provided by the Rust standard library. Though the seed for this function is fixed, it should be noted that the hashes produces by this type are *not* guaranteed to be stable across different versions of the Rust standard library. As such, you should not use this type for any situation where clients might potentially be running on a binary built with an unspecified version of Rust.
Collaborator suggested change: "the hashes produces by this type" → "the hashes produced by this type".

We recommend you implement your own `HashFunctions` implementation with a stable hash function.
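
As an illustration of what "stable" means here, the sketch below implements FNV-1a, a simple hash whose output depends only on its input bytes and seed, never on the Rust version. How such a function would be wired into the `HashFunctions` trait depends on the trait's actual shape, which is not shown here; the seed-mixing scheme is just one possible choice.

```rust
// A minimal sketch of a stable 64-bit hash (FNV-1a). Unlike `DefaultHasher`,
// its output is fully determined by the input bytes and the seed, so it will
// not change between Rust releases. The seed mixing below is one simple
// option, not part of any particular `HashFunctions` implementation.
fn fnv1a_64(seed: u64, bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS ^ seed;
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

fn main() {
    // The same input always produces the same hash value.
    println!("{:016x}", fnv1a_64(0, b"example document id"));
}
```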

## Hash value hashing trick

If the value you're inserting into the encoding session is a high entropy random value, such as a cryptographic hash digest, you can recycle the bytes in that value to produce the coded symbol indexing hashes, instead of hashing that value again. This results in a constant-factor speed up.

For example, if you were trying to find the difference between two sets of documents, each coded symbol could be a SHA1 hash of the document content rather than the whole document. Since each SHA1 digest contains 20 bytes of high-entropy data, instead of hashing this value five more times to produce the five coded symbol indices, we can simply slice five `u32` values out of the digest itself.
Collaborator: I think we should have a feature flag which provides the SHA1 functionality for you...

This is a useful trick because hash values are often used as IDs for documents during set reconciliation since they are a fixed size, making serialization easy.
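
The sketch below illustrates the digest-slicing trick described above. It assumes a 20-byte digest and five coded symbol indices, and maps each index into a symbol range of a given size; how the indices are actually consumed by an encoding session is up to your `HashFunctions` implementation.

```rust
// Illustrative sketch: derive five coded symbol indices from a 20-byte digest
// (e.g. a SHA1 hash) by slicing, instead of hashing the value again.
fn indices_from_digest(digest: &[u8; 20], num_symbols: u32) -> [u32; 5] {
    let mut indices = [0u32; 5];
    for (i, chunk) in digest.chunks_exact(4).enumerate() {
        // Each 4-byte slice of a high-entropy digest is already uniformly
        // distributed, so reinterpreting it as a u32 costs no extra hashing.
        let raw = u32::from_le_bytes(chunk.try_into().unwrap());
        indices[i] = raw % num_symbols;
    }
    indices
}

fn main() {
    let digest = [0x5au8; 20]; // stand-in for a real SHA1 digest
    println!("{:?}", indices_from_digest(&digest, 128));
}
```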
17 changes: 17 additions & 0 deletions crates/hriblt/docs/sizing.md
@@ -0,0 +1,17 @@
# Sizing your HRIBLT

Because the HRIBLT is rateless, it is possible to append additional data in order to make decoding possible. That is, it does not need to be sized in advance like a standard invertible bloom lookup table.

Regardless, there are some advantages to getting the size of your decoding session correct the first time. An example might be if you're performing set reconciliation over some RPC and you want to minimise the number of round trips it takes to perform a decode.

## Coded Symbol Multiplier

The number of coded symbols required to find the difference between two sets is proportional to the size of that difference. The following chart shows the relationship between the number of coded symbols required to decode the HRIBLT and the size of the diff. Note that the size of the base set (before diffs were added) was fixed at 1000 entries.

`y = len(coded_symbols) / diff_size`

![Coded symbol multiplier](../evaulation/overhead/overhead.png)
Collaborator: looks like the file path got messed up :) `evaulation`

For small diffs, the number of coded symbols required per value is larger; after a difference of approximately 100 values the coefficient settles at around 1.3 to 1.4.
Collaborator: the graph should probably be logarithmic in the diff size to show that the variance goes down with larger n!
I.e. for large (>10000) the variance is almost completely gone.


You can use this chart, combined with an estimate of the diff size (perhaps from a `geo_filter`), to increase the probability that you will have a successful decode after a single round-trip while also minimising the amount of data sent.
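
As a rough sketch of that sizing heuristic: the 1.4 multiplier comes from the chart's plateau, and the minimum of 8 symbols is an arbitrary floor for very small estimates, not a library constant.

```rust
// Hedged sketch: pick an initial number of coded symbols from an estimated
// diff size using the ~1.4x multiplier observed in the chart above. The
// multiplier and the floor of 8 are illustrative values, not library constants.
fn initial_coded_symbols(estimated_diff: usize) -> usize {
    const MULTIPLIER: f64 = 1.4;
    ((estimated_diff as f64 * MULTIPLIER).ceil() as usize).max(8)
}

fn main() {
    // e.g. a geo_filter estimates ~100 differing values
    println!("{}", initial_coded_symbols(100)); // 140
}
```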
78 changes: 78 additions & 0 deletions crates/hriblt/evaluation/README.md
@@ -0,0 +1,78 @@
# Evaluation

The crate includes an overhead evaluation to measure the efficiency of the set reconciliation algorithm.
Below are instructions on how to run the evaluation.

_Note that all of these should be run from the crate's root directory!_

## Overhead Evaluation

The overhead evaluation measures how many coded symbols are required to successfully decode set differences of various sizes.
The key metric is the **overhead multiplier**: the ratio of coded symbols needed to the actual diff size.

See [overhead results](evaluation/overhead.md) for the predefined configurations.

### Running the Evaluation

1. Ensure R and ImageMagick are installed with necessary packages:
   - Install R from [download](https://cran.r-project.org/) or using your platform's package manager.
   - Start `R` and install packages by executing `install.packages(c('dplyr', 'ggplot2', 'readr', 'stringr', 'scales'))`.
   - Install ImageMagick using the official [instructions](https://imagemagick.org/script/download.php).
   - Install Ghostscript for PDF conversion: `brew install ghostscript` (macOS) or `apt install ghostscript` (Linux).

   Collaborator: question: copilot created me a graph with the plotter crate. Should we use that instead?

2. Run the benchmark tool directly to see available options:

       cargo run --release --features bin --bin hriblt-bench -- --help

   Example: run 10000 trials with diff sizes from 1 to 100:

       cargo run --release --features bin --bin hriblt-bench -- \
           --trials 10000 \
           --set-size 1000 \
           --diff-size '1..101' \
           --diff-mode incremental \
           --tsv

3. To generate a plot from TSV output:

       cargo run --release --features bin --bin hriblt-bench -- \
           --trials 10000 --diff-size '1..101' --diff-mode incremental --tsv > overhead.tsv

       evaluation/plot-overhead.r overhead.tsv overhead.pdf

### TSV Output Format

When using `--tsv`, the benchmark outputs tab-separated data with the following columns:

| Column | Description |
|--------|-------------|
| `trial` | Trial number (1-indexed) |
| `set_size` | Number of elements in each set |
| `diff_size` | Number of differences between sets |
| `success` | Whether decoding succeeded (`true`/`false`) |
| `coded_symbols` | Number of coded symbols needed to decode |
| `overhead` | Ratio of coded_symbols to diff_size |
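
If you want to sanity-check the numbers without going through R, a small helper like the sketch below can aggregate the TSV directly. It assumes the column order from the table above and a single header row; the file name `overhead.tsv` matches the example in the previous section.

```rust
// Hedged sketch: read the benchmark's TSV output and print the mean overhead
// per diff size. Column order is assumed to match the table above:
// trial, set_size, diff_size, success, coded_symbols, overhead.
use std::collections::BTreeMap;
use std::fs;

fn main() -> std::io::Result<()> {
    let data = fs::read_to_string("overhead.tsv")?;
    let mut sums: BTreeMap<u64, (f64, u64)> = BTreeMap::new();
    for line in data.lines().skip(1) {
        let cols: Vec<&str> = line.split('\t').collect();
        let diff_size: u64 = cols[2].parse().expect("diff_size column");
        let overhead: f64 = cols[5].parse().expect("overhead column");
        let entry = sums.entry(diff_size).or_insert((0.0, 0));
        entry.0 += overhead;
        entry.1 += 1;
    }
    for (diff_size, (total, count)) in sums {
        println!("diff_size={diff_size}\tmean_overhead={:.3}", total / count as f64);
    }
    Ok(())
}
```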

### Regenerating Documentation Plots

To regenerate the plots in the repository, ensure ImageMagick is available, and run:

    scripts/generate-overhead-plots

_Note that this will always re-run the benchmark!_

## Understanding the Results

The overhead plot shows percentiles of the overhead multiplier across different diff sizes:

- **Gray region**: Full range (p0 to p99) - min to ~max overhead observed
- **Blue region**: Interquartile range (p25 to p75) - where 50% of trials fall
- **Dark line**: Median (p50) - typical overhead

A lower overhead means more efficient encoding. An overhead of 1.0x would mean the number of coded symbols exactly equals the diff size (theoretical minimum).

Typical results show:
- Small diff sizes (1-10) have higher variance and overhead due to the probabilistic nature of the algorithm
- Larger diff sizes (50+) converge to a more stable overhead around 1.3-1.5x
- The algorithm successfully decodes 100% of trials when given up to 10x the diff size in coded symbols
Collaborator: 10x the diff size isn't a great outcome :)
Mention here the min trick I described above (and which we use in production)

Collaborator: I would actually expect that we need AT MOST 50% overhead after some diff size.