4 changes: 2 additions & 2 deletions Cargo.toml
@@ -3,12 +3,12 @@
members = [
"crates/*",
"crates/bpe/benchmarks",
"crates/bpe/tests",
"crates/bpe/tests"
]
resolver = "2"

[profile.bench]
debug = true

[profile.release]
debug = true
debug = true
23 changes: 23 additions & 0 deletions crates/hriblt/Cargo.toml
@@ -0,0 +1,23 @@
[package]
name = "hriblt"
version = "0.1.0"
edition = "2024"
description = "Algorithm for rateless set reconciliation"
repository = "https://github.com/github/rust-gems"
license = "MIT"
keywords = ["set-reconciliation", "sync", "algorithm", "probabilistic"]
categories = ["algorithms", "data-structures", "mathematics", "science"]

[dependencies]
Collaborator: we also want some benchmarks to show throughput

thiserror = "2"
clap = { version = "4", features = ["derive"], optional = true }
rand = { version = "0.9", optional = true }

[[bin]]
name = "hriblt-bench"
path = "evaluation/bench.rs"
required-features = ["bin"]

[features]
default = []
bin = ["dep:clap", "dep:rand"]
55 changes: 55 additions & 0 deletions crates/hriblt/README.md
@@ -0,0 +1,55 @@
# Hierarchical Rateless Bloom Lookup Tables

A novel algorithm for computing the symmetric difference between two sets, where the amount of data shared is proportional to the size of the difference between the sets rather than to their overall size.

## Usage

Add the library to your `Cargo.toml` file.

```toml
[dependencies]
hriblt = "0.1"
```

Create two encoding sessions: one containing Alice's data and another containing Bob's data. Bob's data might have been sent to you over a network, for example.

The following example attempts to reconcile the differences between two such sets of `u64` integers, written from the perspective of "Bob", who has received some symbols from "Alice".

```rust
use hriblt::{DecodingSession, EncodingSession, DefaultHashFunctions};
// On Alice's computer

// Alice creates an encoding session...
let mut alice_encoding_session = EncodingSession::<u64, DefaultHashFunctions>::new(DefaultHashFunctions, 0..128);

// And adds her data to that session, in this case the numbers from 0 to 10.
for i in 0..=10 {
alice_encoding_session.insert(i);
}

// On Bob's computer

// Bob creates his encoding session; note that the range **must** be the same as Alice's
let mut bob_encoding_session = EncodingSession::<u64, DefaultHashFunctions>::new(DefaultHashFunctions, 0..128);

// Bob adds his data, the numbers from 5 to 15.
for i in 5..=15 {
bob_encoding_session.insert(i);
}

// "Subtract" Bob's coded symbols from Alice's, the remaining symbols will be the symmetric
// difference between the two sets, iff we can decode them. This is a commutative function so you
// could also subtract Alice's symbols from Bob's and it would still work.
let merged_sessions = alice_encoding_session.merge(bob_encoding_session, true);

let decoding_session = DecodingSession::from_encoding(merged_sessions);

assert!(decoding_session.is_done());

let mut diff = decoding_session.into_decoded_iter().map(|v| v.into_value()).collect::<Vec<_>>();

diff.sort();

assert_eq!(diff, [0, 1, 2, 3, 4, 11, 12, 13, 14, 15]);
```

Collaborator: You want to show here that it ALSO tells you whether a value was Inserted or Removed.
Maybe use the sign of the value for demonstration purposes?
Collaborator: I think we also want an example showing the hierarchical aspect of this.
I.e. both sides compute an encoding session of size 1024 and convert it into a hierarchy. Then, one side transfers 128 blocks at a time until decoding succeeds.

21 changes: 21 additions & 0 deletions crates/hriblt/docs/hashing_functions.md
@@ -0,0 +1,21 @@
# Hash Functions

This library has a trait, `HashFunctions`, which is used to create the hashes required to place your symbol into the range of coded symbols.

The following documentation provides more details on this trait in particular. How and why this is done is explained in the `overview.md` documentation.
Copilot AI (Jan 16, 2026): Documentation reference to a non-existent file: the documentation mentions "How and why this is done is explained in the `overview.md` documentation" but no such file exists in the docs directory. This reference should either be removed or the file should be created.

Suggested change:
- The following documentation provides more details on this trait in particular. How and why this is done is explained in the `overview.md` documentation.
+ The following documentation provides more details on this trait in particular.


## Hash stability

When using HRIBLT in production systems, it is important to consider the stability of your hash functions.

We provide a `DefaultHashFunctions` type which is a wrapper around the `DefaultHasher` type provided by the Rust standard library. Though the seed for this function is fixed, it should be noted that the hashes produces by this type are *not* guaranteed to be stable across different versions of the Rust standard library. As such, you should not use this type for any situation where clients might potentially be running on a binary built with an unspecified version of Rust.
Collaborator suggested change: "the hashes produces by this type" → "the hashes produced by this type".

We recommend you implement your own `HashFunctions` implementation with a stable hash function.
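
As an illustration of what "stable" means here, the sketch below implements FNV-1a, a simple hash whose output depends only on its input bytes and seed, never on the Rust version. How such a function would be wired into the `HashFunctions` trait depends on the trait's actual shape, which is not shown here; the seed-mixing scheme is just one possible choice.

```rust
// A minimal sketch of a stable 64-bit hash (FNV-1a). Unlike `DefaultHasher`,
// its output is fully determined by the input bytes and the seed, so it will
// not change between Rust releases. The seed mixing below is one simple
// option, not part of any particular `HashFunctions` implementation.
fn fnv1a_64(seed: u64, bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS ^ seed;
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

fn main() {
    // The same input always produces the same hash value.
    println!("{:016x}", fnv1a_64(0, b"example document id"));
}
```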

## Hash value hashing trick

If the value you're inserting into the encoding session is a high entropy random value, such as a cryptographic hash digest, you can recycle the bytes in that value to produce the coded symbol indexing hashes, instead of hashing that value again. This results in a constant-factor speed up.

For example, if you were trying to find the difference between two sets of documents, each coded symbol could be a SHA1 hash of the document content rather than the whole document. Since each SHA1 digest contains 20 bytes of high-entropy data, instead of hashing this value five more times to produce the five coded symbol indices, we can simply slice five `u32` values out of the digest itself.
Collaborator: I think we should have a feature flag which provides the SHA1 functionality for you...

This is a useful trick because hash values are often used as IDs for documents during set reconciliation since they are a fixed size, making serialization easy.
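
The sketch below illustrates the digest-slicing trick described above. It assumes a 20-byte digest and five coded symbol indices, and maps each index into a symbol range of a given size; how the indices are actually consumed by an encoding session is up to your `HashFunctions` implementation.

```rust
// Illustrative sketch: derive five coded symbol indices from a 20-byte digest
// (e.g. a SHA1 hash) by slicing, instead of hashing the value again.
fn indices_from_digest(digest: &[u8; 20], num_symbols: u32) -> [u32; 5] {
    let mut indices = [0u32; 5];
    for (i, chunk) in digest.chunks_exact(4).enumerate() {
        // Each 4-byte slice of a high-entropy digest is already uniformly
        // distributed, so reinterpreting it as a u32 costs no extra hashing.
        let raw = u32::from_le_bytes(chunk.try_into().unwrap());
        indices[i] = raw % num_symbols;
    }
    indices
}

fn main() {
    let digest = [0x5au8; 20]; // stand-in for a real SHA1 digest
    println!("{:?}", indices_from_digest(&digest, 128));
}
```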
17 changes: 17 additions & 0 deletions crates/hriblt/docs/sizing.md
@@ -0,0 +1,17 @@
# Sizing your HRIBLT

Because the HRIBLT is rateless, it is possible to append additional data in order to make decoding possible. That is, it does not need to be sized in advance like a standard invertible bloom lookup table.

Regardless, there are some advantages to getting the size of your decoding session correct the first time. An example might be if you're performing set reconciliation over some RPC and you want to minimise the number of round trips it takes to perform a decode.

## Coded Symbol Multiplier

The number of coded symbols required to find the difference between two sets is proportional to the size of that difference. The following chart shows the relationship between the number of coded symbols required to decode the HRIBLT and the size of the diff. Note that the size of the base set (before diffs were added) was fixed at 1000 entries.

`y = len(coded_symbols) / diff_size`

![Coded symbol multiplier](../evaulation/overhead/overhead.png)
Collaborator: looks like the file path got messed up :) `evaulation`

For small diffs, the number of coded symbols required per value is larger; after a difference of approximately 100 values the coefficient settles at around 1.3 to 1.4.
Collaborator: the graph should probably be logarithmic in the diff size to show that the variance goes down with larger n!
I.e. for large (>10000) the variance is almost completely gone.


You can use this chart, combined with an estimate of the diff size (perhaps from a `geo_filter`), to increase the probability that you will have a successful decode after a single round-trip while also minimising the amount of data sent.
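
As a rough sketch of that sizing heuristic: the 1.4 multiplier comes from the chart's plateau, and the minimum of 8 symbols is an arbitrary floor for very small estimates, not a library constant.

```rust
// Hedged sketch: pick an initial number of coded symbols from an estimated
// diff size using the ~1.4x multiplier observed in the chart above. The
// multiplier and the floor of 8 are illustrative values, not library constants.
fn initial_coded_symbols(estimated_diff: usize) -> usize {
    const MULTIPLIER: f64 = 1.4;
    ((estimated_diff as f64 * MULTIPLIER).ceil() as usize).max(8)
}

fn main() {
    // e.g. a geo_filter estimates ~100 differing values
    println!("{}", initial_coded_symbols(100)); // 140
}
```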
78 changes: 78 additions & 0 deletions crates/hriblt/evaluation/README.md
@@ -0,0 +1,78 @@
# Evaluation

The crate includes an overhead evaluation to measure the efficiency of the set reconciliation algorithm.
Below are instructions on how to run the evaluation.

_Note that all of these should be run from the crate's root directory!_

## Overhead Evaluation

The overhead evaluation measures how many coded symbols are required to successfully decode set differences of various sizes.
The key metric is the **overhead multiplier**: the ratio of coded symbols needed to the actual diff size.

See [overhead results](evaluation/overhead.md) for the predefined configurations.

### Running the Evaluation

1. Ensure R and ImageMagick are installed with necessary packages:
   - Install R from [download](https://cran.r-project.org/) or using your platform's package manager.
   - Start `R` and install packages by executing `install.packages(c('dplyr', 'ggplot2', 'readr', 'stringr', 'scales'))`.
   - Install ImageMagick using the official [instructions](https://imagemagick.org/script/download.php).
   - Install Ghostscript for PDF conversion: `brew install ghostscript` (macOS) or `apt install ghostscript` (Linux).

   Collaborator: question: copilot created me a graph with the plotter crate. Should we use that instead?

2. Run the benchmark tool directly to see available options:

       cargo run --release --features bin --bin hriblt-bench -- --help

   Example: run 10000 trials with diff sizes from 1 to 100:

       cargo run --release --features bin --bin hriblt-bench -- \
           --trials 10000 \
           --set-size 1000 \
           --diff-size '1..101' \
           --diff-mode incremental \
           --tsv

3. To generate a plot from TSV output:

       cargo run --release --features bin --bin hriblt-bench -- \
           --trials 10000 --diff-size '1..101' --diff-mode incremental --tsv > overhead.tsv

       evaluation/plot-overhead.r overhead.tsv overhead.pdf

### TSV Output Format

When using `--tsv`, the benchmark outputs tab-separated data with the following columns:

| Column | Description |
|--------|-------------|
| `trial` | Trial number (1-indexed) |
| `set_size` | Number of elements in each set |
| `diff_size` | Number of differences between sets |
| `success` | Whether decoding succeeded (`true`/`false`) |
| `coded_symbols` | Number of coded symbols needed to decode |
| `overhead` | Ratio of coded_symbols to diff_size |
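
If you want to sanity-check the numbers without going through R, a small helper like the sketch below can aggregate the TSV directly. It assumes the column order from the table above and a single header row; the file name `overhead.tsv` matches the example in the previous section.

```rust
// Hedged sketch: read the benchmark's TSV output and print the mean overhead
// per diff size. Column order is assumed to match the table above:
// trial, set_size, diff_size, success, coded_symbols, overhead.
use std::collections::BTreeMap;
use std::fs;

fn main() -> std::io::Result<()> {
    let data = fs::read_to_string("overhead.tsv")?;
    let mut sums: BTreeMap<u64, (f64, u64)> = BTreeMap::new();
    for line in data.lines().skip(1) {
        let cols: Vec<&str> = line.split('\t').collect();
        let diff_size: u64 = cols[2].parse().expect("diff_size column");
        let overhead: f64 = cols[5].parse().expect("overhead column");
        let entry = sums.entry(diff_size).or_insert((0.0, 0));
        entry.0 += overhead;
        entry.1 += 1;
    }
    for (diff_size, (total, count)) in sums {
        println!("diff_size={diff_size}\tmean_overhead={:.3}", total / count as f64);
    }
    Ok(())
}
```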

### Regenerating Documentation Plots

To regenerate the plots in the repository, ensure ImageMagick is available, and run:

    scripts/generate-overhead-plots

_Note that this will always re-run the benchmark!_

## Understanding the Results

The overhead plot shows percentiles of the overhead multiplier across different diff sizes:

- **Gray region**: Full range (p0 to p99) - min to ~max overhead observed
- **Blue region**: Interquartile range (p25 to p75) - where 50% of trials fall
- **Dark line**: Median (p50) - typical overhead

A lower overhead means more efficient encoding. An overhead of 1.0x would mean the number of coded symbols exactly equals the diff size (theoretical minimum).

Typical results show:
- Small diff sizes (1-10) have higher variance and overhead due to the probabilistic nature of the algorithm
- Larger diff sizes (50+) converge to a more stable overhead around 1.3-1.5x
- The algorithm successfully decodes 100% of trials when given up to 10x the diff size in coded symbols
Collaborator: 10x the diff size isn't a great outcome :)
Mention here the min trick I described above (and which we use in production)

Collaborator: I would actually expect that we need AT MOST 50% overhead after some diff size.