Open source HRIBLT #94
@@ -0,0 +1,23 @@
[package]
name = "hriblt"
version = "0.1.0"
edition = "2024"
description = "Algorithm for rateless set reconciliation"
repository = "https://github.com/github/rust-gems"
license = "MIT"
keywords = ["set-reconciliation", "sync", "algorithm", "probabilistic"]
categories = ["algorithms", "data-structures", "mathematics", "science"]

[dependencies]
thiserror = "2"
clap = { version = "4", features = ["derive"], optional = true }
rand = { version = "0.9", optional = true }

[[bin]]
name = "hriblt-bench"
path = "evaluation/bench.rs"
required-features = ["bin"]

[features]
default = []
bin = ["dep:clap", "dep:rand"]
@@ -0,0 +1,55 @@
# Hierarchical Rateless Bloom Lookup Tables

A novel algorithm for computing the symmetric difference between two sets, where the amount of data shared is proportional to the size of the difference between the sets rather than to their overall size.

## Usage

Add the library to your `Cargo.toml` file.

```toml
[dependencies]
hriblt = "0.1"
```

Create two encoding sessions, one containing Alice's data and another containing Bob's data. Bob's data might have been sent to you over a network, for example.

The following example attempts to reconcile the differences between two such sets of `u64` integers, and is written from the perspective of "Bob", who has received some symbols from "Alice".

```rust
use hriblt::{DecodingSession, EncodingSession, DefaultHashFunctions};

// On Alice's computer:

// Alice creates an encoding session...
let mut alice_encoding_session = EncodingSession::<u64, DefaultHashFunctions>::new(DefaultHashFunctions, 0..128);

// ...and adds her data to that session, in this case the numbers from 0 to 10.
for i in 0..=10 {
    alice_encoding_session.insert(i);
}

// On Bob's computer:

// Bob creates his encoding session. Note that the range **must** be the same as Alice's.
let mut bob_encoding_session = EncodingSession::<u64, DefaultHashFunctions>::new(DefaultHashFunctions, 0..128);

// Bob adds his data, the numbers from 5 to 15.
for i in 5..=15 {
    bob_encoding_session.insert(i);
}

// "Subtract" Bob's coded symbols from Alice's. The remaining symbols will be the symmetric
// difference between the two sets, iff we can decode them. This is a commutative operation, so you
// could also subtract Alice's symbols from Bob's and it would still work.
let merged_sessions = alice_encoding_session.merge(bob_encoding_session, true);

let decoding_session = DecodingSession::from_encoding(merged_sessions);

assert!(decoding_session.is_done());

let mut diff = decoding_session.into_decoded_iter().map(|v| v.into_value()).collect::<Vec<_>>();

diff.sort();

assert_eq!(diff, [0, 1, 2, 3, 4, 11, 12, 13, 14, 15]);
```

Collaborator
You want to show here that it ALSO tells you whether a value was Inserted or Removed.

Collaborator
I think we also want an example showing the hierarchical aspect of this.
@@ -0,0 +1,21 @@
# Hash Functions

This library has a trait, `HashFunctions`, which is used to create the hashes required to place your symbol into the range of coded symbols.

The following documentation provides more details on this trait in particular. How and why this is done is explained in the `overview.md` documentation.

Collaborator
Suggested change: drop the second sentence, leaving only "The following documentation provides more details on this trait in particular."

We provide a `DefaultHashFunctions` type which is a wrapper around the `DefaultHasher` type provided by the Rust standard library. Though the seed for this function is fixed, it should be noted that the hashes produced by this type are *not* guaranteed to be stable across different versions of the Rust standard library. As such, you should not use this type in any situation where clients might be running a binary built with an unspecified version of Rust.

Collaborator
I think we should have a feature flag which provides the SHA1 functionality for you...
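
To make the stability caveat above concrete, the following standalone sketch (standard library only, not the crate's `HashFunctions` trait) hashes a value with `DefaultHasher`. The result is deterministic within a single build, but nothing guarantees it stays the same across Rust releases, which is why `DefaultHashFunctions` is unsuitable when clients may run binaries built with different toolchains.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash a value with the standard library's default hasher. `DefaultHasher::new()`
// always uses the same fixed keys, so this is deterministic within one build,
// but the algorithm and therefore the output may change between Rust releases.
fn std_hash(value: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    // Two peers built with different Rust toolchains may print different values here.
    println!("{:#018x}", std_hash(42));
}
```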
@@ -0,0 +1,17 @@
# Sizing your HRIBLT

Because the HRIBLT is rateless, it is possible to append additional data in order to make decoding possible. That is, it does not need to be sized in advance like a standard invertible bloom lookup table.

Regardless, there are some advantages to getting the size of your decoding session correct the first time. An example might be if you're performing set reconciliation over some RPC and you want to minimise the number of round trips it takes to perform a decode.

## Coded Symbol Multiplier

The number of coded symbols required to find the difference between two sets is proportional to the size of that difference. The following chart shows the relationship between the number of coded symbols required to decode the HRIBLT and the size of the diff. Note that the size of the base set (before diffs were added) was fixed at 1000 entries.

`y = len(coded_symbols) / diff_size`



Collaborator
looks like the file path got messed up :)

For small diffs, the number of coded symbols required per value is larger; after a difference of approximately 100 values, the coefficient settles at around 1.3 to 1.4.

Collaborator
the graph should probably be logarithmic in the diff size to show that the variance goes down with larger n!

You can use this chart, combined with an estimate of the diff size (perhaps from a `geo_filter`), to increase the probability that you will have a successful decode after a single round trip while also minimising the amount of data sent.
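
As a rough illustration of how the multiplier can be applied, the sketch below turns an estimated diff size into a coded-symbol budget using the ~1.3 to 1.4 coefficient from the chart. The `estimate_coded_symbols` helper and the extra padding for small diffs are illustrative assumptions, not part of the `hriblt` API.

```rust
/// Rough coded-symbol budget for a single round trip, based on the chart above.
/// `estimated_diff` might come from a sketch such as a `geo_filter`.
/// This is a hypothetical helper, not something provided by the crate.
fn estimate_coded_symbols(estimated_diff: usize) -> usize {
    // The coefficient settles at roughly 1.3-1.4 once the diff exceeds ~100 values;
    // smaller diffs need proportionally more symbols, so pad them more generously
    // (the 2.0 figure here is an illustrative guess, not a measured value).
    let multiplier = if estimated_diff < 100 { 2.0 } else { 1.4 };
    ((estimated_diff as f64) * multiplier).ceil() as usize
}

fn main() {
    // An estimated diff of 500 values suggests sending about 700 coded symbols.
    println!("{}", estimate_coded_symbols(500));
}
```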
@@ -0,0 +1,78 @@
# Evaluation

The crate includes an overhead evaluation to measure the efficiency of the set reconciliation algorithm.
Below are instructions on how to run the evaluation.

_Note that all of these should be run from the crate's root directory!_

## Overhead Evaluation

The overhead evaluation measures how many coded symbols are required to successfully decode set differences of various sizes.
The key metric is the **overhead multiplier**: the ratio of coded symbols needed to the actual diff size.

See [overhead results](evaluation/overhead.md) for the predefined configurations.

### Running the Evaluation

1. Ensure R and ImageMagick are installed with necessary packages:

   - Install R from [download](https://cran.r-project.org/) or using your platform's package manager.
   - Start `R` and install packages by executing `install.packages(c('dplyr', 'ggplot2', 'readr', 'stringr', 'scales'))`.
   - Install ImageMagick using the official [instructions](https://imagemagick.org/script/download.php).
   - Install Ghostscript for PDF conversion: `brew install ghostscript` (macOS) or `apt install ghostscript` (Linux).

Collaborator
question: copilot created me a graph with the plotter crate.
2. Run the benchmark tool directly to see available options:

       cargo run --release --features bin --bin hriblt-bench -- --help

   Example: run 10,000 trials with diff sizes from 1 to 100:

       cargo run --release --features bin --bin hriblt-bench -- \
           --trials 10000 \
           --set-size 1000 \
           --diff-size '1..101' \
           --diff-mode incremental \
           --tsv

3. To generate a plot from the TSV output:

       cargo run --release --features bin --bin hriblt-bench -- \
           --trials 10000 --diff-size '1..101' --diff-mode incremental --tsv > overhead.tsv

       evaluation/plot-overhead.r overhead.tsv overhead.pdf
### TSV Output Format

When using `--tsv`, the benchmark outputs tab-separated data with the following columns:

| Column | Description |
|--------|-------------|
| `trial` | Trial number (1-indexed) |
| `set_size` | Number of elements in each set |
| `diff_size` | Number of differences between the sets |
| `success` | Whether decoding succeeded (`true`/`false`) |
| `coded_symbols` | Number of coded symbols needed to decode |
| `overhead` | Ratio of `coded_symbols` to `diff_size` |
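
If you want to post-process the TSV without R, a minimal sketch along these lines can group trials by diff size and report the median overhead. It assumes the benchmark writes a header row containing the column names above; that header, the `overhead.tsv` filename, and the helper itself are illustrative assumptions rather than part of the crate.

```rust
use std::collections::BTreeMap;
use std::error::Error;
use std::fs;

// Minimal sketch: read `overhead.tsv`, group the `overhead` column by `diff_size`,
// and print the median overhead per diff size. Assumes the first line is a header
// row naming the columns listed above.
fn main() -> Result<(), Box<dyn Error>> {
    let data = fs::read_to_string("overhead.tsv")?;
    let mut lines = data.lines();

    // Locate the columns we need from the header row.
    let header: Vec<&str> = lines.next().ok_or("empty file")?.split('\t').collect();
    let diff_col = header.iter().position(|c| *c == "diff_size").ok_or("no diff_size column")?;
    let overhead_col = header.iter().position(|c| *c == "overhead").ok_or("no overhead column")?;

    let mut by_diff: BTreeMap<u64, Vec<f64>> = BTreeMap::new();
    for line in lines {
        let fields: Vec<&str> = line.split('\t').collect();
        let diff_size: u64 = fields[diff_col].parse()?;
        let overhead: f64 = fields[overhead_col].parse()?;
        by_diff.entry(diff_size).or_default().push(overhead);
    }

    for (diff_size, mut overheads) in by_diff {
        overheads.sort_by(|a, b| a.partial_cmp(b).unwrap());
        let median = overheads[overheads.len() / 2];
        println!("diff_size={diff_size}\tmedian_overhead={median:.2}");
    }
    Ok(())
}
```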
### Regenerating Documentation Plots

To regenerate the plots in the repository, ensure ImageMagick is available, and run:

    scripts/generate-overhead-plots

_Note that this will always re-run the benchmark!_
## Understanding the Results

The overhead plot shows percentiles of the overhead multiplier across different diff sizes:

- **Gray region**: Full range (p0 to p99), min to ~max overhead observed
- **Blue region**: Interquartile range (p25 to p75), where 50% of trials fall
- **Dark line**: Median (p50), the typical overhead

A lower overhead means more efficient encoding. An overhead of 1.0x would mean the number of coded symbols exactly equals the diff size (the theoretical minimum).

Typical results show:

- Small diff sizes (1-10) have higher variance and overhead due to the probabilistic nature of the algorithm
- Larger diff sizes (50+) converge to a more stable overhead of around 1.3-1.5x
- The algorithm successfully decodes 100% of trials when given up to 10x the diff size in coded symbols
Collaborator
10x the diff size isn't a great outcome :)

Collaborator
I would actually expect that we need AT MOST 50% overhead after some diff size.

Collaborator
we also want some benchmarks to show throughput