Inner join two `HashMap`s in Rust

Question

Suppose we have two std::collections::HashMap-s, say HashMap and HashMap, and a function Fn(V1, V2) -> R.

How do I perform an inner join on these hashmaps so that I get HashMap on their shared keys as a result? Here's a reference implementation to illustrate what I mean:

use std::collections::HashMap;

fn main() {
    let mut map_a = HashMap::<&str, i32>::new();
    map_a.insert("foo", 1);
    map_a.insert("bar", 2);
    map_a.insert("qux", 3);
    map_a.insert("quux", 4);
    
    let mut map_b = HashMap::<&str, i32>::new();
    map_b.insert("foo", 5);
    map_b.insert("bar", 6);
    map_b.insert("quuux", 7);
    map_b.insert("quuuux", 8);
    
    // To keep it simple I'm just combining respective values into a tuple:
    let joined_map = map_a
        .into_iter()
        .filter_map(|(key, value_a)| map_b.remove(key).map(|value_b| (key, (value_a, value_b))))
        .collect::>();
        
    println!("{:?}", joined_map);
}

An expected output is:

{"foo": (1, 5), "bar": (2, 6)}

This works fine and I can make it generic in a utils.rs, it just doesn't feel right that I can't find an existing implementation of this in the standard library or on crates.io. Also, it seems suboptimal not to perform kind of sort-merge join, because we already have sorted hash codes underneath HashMap. Am I overseeing something? Or just being too picky because the lookups are O(1) anyway?

Kevin Reid · Accepted Answer

As the answer you linked mentions, for a hash table, which does not have an inherent ordering, there is no way to do this more efficiently. (Using the hash codes of the objects do not help, because unless you've used an alternate hasher, every HashMap uses a different hash function, to prevent denial-of-service attacks which submit deliberately colliding hash keys.)

There is one high-level improvement you can make to the algorithm: iterate over the smaller of the two hash tables, since that requires the fewest lookups.

If you have a sorted collection, such as a BTreeMap, then itertools::Itertools::merge_join_by can help you implement a merge of sorted data:

use std::collections::BTreeMap;
use itertools::{Itertools, EitherOrBoth};

fn main() {
    let mut map_a = BTreeMap::<&str, i32>::new();
    map_a.insert("foo", 1);
    map_a.insert("bar", 2);
    map_a.insert("qux", 3);
    map_a.insert("quux", 4);
    
    let mut map_b = BTreeMap::<&str, i32>::new();
    map_b.insert("foo", 5);
    map_b.insert("bar", 6);
    map_b.insert("quuux", 7);
    map_b.insert("quuuux", 8);
    
    let joined_map = map_a
        .into_iter()
        .merge_join_by(map_b, |(key_a, _), (key_b, _)| Ord::cmp(key_a, key_b))
        .filter_map(|elem| match elem {
            // Keep only paired elements, discard all others (making this an inner join).
            EitherOrBoth::Both((k, val_a), (_, val_b)) => Some((k, (val_a, val_b))),
            _ => None,
        })
        .collect::>();
        
    println!("{:?}", joined_map);
}

This is a performance improvement over individual lookups, if the maps are of comparable size, since BTreeMap's lookups are O(log(n)), so the whole join algorithm would be O(n log(n)) naively and is instead O(n) using merge, where n is the maximum size of the two maps.

On the other hand, if one of the maps is much smaller, then iterating over that smaller map is a better algorithm, since O(n log(n)) is better than O(m) if n is smaller than m by a factor of log(n).

In summary, there is more than one possible algorithm, and which one is best depends on the collection type and the relative collection sizes. And you can use sortedness to speed up a join, but it's only beneficial if you already have sortedness, which you don't in a hash map.

Inner join two `HashMap`s in Rust

Answers (1)

Related Questions