Hash Maps and Hash Sets: Key-Value Storage

About This Module

This module introduces HashMap and HashSet collections in Rust, which provide efficient key-value storage and set operations. You'll learn how to use these collections for fast lookups, counting, and deduplication tasks common in data processing.

Prework

Prework Reading

Please read the following sections from The Rust Programming Language Book:

Pre-lecture Reflections -- Part 1

  1. Why must a HashMap take ownership of values like String, and what memory safety problems does this solve?
  2. How does the entry API help you safely update a value?
  3. The get method returns an Option. Why is this a crucial design choice, and what common bugs does it prevent?
  4. When would you choose to use a HashMap over a Vec, and what is the main performance trade-off for looking up data?

Pre-lecture Reflections -- Part 2

  1. How do hash maps achieve O(1) average-case lookup time?
  2. What are the tradeoffs between HashMap and BTreeMap in Rust?
  3. When would you use a HashSet vs a Vec for storing unique values?
  4. What makes a good hash function?

Learning Objectives

By the end of this module, you will be able to:

  • Create and manipulate HashMap and HashSet collections
  • Understand hash table operations and their complexity
  • Choose appropriate collection types for different use cases
  • Handle hash collisions and understand their implications

Hash maps

Collection HashMap<K,V>

Goal: a mapping from elements of K to elements of V

  • elements of K called keys -- must be unique
  • elements of V called values -- need not be unique

Similar structure in other languages:

  • Python: dictionaries
  • C++: unordered_map<K,V>
  • Java: Hashtable<K,T>

Creating a HashMap

  • Create a hash map and insert key-value pairs
  • Extract a reference with .get()
#![allow(unused)]
fn main() {
use std::collections::HashMap;

// number of wins in a local Counterstrike league
let mut wins = HashMap::<String,u16>::new();

// Insert creates a new key/value if exists and overwrites old value if key exists
wins.insert(String::from("Boston University"),24);
wins.insert(String::from("Harvard"),22);
wins.insert(String::from("Boston College"),20);
wins.insert(String::from("Northeastern"),32);

// Extracting a reference: returns `Option<&V>`

println!("Boston University wins: {:?}", wins.get("Boston University"));
println!("MIT wins: {:?}", wins.get("MIT"));
}

Inserting a key-value pair if not present

To check if a key is present, and if not, insert a default value, you can use .entry().or_insert().

#![allow(unused)]
fn main() {
use std::collections::HashMap;

// number of wins in a local Counterstrike league
let mut wins = HashMap::<String,u16>::new();

// Insert creates a new key/value if exists and overwrites old value if key exists
wins.insert(String::from("Boston University"),24);
wins.insert(String::from("Harvard"),22);
wins.insert(String::from("Boston College"),20);
wins.insert(String::from("Northeastern"),32);

//Insert if not present, you can use `.entry().or_insert()`.

wins.entry(String::from("MIT")).or_insert(10);
println!("MIT wins: {:?}", wins.get("MIT"));
}

Updating a value based on the old value

To update a value based on the old value, you can use .entry().or_insert() and get a mutable reference to the value.

#![allow(unused)]
fn main() {
use std::collections::HashMap;

// number of wins in a local Counterstrike league
let mut wins = HashMap::<String,u16>::new();

// Insert creates a new key/value if exists and overwrites old value if key exists
wins.insert(String::from("Boston University"),24);
wins.insert(String::from("Harvard"),22);
wins.insert(String::from("Boston College"),20);
wins.insert(String::from("Northeastern"),32);

// Updating a value based on the old value:
println!("Boston University wins: {:?}", wins.get("Boston University"));

{ // code block to limit how long the reference lasts
    let entry = wins.entry(String::from("Boston University")).or_insert(10);
    *entry += 50;
}
//wins.insert(String::from("Boston University"),24);
println!("Boston University wins: {:?}", wins.get("Boston University"));
}

Iterating

You can iterate over each key-value pair with a for loop similar to vectors.

#![allow(unused)]
fn main() {
use std::collections::HashMap;

// number of wins in a local Counterstrike league
let mut wins = HashMap::<String,u16>::new();

// Insert creates a new key/value if exists and overwrites old value if key exists
wins.insert(String::from("Boston University"),24);
wins.insert(String::from("Harvard"),22);
wins.insert(String::from("Boston College"),20);
wins.insert(String::from("Northeastern"),32);

for (k,v) in &wins {
    println!("{}: {}",k,v);
};

println!("\nUse .iter(): ");
for (k,v) in wins.iter() {
    println!("{}: {}",k,v);
};
}

Iterating and Modifying Values

To modify values, you have to use mutable versions:

#![allow(unused)]
fn main() {
use std::collections::HashMap;

// number of wins in a local Counterstrike league
let mut wins = HashMap::<String,u16>::new();

// Insert creates a new key/value if exists and overwrites old value if key exists
wins.insert(String::from("Boston University"),24);
wins.insert(String::from("Harvard"),22);
wins.insert(String::from("Boston College"),20);
wins.insert(String::from("Northeastern"),32);

for (k,v) in &wins {    // note: `wins` is immutable, so we can't mutate the values
    println!("{}: {}",k,v);
};

println!("\nUse implicit mutable iterator: ");
for (k,v) in &mut wins {    // note: `wins` is mutable, so we can mutate the values
    *v += 1;
    println!("{}: {}",k,v);
};

println!("\nUse .iter_mut(): ");
for (k,v) in wins.iter_mut() {    // note: `wins` is mutable, so we can mutate the values
    *v += 1;
    println!("{}: {}",k,v);
};
}

Using HashMaps with Match statements

  • Let's use a hash map to store the price of different items in a cafe
  • Use .get(&key) to get the value of a key wrapped in an Option enum.
#![allow(unused)]
fn main() {
use std::collections::HashMap;

let mut crispy_crêpes_café = HashMap::new();
crispy_crêpes_café.insert(String::from("Nutella Crêpe"),5.85);
crispy_crêpes_café.insert(String::from("Strawberries and Nutella Crêpe"),8.75);
crispy_crêpes_café.insert(String::from("Roma Tomato, Pesto and Spinach Crêpe"),8.90);
crispy_crêpes_café.insert(String::from("Three Mushroom Crêpe"),8.90);

fn on_the_menu(cafe: &HashMap<String,f64>, key:String) {
    print!("{}: ",key);
    match cafe.get(&key) {  // .get() returns an Option enum
        None => println!("not on the menu"),
        Some(price) => println!("${:.2}",price),
    }
}
on_the_menu(&crispy_crêpes_café, String::from("Four Mushroom Crêpe"));
on_the_menu(&crispy_crêpes_café, String::from("Three Mushroom Crêpe"));
}

Summary of Useful HashMap Methods

TODO: Add details on return values.

Basic Operations:

  • new(): Creates an empty HashMap.
  • insert(key, value) -> Option<V>: Adds a key-value pair to the map. If the map did not have the key then None is returned, if the map did have the key the old value is returned as Some(V).
  • remove(key) -> Option<V>: Removes a key-value pair from the map. If the map did not have the key then None is returned, if the map did have the key the value is returned as Some(V).
  • get(key) -> Option<&V>: Returns a reference to the value in the map, if any, that is equal to the given key. Returns None if the key is not present.
  • contains_key(key) -> bool: Checks if the map contains a specific key. Returns true if present, false otherwise.
  • len() -> usize: Returns the number of key-value pairs in the map.
  • is_empty() -> bool: Checks if the map contains no key-value pairs.
  • clear(): Removes all key-value pairs from the map.
  • drain(): Clears the map and returns all key-value pairs as an iterator.

Entry API:

  • entry(key).or_insert(value): Returns a mutable reference to the value in the map, if any, that is equal to the given key. If the key is not present, inserts the given value and returns a mutable reference to the new value.

Iterators and Views:

  • iter(): Returns an immutable iterator over the key-value pairs in the map.
  • iter_mut(): Returns a mutable iterator over the key-value pairs in the map.
  • keys(): Returns an iterator over the keys in the map.
  • values(): Returns an iterator over the values in the map.
  • values_mut(): Returns a mutable iterator over the values in the map.

See the documentation for more details.

True/False Statements on HashMaps

  • In a Rust HashMap<K,V>, keys must be unique, but values need not be unique.
  • Calling .insert() on a HashMap with a key that already exists will cause a runtime panic.
  • The .get(&key) method returns a reference directly to the value, or panics if the key is missing.
  • The expression map.entry(key).or_insert(0) returns a mutable reference to the value, allowing in-place updates like *entry += 1.
  • To mutate values while iterating over a HashMap, you must use &mut map or .iter_mut() — iterating with &map yields immutable references.

Enter your answers into gradescope.

Solutions will be posted after the lecture.

How Hash Tables Work

Internal Representation

Array of Option enums of tuples (key, value, hash)

  • A hash map is represented as an array of buckets, e.g. capacity
  • The array is an array of Option<T> enums like Vec<Option<T>>) ,
  • And the Some(<T>) variant has value T with tuple (key, value, hash)
  • So the internal representation is like Vec<Option<(K, V, u64)>>

Hash function

  • Use a hash function which is like a pseudorandom number generator with key as the seed, e.g.
  • Pseudorandom means that the same key will always produce the same hash, but different keys will produce different hashes.
  • Then take modulo of capacity , e.g. index = hash % 8 = 6
  • So ultimately maps keys into one of the buckets

Hash Function Examples

Let's calculate hash and index for different inputs using Rust's built-in hash function.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn hash_function(input: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    input.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let B = 8;  // capacity of the hash map (e.g. number of buckets)

    let input = "Hello!";
    let hash = hash_function(input);
    let index = hash % B;
    println!("Hash of '{}' is: {} and index is: {}", input, hash, index);

    let input = "Hello";  // slight change in input
    let hash = hash_function(input);
    let index = hash % B;
    println!("Hash of '{}' is: {} and index is: {}", input, hash, index);

    let input = "hello";  // slight change in input
    let hash = hash_function(input);
    let index = hash % B;
    println!("Hash of '{}' is: {} and index is: {}", input, hash, index);

}
  • Any collisions?
  • Try increasing the capacity to 16 and see how the index changes.

More Hash Function Examples

  • Keys don't have to be strings.
  • They can be any type that implements the Hash trait.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn generic_hash_function<T: Hash>(input: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    input.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    // Using the generic hash function with different types
    println!("\nUsing generic_hash_function:");
    println!("String hash: {}", generic_hash_function(&"Hello, world!"));
    println!("Integer hash: {}", generic_hash_function(&42));
    // println!("Float hash: {}", generic_hash_function(&3.14)); // what if we try float?
    println!("Bool hash: {}", generic_hash_function(&true));
    println!("Tuple hash: {}", generic_hash_function(&(1, 2, 3)));
    println!("Vector hash: {}", generic_hash_function(&vec![1, 2, 3, 4, 5]));
    println!("Char hash: {}", generic_hash_function(&'A'));
}

What if you try to hash a float?

General ideas

  • Store keys (and associated values and hashes) in buckets
  • Indexing: Use hash function to find bucket holding key and value.

Collision: two keys mapped to the same bucket

  • Very unlikely given the pseudorandom nature of the hash function
  • What to do if two keys in the same bucket

Handling collisions

Probing

  • Each bucket entry: (key, value, hash)
  • Use a deterministic algorithm to find an open bucket

Inserting:

  • entry busy: try , , etc.
  • insert into first empty

Searching:

  • try , , , etc.
  • stop when found or empty entry

Handling collisions, example

Step 1

Step 1: Empty hash map with 4 buckets

Index:  0         1         2         3
      +-------+ +-------+ +-------+ +-------+
      | empty | | empty | | empty | | empty |
      +-------+ +-------+ +-------+ +-------+

Step 2

Step 2: Insert key="apple", hash("apple") = 42

hash("apple") = 42
42 % 4 = 2  ← insert at index 2

Index:  0         1         2         3
      +-------+ +-------+ +-------+ +-------+
      | empty | | empty | |apple,v| | empty |
      |       | |       | | h=42  | |       |
      +-------+ +-------+ +-------+ +-------+
                            ^
                            insert here

Step 3

Step 3: Insert key="banana", hash("banana") = 14

hash("banana") = 14
14 % 4 = 2  ← collision! index 2 is occupied, and not same key

Index:  0         1         2         3
      +-------+ +-------+ +-------+ +-------+
      | empty | | empty | |apple,v| | empty |
      |       | |       | | h=42  | |       |
      +-------+ +-------+ +-------+ +-------+
                            ^
                            occupied, check next

Step 4

Step 4: Linear probing - check next bucket (index 3)

Index 2 is full, try (2+1) % 4 = 3

Index:  0         1         2         3
      +-------+ +-------+ +-------+ +-------+
      | empty | | empty | |apple,v| |banana,v|
      |       | |       | | h=42  | | h=14   |
      +-------+ +-------+ +-------+ +-------+
                                      ^
                                      insert here

Step 5

Step 5: Insert key="cherry", hash("cherry") = 10

hash("cherry") = 10
10 % 4 = 2  ← collision again!

Check index 2: occupied (apple), not (cherry)
Check index 3: occupied (banana), not (cherry)
Check index 0: empty! ← wrap around and insert

Index:  0         1         2         3
      +-------+ +-------+ +-------+ +-------+
      |cherry,v| | empty | |apple,v| |banana,v|
      | h=10   | |       | | h=42  | | h=14   |
      +-------+ +-------+ +-------+ +-------+
        ^
        insert here after wrapping around

Searching for a key

Current state of hash map:

Index:  0         1         2         3
      +-------+ +-------+ +-------+ +-------+
      |cherry,v| | empty | |apple,v| |banana,v|
      | h=10   | |       | | h=42  | | h=14   |
      +-------+ +-------+ +-------+ +-------+

Step 1

Step 1: Search for key="cherry"

hash("cherry") = 10
10 % 4 = 2  ← start searching at index 2

Index:  0         1         2         3
      +-------+ +-------+ +-------+ +-------+
      |cherry,v| | empty | |apple,v| |banana,v|
      | h=10   | |       | | h=42  | | h=14   |
      +-------+ +-------+ +-------+ +-------+
                            ^
                            check here first

Step 2

Step 2: Check index 2

Index 2: key = "apple" ≠ "cherry"
         bucket occupied, continue probing

Index:  0         1         2         3
      +-------+ +-------+ +-------+ +-------+
      |cherry,v| | empty | |apple,v| |banana,v|
      | h=10   | |       | | h=42  | | h=14   |
      +-------+ +-------+ +-------+ +-------+
                            ^
                            not found, try next

Step 3

Step 3: Check index 3 (next probe)

Index 3: key = "banana" ≠ "cherry"
         bucket occupied, continue probing

Index:  0         1         2         3
      +-------+ +-------+ +-------+ +-------+
      |cherry,v| | empty | |apple,v| |banana,v|
      | h=10   | |       | | h=42  | | h=14   |
      +-------+ +-------+ +-------+ +-------+
                                      ^
                                      not found, try next

Step 4

Step 4: Check index 0 (wrap around)

Index 0: key = "cherry" = "cherry" ✓
         FOUND! Return value

Index:  0         1         2         3
      +-------+ +-------+ +-------+ +-------+
      |cherry,v| | empty | |apple,v| |banana,v|
      | h=10   | |       | | h=42  | | h=14   |
      +-------+ +-------+ +-------+ +-------+
        ^
        FOUND: return value v

Key point: Linear probing continues until we either:

  • Find a matching key (success)
  • Find an empty bucket (key doesn't exist)
  • Check all buckets (hash map is full)

What is worse case scenario?

  • All keys map to the same bucket.

  • We have to check all buckets to find the key.

  • This is time complexity.

  • This is the worst case scenario for linear probing.

What is the average case scenario?

  • Each bucket has 1 key.

  • We have to check about 1 bucket to find the key.

  • This is time complexity.

  • This is the average case scenario for linear probing.

Growing the collection: amortization

Keep track of the number of filled entries.

When the number of keys

  • Double
  • Pick new hash function
  • Move the information

Adversarial data

  • Could create lots of collisions

  • Potential basis for denial of service attacks

What makes a good hash function?

  • Uniform distribution of inputs to the buckets available!!!
  • Consistent hashing adds the property that not too many things move around when the number of buckets changes

http://www.partow.net/programming/hashfunctions/index.html https://en.wikipedia.org/wiki/Consistent_hashing https://doc.rust-lang.org/std/collections/struct.HashMap.html

To Dig Deeper (Optional)

Clone, inspect and debug/single-step through a simple implementation that supports creation, insert, get and remove.

See how index is found from hashing the key.

See how collision is handled.

Hashing with custom types in Rust

How do we use custom datatypes as keys?

Required for hashing:

  1. check if K equal
  2. compute a hash function for elements of K
#![allow(unused)]
fn main() {
use std::collections::HashMap;

struct Point {
    x:i64,
    y:i64,
}

let point = Point{x:2,y:-1};

let mut elevation = HashMap::new();

elevation.insert(point,2.3);

}

Most importantly:

[E0277] Error: the trait bound `Point: Eq` is not satisfied
[E0277] Error: the trait bound `Point: Hash` is not satisfied

Required traits for custom types

In order for a data structure to work as a key for hashmap, they need three traits:

  • PartialEq (required by Eq)
    • ✅ Symmetry: If a == b, then b == a.
    • ✅ Transitivity: If a == b and b == c, then a == c.
    • ❌ Reflexivity is NOT guaranteed (because e.g. NaN != NaN in floats).
  • Eq
    • ✅ Reflexivity: a == a is always true.
    • ✅ Symmetry: If a == b, then b == a.
    • ✅ Transitivity: If a == b and b == c, then a == c.
  • Hash
    • Supports deterministic output of a hash function
    • Consistency with Equality -- if two values are equal , then their hashes are equal
    • Non-Invertibility -- One way. You cannot reconstruct the original value from the hash
    • etc...

Default implementation

  • Eq and PartialEq are automatically derived for most types.
#![allow(unused)]
fn main() {
use std::collections::HashMap;

#[derive(Debug,Hash,Eq,PartialEq)]
struct DistanceKM(u64);

let mut tired = HashMap::new();

tired.insert(DistanceKM(30),true);
println!("{:?}", tired);
}

Reminder: All the traits that you can automatically derive from

  • Clone: Allow user to make an explicit copy
  • Copy: Allow user to make an implicit copy
  • Debug: Allow user to print contents
  • Default: Allow user to initialize with default values (Default::default())
  • Hash: Allow user to use it as a key to a hash map or set.
  • Eq: Allow user to test for equality
  • Ord: Allow user to sort and fully order types
  • PartialEq: Obeys most rules for equality but not all
  • PartialOrd: Obeys most rules for ordering but not all

HashSet<K>

"A HashMap without values"

  • No value associated with keys
  • Just a set of items
  • Same implementation
  • Fastest way to do membership tests and some set operations

Creating a HashSet

  • Create: HashSet::new()
  • .insert(), .is_empty(), .contains()
#![allow(unused)]
fn main() {
use std::collections::HashSet;

// create
let mut covid = HashSet::new();
println!("Is empty: {}", covid.is_empty());

// insert values
for i in 2019..=2022 {
    covid.insert(i);
};

println!("Is empty: {}", covid.is_empty());
println!("Contains 2019: {}", covid.contains(&2019));
println!("Contains 2015: {}", covid.contains(&2015));
}

Growing the collection: amortization

  • Let's monitor the length and capacity as we insert values.
#![allow(unused)]
fn main() {
use std::collections::HashSet;

// create
let mut covid = HashSet::new();
println!("Length: {}, Capacity: {}", covid.len(), covid.capacity());
println!("Is empty: {}", covid.is_empty());

// insert values
for i in 2019..=2022 {
    covid.insert(i);
    println!("Length: {}, Capacity: {}", covid.len(), covid.capacity());
};

println!("Length: {}, Capacity: {}", covid.len(), covid.capacity());
println!("Is empty: {}", covid.is_empty());
}
  • More expensive than growing a Vec because we need to rehash all the elements.

Iterating over a HashSet

You can iterate over a HashSet with a for loop.

#![allow(unused)]
fn main() {
use std::collections::HashSet;

// create
let mut covid = HashSet::new();

// insert values
for i in 2019..=2022 {
    covid.insert(i);
};

// use the implicit iterator
for year in &covid {
    print!("{} ",year);
}
println!();

// use the explicit iterator
for year in covid.iter() {
    print!("{} ",year);
}
println!();
}

Question: Why aren't the years in the order we inserted them?


Using .get() and .insert()

We can use .get() and .insert(), similarly to how we used them in HashMaps.

#![allow(unused)]
fn main() {
use std::collections::HashSet;

// create
let mut covid = HashSet::new();

// insert values
for i in 2019..=2022 {
    covid.insert(i);
};

// Returns `None` if not in the HashSet
println!("{:?}", covid.get(&2015));

println!("{:?}", covid.get(&2021));

covid.insert(2015); // insert 2015 if not present
covid.insert(2020); // insert 2020 if not present

// iterate over the set
for year in &covid {
    print!("{} ",year);
}
}

Summary of Useful HashSet Methods

Basic Operations:

  • new(): Creates an empty HashSet.
  • insert(value): Adds a value to the set. Returns true if the value was not present, false otherwise.
  • remove(value): Removes a value from the set. Returns true if the value was present, false otherwise.
  • contains(value): Checks if the set contains a specific value. Returns true if present, false otherwise.
  • len(): Returns the number of elements in the set.
  • is_empty(): Checks if the set contains no elements.
  • clear(): Removes all elements from the set.
  • drain(): Returns an iterator that removes all elements and yields them. The set becomes empty after this operation.

Entry API:

  • entry(value).or_insert(value): Returns a mutable reference to the value in the set, if any, that is equal to the given value. If the value is not present, inserts the given value and returns a mutable reference to the new value.

Set Operations:

  • union(&self, other: &HashSet<T>): Returns an iterator over the elements that are in self or other (or both).
  • intersection(&self, other: &HashSet<T>): Returns an iterator over the elements that are in both self and other.
  • difference(&self, other: &HashSet<T>): Returns an iterator over the elements that are in self but not in other.
  • symmetric_difference(&self, other: &HashSet<T>): Returns an iterator over the elements that are in self or other, but not in both.
  • is_subset(&self, other: &HashSet<T>): Checks if self is a subset of other.
  • is_superset(&self, other: &HashSet<T>): Checks if self is a superset of other.
  • is_disjoint(&self, other: &HashSet<T>): Checks if self has no elements in common with other.

Iterators and Views:

  • iter(): Returns an immutable iterator over the elements in the set.
  • get(value): Returns a reference to the value in the set, if any, that is equal to the given value.

See the documentation for more details.

In-Class Exercise 1: Word Frequency Counter

Task: Create a HashMap that counts the frequency of each word in the following sentence:

"rust is awesome rust is fast rust is safe"

Your code should:

  1. Split the sentence into words. (Hint: Use .split_whitespace() on your string and iterate over the result.)
  2. Count how many times each word appears using a HashMap
  3. Print each word and its frequency

Hint: Use .entry(key).or_insert(value), which returns a mutable reference to the value in the map, to initialize or increment counts.

Expected Output:

rust: 3
is: 3
awesome: 1
fast: 1
safe: 1

In-Class Exercise 2: Programming Languages Analysis

Task: Two developers list their known programming languages. Create two HashSets and perform set operations to analyze their skills.

Developer 1 knows: Rust, Python, JavaScript, C++, Go Developer 2 knows: Python, Java, JavaScript, Ruby, Go

Your code should find and print:

  1. Languages both developers know (intersection)
  2. Languages unique to Developer 1 (difference)
  3. All languages known by at least one developer (union)
  4. Languages known by exactly one developer (symmetric difference)

Hint: Create two HashSets and use set operations methods shown earlier.

Solutions will be added here after class.