optevo
optevo

Reputation: 2104

Implementing a Thread-Safe and Efficient Interning Map for Strings in Rust

I’d appreciate help with ensuring thread safety and improving the efficiency of a global string interning map in Rust, particularly when using key-specific locks to manage concurrent access and prevent race conditions during string deduplication.

The Text module snippet shown below is designed to manage strings in a memory-efficient and performant way by deduplicating them. It uses a global interning map to store unique string instances, leveraging Arc for reference counting and DashMap for thread-safe, concurrent access. Key features are:

The challenge is ensuring the interning map (STRINGS) remains thread-safe and efficient while preventing duplicate strings in a multi-threaded context. When a Text instance is dropped, its associated string should be removed from the interning map (not necessarily immediately) only when no other references exist. However, there could be race conditions as other threads could clone the Arc during the Drop process, creating a reference-count mismatch. I considered using a queue to manage removals but found it insufficient to resolve race conditions (although could help with other issues like removing a large number of strings).

My current approach is to use key-specific locks (KEY_LOCKS) to ensure atomic access during Text creation, access and deletion of each string. I'd appreciate any feedback or advice on:

This is the relevant code from the module:

use dashmap::{DashMap, DashSet};
use std::{
    sync::{Arc, LazyLock, Mutex},
    hash::{Hash, Hasher},
};

static STRINGS: LazyLock<DashSet<Arc<String>>> = LazyLock::new(|| DashSet::new());

static KEY_LOCKS: LazyLock<DashMap<Arc<String>, Arc<Mutex<()>>>> = LazyLock::new(|| DashMap::new());

fn get_key_lock(key: Arc<String>) -> Arc<Mutex<()>> {
    KEY_LOCKS
        .entry(key)
        .or_insert_with(|| Arc::new(Mutex::new(())))
        .clone()
}

#[derive(Debug, Clone)]
pub struct Text(Arc<String>);

impl Text {
    fn create_or_get_text(arc: Arc<String>) -> Self {
        let key_lock = get_key_lock(arc.clone());
        let _lock = key_lock.lock().unwrap();

        if let Some(existing) = STRINGS.get(&arc) {
            existing.clone()
        } else {
            STRINGS.insert(arc.clone());
            Text(arc)
        }
    }
}

impl Drop for Text {
    fn drop(&mut self) {
        let key_lock = get_key_lock(self.0.clone());
        let _lock = key_lock.lock().unwrap();
        if Arc::strong_count(&self.0) == 2 {
            STRINGS.remove(&self.0);
        }
    }
}

impl Hash for Text {
    fn hash<H: Hasher>(&self, state: &mut H) {
        let addr = Arc::as_ptr(&self.0) as usize as u64;
        state.write_u64(addr);
    }
}

impl PartialEq for Text {
    fn eq(&self, other: &Self) -> bool {
        Arc::ptr_eq(&self.0, &other.0)
    }
}

impl Eq for Text {}

Notes:

STRINGS is used for fast lookups and deduplication of Text instances.

KEY_LOCKS is intended to ensure thread-safe access for creating, accessing, or dropping Text objects.

STRINGS and KEY_LOCKS are maintained as separate structures mainly so I don't have the overhaad of a mutex for every interned string.

Drop Semantics: When a Text instance is dropped, it check to see if it should be removed its corresponding entry from the global interning map. The Arc reference count (strong_count) is being tested to be 3: When ready to discard from the interning map, drop will be passed a Text with 2 references: one is held by the STRINGS interning map and one reference is held by the Text instance being dropped. An additional reference is created during the drop function when using KEY_LOCKS.

Upvotes: 1

Views: 74

Answers (1)

cdhowie
cdhowie

Reputation: 169291

It seems like you should be able to remove the mutexes altogether if you use remove_if to handle eviction. Under the hood, this takes out a write lock on the shard containing the bucket and holds it until the entire operation completes, which makes concurrent access to the set-owned Arc impossible. Therefore, you should be able to do something like this:

impl Drop for Text {
    fn drop(&mut self) {
        STRINGS.remove_if(&self.0, |s| Arc::strong_count(s) == 2);
    }
}

However, after making this change (and removing the mutexes) there is a race condition on insertion: if two calls to create_or_get_text are made concurrently and both do not find the string present, they will both try to insert it, resulting in two Text values that contain equal but different Strings.

We can correct this by detecting the case where insertion fails and restarting the operation. In other words, we'll try getting the string from the map, then try inserting it, then start over until one of the two succeeds. This is essentially optimistic locking.

fn create_or_get_text(arc: Arc<String>) -> Self {
    Self(loop {
        if let Some(existing) = STRINGS.get(&arc) {
            break existing.clone();
        }
        if STRINGS.insert(arc.clone()) {
            break arc;
        }
    })
}

Side note: you can make your Hash implementation more resilient against different pointer sizes by just hashing the pointer itself.

impl Hash for Text {
    fn hash<H: Hasher>(&self, state: &mut H) {
        Arc::as_ptr(&self.0).hash(state);
    }
}

Complete code after the above changes:

use dashmap::DashSet;
use std::{
    hash::{Hash, Hasher},
    sync::{Arc, LazyLock},
};

static STRINGS: LazyLock<DashSet<Arc<String>>> = LazyLock::new(DashSet::new);

#[derive(Debug, Clone)]
pub struct Text(Arc<String>);

impl Text {
    fn create_or_get_text(arc: Arc<String>) -> Self {
        Self(loop {
            if let Some(existing) = STRINGS.get(&arc) {
                break existing.clone();
            }
            if STRINGS.insert(arc.clone()) {
                break arc;
            }
        })
    }
}

impl Drop for Text {
    fn drop(&mut self) {
        STRINGS.remove_if(&self.0, |s| Arc::strong_count(s) == 2);
    }
}

impl Hash for Text {
    fn hash<H: Hasher>(&self, state: &mut H) {
        Arc::as_ptr(&self.0).hash(state);
    }
}

impl PartialEq for Text {
    fn eq(&self, other: &Self) -> bool {
        Arc::ptr_eq(&self.0, &other.0)
    }
}

impl Eq for Text {}

Upvotes: 2

Related Questions