Efficiently extract prefix substrings

Question

Currently I'm using the following function to extract prefix substrings:

fn prefix(s: &String, k: usize) -> String {
    s.chars().take(k).collect::<String>()
}

This can then be used for comparisons like so:

let my_string = "ACGT".to_string();
let same = prefix(&my_string, 3) == prefix(&my_string, 2);

However, this allocates a new String for each call to prefix, in addition to the processing for the iteration. Most other languages I'm familiar with have an efficient way to do a comparison like this, using just a view of the strings. Is there a way in Rust?

Shepmaster · Accepted Answer

Yes, you can take subslices of strings using the Index operation:

fn prefix(s: &str, k: usize) -> &str {
    &s[..k]
}

fn main() {
    let my_string = "ACGT".to_string();
    let same = prefix(&my_string, 3) == prefix(&my_string, 2);
    println!("{}", same);
}

Note that slicing a string uses bytes as the unit, not characters. It is up to the programmer to ensure that the slice bounds lie on valid UTF-8 character boundaries. Additionally, you have to ensure that you don't try to slice past the end of the string. Violating either condition results in a panic.
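If you would rather handle those two failure cases than panic, a minimal sketch using the standard library's str::get could look like the following. The name try_prefix is made up for illustration; here k is still a byte offset, not a character count.

```rust
/// Hypothetical helper: the first `k` bytes of `s` as a string slice.
/// Returns `None` if `k` is past the end of `s` or falls inside a
/// multi-byte character, instead of panicking like `&s[..k]` would.
fn try_prefix(s: &str, k: usize) -> Option<&str> {
    // `str::get` does the same bounds and char-boundary checks as
    // indexing, but reports failure as `None`.
    s.get(..k)
}
```

The caller then decides what an invalid offset means, for example by matching on the Option or falling back to the whole string with unwrap_or(s).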

A slightly more defensive version would be:

fn prefix(s: &str, k: usize) -> &str {
    let idx = s.char_indices().nth(k).map(|(idx, _)| idx).unwrap_or(s.len());
    &s[0..idx]
}

The key difference is that we use the char_indices iterator, which tells us the byte offsets corresponding to a character. Indexing into a UTF-8 string is an O(n) operation, and Rust doesn't want to hide that algorithmic complexity from you. This still isn't even complete, because there can be combining characters, for example. Dealing with strings is hard, thanks to the complexity of human language.
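To illustrate the combining-character caveat, here is a small sketch. It repeats the char_indices-based prefix and applies it to an "é" written as a plain 'e' followed by U+0301 (combining acute accent). The helper split_accent_example is hypothetical and exists only to show the effect.

```rust
/// Prefix of at most `k` chars (Unicode scalar values), as in the answer.
fn prefix(s: &str, k: usize) -> &str {
    let idx = s.char_indices().nth(k).map(|(idx, _)| idx).unwrap_or(s.len());
    &s[..idx]
}

/// Hypothetical demonstration: take a one-char prefix of "ét" where the
/// "é" is stored in decomposed form ('e' + U+0301 combining acute accent).
fn split_accent_example() -> &'static str {
    let decomposed = "e\u{301}t";
    // The slice is valid UTF-8, but it is only the bare 'e':
    // the accent belongs to the next char and is cut off.
    prefix(decomposed, 1)
}
```

The result is a valid string, just not what a reader would call "the first character". Splitting on what users perceive as characters (grapheme clusters) is not covered by the standard library and would need an external crate.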

"Most other languages I'm familiar with have an efficient way"

Doubtful :-) To be efficient in time, they'd have to know how many bytes to skip ahead for every character. Either they'd have to keep a lookup table for every string or use a fixed-size character encoding. Both of those solutions can use more memory than needed, and a fixed-size encoding doesn't even give you whole characters once combining characters are involved.
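To make the "how many bytes to skip" point concrete, here is a rough sketch that lists the UTF-8 width of each character in a string. The function name byte_widths is invented for this example.

```rust
/// Hypothetical helper: the number of UTF-8 bytes each char of `s` occupies.
fn byte_widths(s: &str) -> Vec<usize> {
    // `char::len_utf8` returns 1 to 4 depending on the code point.
    s.chars().map(|c| c.len_utf8()).collect()
}
```

For a string made of 'A', U+00E9, U+20AC and U+1F980, this gives widths of 1, 2, 3 and 4 bytes, so the byte offset of the n-th character can only be found by walking the string from the start.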

Of course, other languages could just say "LOL, strings are just arrays of bytes, good luck with treating them correctly", and efficiently ignore your character encoding...

Two additional notes

Your predicate doesn't really make sense. A string of 2 letters will never match one of 3 letters. For strings to match, they must have the same number of bytes.
You should never need to take &String as a function argument. Taking a &str is a more accepting argument in all cases except for one teeny tiny little case that no one needs: knowing the capacity of a String, but without being able to modify the string.
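Both notes can be sketched together: a comparison that uses the same k on two different strings, written against &str so it accepts a String, a literal, or any other slice. The name same_prefix is hypothetical, and the char-based prefix from above is repeated so the snippet stands alone.

```rust
/// Prefix of at most `k` chars, borrowed from `s` without allocating.
fn prefix(s: &str, k: usize) -> &str {
    let idx = s.char_indices().nth(k).map(|(idx, _)| idx).unwrap_or(s.len());
    &s[..idx]
}

/// Hypothetical helper: do `a` and `b` agree on their first `k` chars?
/// If either string is shorter than `k`, its whole content is compared.
fn same_prefix(a: &str, b: &str, k: usize) -> bool {
    prefix(a, k) == prefix(b, k)
}
```

Given let owned = String::from("ACGT"), a call such as same_prefix(&owned, "ACGA", 3) compiles because &String coerces to &str automatically, and no new String is allocated for the comparison.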
