anderspitman
anderspitman

Reputation: 10530

Efficiently extract prefix substrings

Currently I'm using the following function to extract prefix substrings:

fn prefix(s: &String, k: usize) -> String {
    s.chars().take(k).collect::<String>()
}

This can then be used for comparisons like so:

let my_string = "ACGT".to_string();
let same = prefix(&my_string, 3) == prefix(&my_string, 2);

However, this allocates a new String for each call to prefix, in addition to the processing for the iteration. Most other languages I'm familiar with have an efficient way to do a comparison like this, using just a view of the strings. Is there a way in Rust?

Upvotes: 2

Views: 3046

Answers (2)

Shepmaster
Shepmaster

Reputation: 431669

Yes, you can take subslices of strings using the Index operation:

fn prefix(s: &str, k: usize) -> &str {
    &s[..k]
}

fn main() {
    let my_string = "ACGT".to_string();
    let same = prefix(&my_string, 3) == prefix(&my_string, 2);
    println!("{}", same);
}

Note that slicing a string uses bytes as the unit, not characters. It is up to the programmer to ensure that the slice lengths lie on valid UTF-8 boundaries. Additionally, you have to ensure that you don't try to slice past the end of the string. Breaking either of these will result in a panic!.

A bit more defensive version would be

fn prefix(s: &str, k: usize) -> &str {
    let idx = s.char_indices().nth(k).map(|(idx, _)| idx).unwrap_or(s.len());
    &s[0..idx]
}

The key difference is that we use the char_indices iterator, which tells us the byte offsets corresponding to a character. Indexing into a UTF-8 string is an O(n) operation, and Rust doesn't want to hide that algorithmic complexity from you. This still isn't even complete, because there can be combining characters, for example. Dealing with strings is hard, thanks to the complexity of human language.

Most other languages I'm familiar with have an efficient way

Doubtful :-) To be efficient in time, they'd have to know how many bytes to skip ahead for every character. Either they'd have to keep a lookup table for every string or use a fixed-size character encoding. Both of those solutions can use more memory than needed, and a fixed size encoding doesn't even work when you have combining characters, for example.

Of course, other languages could just say "LOL, strings are just arrays of bytes, good luck with treating them correctly", and efficiently ignore your character encoding...

Two additional notes

  1. Your predicate doesn't really make sense. A string of 2 letters will never match one of 3 letters. For strings to match, they must have the same amount of bytes.

  2. You should never need to take &String as a function argument. Taking a &str is a more accepting argument in all cases except for one teeny tiny little case that no one needs — knowing the capacity of a String, but without being able to modify the string.

Upvotes: 6

Vladimir Matveev
Vladimir Matveev

Reputation: 128031

While Shepmaster's answer is absolutely correct for the general case of string slicing, I'd like to add that sometimes there are easier ways.

If you know in advance the set of characters you're working with ("ATGC" example suggests you're working with nucleobases, so it is possible that these are all the characters you need) then you can use slices of bytes &[u8] instead of string slices &str. You can always get a byte slice out of a string slice and a Vec<u8> out of a String, if necessary:

let s: String = "ATGC".into();
let ss: &str = &s;

let b: Vec<u8> = s.into_bytes();
let bs: &[u8] = ss.as_slice();

Also, there are byte slice and byte character literals, just prefix regular string/char literals with b:

let sl: &[u8] = b"ATGC";
let bl: u8 = b'G'; 

Working with byte slices give you constant-time indexing (and thus slicing) operations, so checking for prefix equality is easy (like Shepmaster's first variant but without possibility of panics (unless k is too large):

fn prefix(s: &[u8], k: usize) -> &[u8] {
    &s[..k]
}

If you need, you can turn byte slices/vectors back to strings. This operation, of course, checks validity of UTF-8 encoding so it may fail, but if you only work with ASCII, you can safely ignore these errors and just unwrap():

let ss2: &str = str::from_utf8(bs).unwrap();
let s2: String = String::from_utf8(b).unwrap();

Upvotes: 4

Related Questions