Reputation: 10530
Currently I'm using the following function to extract prefix substrings:
fn prefix(s: &String, k: usize) -> String {
s.chars().take(k).collect::<String>()
}
This can then be used for comparisons like so:
let my_string = "ACGT".to_string();
let same = prefix(&my_string, 3) == prefix(&my_string, 2);
However, this allocates a new String
for each call to prefix
, in addition to the processing for the iteration. Most other languages I'm familiar with have an efficient way to do a comparison like this, using just a view of the strings. Is there a way in Rust?
Upvotes: 2
Views: 3046
Reputation: 431669
Yes, you can take subslices of strings using the Index
operation:
fn prefix(s: &str, k: usize) -> &str {
&s[..k]
}
fn main() {
let my_string = "ACGT".to_string();
let same = prefix(&my_string, 3) == prefix(&my_string, 2);
println!("{}", same);
}
Note that slicing a string uses bytes as the unit, not characters. It is up to the programmer to ensure that the slice lengths lie on valid UTF-8 boundaries. Additionally, you have to ensure that you don't try to slice past the end of the string. Breaking either of these will result in a panic!
.
A bit more defensive version would be
fn prefix(s: &str, k: usize) -> &str {
let idx = s.char_indices().nth(k).map(|(idx, _)| idx).unwrap_or(s.len());
&s[0..idx]
}
The key difference is that we use the char_indices
iterator, which tells us the byte offsets corresponding to a character. Indexing into a UTF-8 string is an O(n)
operation, and Rust doesn't want to hide that algorithmic complexity from you. This still isn't even complete, because there can be combining characters, for example. Dealing with strings is hard, thanks to the complexity of human language.
Most other languages I'm familiar with have an efficient way
Doubtful :-) To be efficient in time, they'd have to know how many bytes to skip ahead for every character. Either they'd have to keep a lookup table for every string or use a fixed-size character encoding. Both of those solutions can use more memory than needed, and a fixed size encoding doesn't even work when you have combining characters, for example.
Of course, other languages could just say "LOL, strings are just arrays of bytes, good luck with treating them correctly", and efficiently ignore your character encoding...
Two additional notes
Your predicate doesn't really make sense. A string of 2 letters will never match one of 3 letters. For strings to match, they must have the same amount of bytes.
You should never need to take &String
as a function argument. Taking a &str
is a more accepting argument in all cases except for one teeny tiny little case that no one needs — knowing the capacity of a String
, but without being able to modify the string.
Upvotes: 6
Reputation: 128031
While Shepmaster's answer is absolutely correct for the general case of string slicing, I'd like to add that sometimes there are easier ways.
If you know in advance the set of characters you're working with ("ATGC" example suggests you're working with nucleobases, so it is possible that these are all the characters you need) then you can use slices of bytes &[u8]
instead of string slices &str
. You can always get a byte slice out of a string slice and a Vec<u8>
out of a String
, if necessary:
let s: String = "ATGC".into();
let ss: &str = &s;
let b: Vec<u8> = s.into_bytes();
let bs: &[u8] = ss.as_slice();
Also, there are byte slice and byte character literals, just prefix regular string/char literals with b
:
let sl: &[u8] = b"ATGC";
let bl: u8 = b'G';
Working with byte slices give you constant-time indexing (and thus slicing) operations, so checking for prefix equality is easy (like Shepmaster's first variant but without possibility of panics (unless k
is too large):
fn prefix(s: &[u8], k: usize) -> &[u8] {
&s[..k]
}
If you need, you can turn byte slices/vectors back to strings. This operation, of course, checks validity of UTF-8 encoding so it may fail, but if you only work with ASCII, you can safely ignore these errors and just unwrap()
:
let ss2: &str = str::from_utf8(bs).unwrap();
let s2: String = String::from_utf8(b).unwrap();
Upvotes: 4