Reputation: 753
How can I efficiently extract the key-value pairs from a string into a HashMap
when
key
is always followed by :
and then the valuevalue
ends with a ,
followed by another key
(sometimes whitespace and then key
)value
can contain , :
throughoutvalue
will include any key
key
s are not fixedkey
names are knownFor these key-value pairs
key1:value1, key2:this is, some value2, key3:anothe:r val,ue,
It should produce this HashMap
:
"key1", "value1"
"key2", "this is, some value2"
"key3", "anothe:r val,ue"
I have tried the following code but it is no good with just a ,
as a delimiter as the value can contain commas throughout.
"key1:value1, key2:this is, some value2, key3:anothe:r val,ue,"
.split(",")
.map(|kv| kv.splitn(2, ":").collect::<Vec<&str>>())
.filter(|vec| vec.len() == 2)
.map(|vec| (vec[0].trim().into(), vec[1].trim().into()))
.collect()
My thought would be to provide a list of keys: ["key1", "key2", "key3"]
to use as delimiters
UPDATE:
Using @Lucretiel answer I have come up with:
fn key_value<'a>(keys: &[&str], mut command: &'a str) -> HashMap<&'a str, &'a str> {
let mut hashmap = HashMap::new();
loop {
if let Some(key) = key(&keys, &command) {
command = &command[key.len() + 1..];
let value = value(&keys, &command);
let trim: &[_] = &[',', ' '];
command = &command[value.len()..].trim_start_matches(trim);
hashmap.insert(key, value);
} else {
break;
}
}
hashmap
}
fn key<'a>(keys: &[&str], command: &'a str) -> Option<&'a str> {
let regex = format!("^({}):", keys.join("|"));
let regex = regex::Regex::new(®ex).expect("Invalid regex");
match regex.shortest_match(&command) {
Some(position) => Some(&command[..position - 1]),
None => None,
}
}
fn value<'a>(keys: &[&str], command: &'a str) -> &'a str {
let regex = format!(r#",\s*({}):"#, keys.join("|"));
let regex = regex::Regex::new(®ex).expect("Invalid regex");
match regex.find(&command) {
Some(position) => &command[..position.start()],
None => command,
}
}
Upvotes: 5
Views: 1399
Reputation: 3344
The actual code to solve this is nontrivial, but it can be done. There's a lot of little fiddly edge cases, depending on what error cases you want to account for (for instance, do you require that every key in your known key list is present in the input string to parse? Do you allow duplicate keys? etc.). The basic algorithm looks like this:
^{key}:
. This is the current key.
,\s*{key}:
. This is the next key.
There's no way to do this with a conventional grammar; as presented it's very ambiguous. However, if you structure your parsing around scanning for each subsequent key (assuming that keys never appear as substrings in values) you can successfully parse this kind of input.
The algorithm as described runs in quadratic time, but hypothetically it should be reducible to linear time if you create a composite regular expression to search for every key simultaneously:
,\s*(key1|key2|key3|...):
Upvotes: 3
Reputation: 10464
This isn't as clean as using iterators, but here's one idea. Reading the keys and values is difficult if you read the string from the start due to having to do lookahead to tell whether what you're reading is still part of the value or the next key.
Reading the string backwards, however, is much easier. The last value is everything after the last ':'
. The last key is everything from the last ','
before that to that last ':'
.
For example we'll use your string.
"key1:value1, key2:this is, some value2, key3:another val,ue,"
^ the last ':'
"key1:value1, key2:this is, some value2, key3:another val,ue,"
^ the last ',' before that
"key1:value1, key2:this is, some value2, key3:another val,ue,"
^ the last ':' before that
"key1:value1, key2:this is, some value2, key3:another val,ue,"
^ the last ',' before that
"key1:value1, key2:this is, some value2, key3:another val,ue,"
^ the last ':' before that
As you can see, this perfectly splits up the string into keys and values.
To actually code this, we'll have a slice which always refers to the part of the string we haven't covered yet. At each step we'll find the last ':'
(or ','
) and change the slice to point before that. Using rsplitn
works fairly well here, but I'm sure there's another way.
fn main() {
let mut kv = Vec::new();
let mut slice = "key1:value1, key2:this is, some value2, key3:another val,ue,";
while !slice.is_empty() {
let mut split = slice.rsplitn(2, ':');
// `rsplitn` will always return at least one slice,
// namely the whole string if there aren't any matches.
// So we can unwrap here.
let value = split.next().unwrap().trim();
// You may want to decide to do something else here.
// The only way `split.next()` will be `None` is if
// The input string has incorrect syntax.
slice = split.next().unwrap_or("");
let mut split = slice.rsplitn(2, ',');
// similar reasoning here
let key = split.next().unwrap().trim();
slice = split.next().unwrap_or("");
kv.push((key, value));
}
println!("{:?}", kv);
}
Just one note. The code above counts any trailing commas as part of the last value. If you don't want that, you can do a check. Don't forget to trim
the string first (and this may be worth doing anyway)! In the future, strip_suffix
would work well here. For now, str::ends_with
should do fine.
Upvotes: 2