How can I extract keys and values from a string when the value contains the separator between keys and values or the separator between pairs?

Question

How can I efficiently extract the key-value pairs from a string into a HashMap when

key is always followed by : and then the value
value ends with a , followed by another key (sometimes whitespace and then key)
value can contain , : throughout
no value will include any key
the order of the keys are not fixed
the key names are known

For these key-value pairs

key1:value1, key2:this is, some value2, key3:anothe:r val,ue,

It should produce this HashMap:

"key1", "value1"
"key2", "this is, some value2"
"key3", "anothe:r val,ue"

I have tried the following code but it is no good with just a , as a delimiter as the value can contain commas throughout.

"key1:value1, key2:this is, some value2, key3:anothe:r val,ue,"
    .split(",")
    .map(|kv| kv.splitn(2, ":").collect::>())
    .filter(|vec| vec.len() == 2)
    .map(|vec| (vec[0].trim().into(), vec[1].trim().into()))
    .collect()

My thought would be to provide a list of keys: ["key1", "key2", "key3"] to use as delimiters

UPDATE:

Using @Lucretiel answer I have come up with:

fn key_value<'a>(keys: &[&str], mut command: &'a str) -> HashMap<&'a str, &'a str> {
    let mut hashmap = HashMap::new();
    loop {
        if let Some(key) = key(&keys, &command) {
            command = &command[key.len() + 1..];

            let value = value(&keys, &command);
            let trim: &[_] = &[',', ' '];
            command = &command[value.len()..].trim_start_matches(trim);

            hashmap.insert(key, value);
        } else {
            break;
        }
    }
    hashmap
}

fn key<'a>(keys: &[&str], command: &'a str) -> Option<&'a str> {
    let regex = format!("^({}):", keys.join("|"));
    let regex = regex::Regex::new(®ex).expect("Invalid regex");
    match regex.shortest_match(&command) {
        Some(position) => Some(&command[..position - 1]),
        None => None,
    }
}

fn value<'a>(keys: &[&str], command: &'a str) -> &'a str {
    let regex = format!(r#",\s*({}):"#, keys.join("|"));
    let regex = regex::Regex::new(®ex).expect("Invalid regex");
    match regex.find(&command) {
        Some(position) => &command[..position.start()],
        None => command,
    }
}

(Playground)

Lucretiel · Accepted Answer

The actual code to solve this is nontrivial, but it can be done. There's a lot of little fiddly edge cases, depending on what error cases you want to account for (for instance, do you require that every key in your known key list is present in the input string to parse? Do you allow duplicate keys? etc.). The basic algorithm looks like this:

while the key list isn't empty:
- find the key that starts the string, matching ^{key}:. This is the current key.
  - if there is no such key, it's an error; the input is malformed
- find the next earliest key in the string, matching ,\s*{key}:. This is the next key.
  - if there are no more keys, the remainder of the string is the value for this key
  - otherwise, all the content between the two found keys is the current value
- add (current key, current value) to your hash table
- remove current key from the key list (assuming you don't accept duplicate keys)
- slice the (current key, current value) off the front of your input string
Once you're out of keys, return the hash map

There's no way to do this with a conventional grammar; as presented it's very ambiguous. However, if you structure your parsing around scanning for each subsequent key (assuming that keys never appear as substrings in values) you can successfully parse this kind of input.

The algorithm as described runs in quadratic time, but hypothetically it should be reducible to linear time if you create a composite regular expression to search for every key simultaneously:

,\s*(key1|key2|key3|...):

How can I extract keys and values from a string when the value contains the separator between keys and values or the separator between pairs?

Answers (2)

Related Questions