James
James

Reputation: 753

How can I extract keys and values from a string when the value contains the separator between keys and values or the separator between pairs?

How can I efficiently extract the key-value pairs from a string into a HashMap when

For these key-value pairs

key1:value1, key2:this is, some value2, key3:anothe:r val,ue,

It should produce this HashMap:

"key1", "value1"
"key2", "this is, some value2"
"key3", "anothe:r val,ue"

I have tried the following code but it is no good with just a , as a delimiter as the value can contain commas throughout.

"key1:value1, key2:this is, some value2, key3:anothe:r val,ue,"
    .split(",")
    .map(|kv| kv.splitn(2, ":").collect::<Vec<&str>>())
    .filter(|vec| vec.len() == 2)
    .map(|vec| (vec[0].trim().into(), vec[1].trim().into()))
    .collect()

My thought would be to provide a list of keys: ["key1", "key2", "key3"] to use as delimiters

UPDATE:

Using @Lucretiel answer I have come up with:

fn key_value<'a>(keys: &[&str], mut command: &'a str) -> HashMap<&'a str, &'a str> {
    let mut hashmap = HashMap::new();
    loop {
        if let Some(key) = key(&keys, &command) {
            command = &command[key.len() + 1..];

            let value = value(&keys, &command);
            let trim: &[_] = &[',', ' '];
            command = &command[value.len()..].trim_start_matches(trim);

            hashmap.insert(key, value);
        } else {
            break;
        }
    }
    hashmap
}

fn key<'a>(keys: &[&str], command: &'a str) -> Option<&'a str> {
    let regex = format!("^({}):", keys.join("|"));
    let regex = regex::Regex::new(&regex).expect("Invalid regex");
    match regex.shortest_match(&command) {
        Some(position) => Some(&command[..position - 1]),
        None => None,
    }
}

fn value<'a>(keys: &[&str], command: &'a str) -> &'a str {
    let regex = format!(r#",\s*({}):"#, keys.join("|"));
    let regex = regex::Regex::new(&regex).expect("Invalid regex");
    match regex.find(&command) {
        Some(position) => &command[..position.start()],
        None => command,
    }
}

(Playground)

Upvotes: 5

Views: 1399

Answers (2)

Lucretiel
Lucretiel

Reputation: 3344

The actual code to solve this is nontrivial, but it can be done. There's a lot of little fiddly edge cases, depending on what error cases you want to account for (for instance, do you require that every key in your known key list is present in the input string to parse? Do you allow duplicate keys? etc.). The basic algorithm looks like this:

  • while the key list isn't empty:
    • find the key that starts the string, matching ^{key}:. This is the current key.
      • if there is no such key, it's an error; the input is malformed
    • find the next earliest key in the string, matching ,\s*{key}:. This is the next key.
      • if there are no more keys, the remainder of the string is the value for this key
      • otherwise, all the content between the two found keys is the current value
    • add (current key, current value) to your hash table
    • remove current key from the key list (assuming you don't accept duplicate keys)
    • slice the (current key, current value) off the front of your input string
  • Once you're out of keys, return the hash map

There's no way to do this with a conventional grammar; as presented it's very ambiguous. However, if you structure your parsing around scanning for each subsequent key (assuming that keys never appear as substrings in values) you can successfully parse this kind of input.

The algorithm as described runs in quadratic time, but hypothetically it should be reducible to linear time if you create a composite regular expression to search for every key simultaneously:

,\s*(key1|key2|key3|...):

Upvotes: 3

SCappella
SCappella

Reputation: 10464

This isn't as clean as using iterators, but here's one idea. Reading the keys and values is difficult if you read the string from the start due to having to do lookahead to tell whether what you're reading is still part of the value or the next key.

Reading the string backwards, however, is much easier. The last value is everything after the last ':'. The last key is everything from the last ',' before that to that last ':'.

For example we'll use your string.

"key1:value1, key2:this is, some value2, key3:another val,ue,"
                                             ^ the last ':'
"key1:value1, key2:this is, some value2, key3:another val,ue,"
                                       ^ the last ',' before that
"key1:value1, key2:this is, some value2, key3:another val,ue,"
                  ^ the last ':' before that
"key1:value1, key2:this is, some value2, key3:another val,ue,"
            ^ the last ',' before that
"key1:value1, key2:this is, some value2, key3:another val,ue,"
     ^ the last ':' before that

As you can see, this perfectly splits up the string into keys and values.

To actually code this, we'll have a slice which always refers to the part of the string we haven't covered yet. At each step we'll find the last ':' (or ',') and change the slice to point before that. Using rsplitn works fairly well here, but I'm sure there's another way.

fn main() {
    let mut kv = Vec::new();
    let mut slice = "key1:value1, key2:this is, some value2, key3:another val,ue,";
    while !slice.is_empty() {
        let mut split = slice.rsplitn(2, ':');
        // `rsplitn` will always return at least one slice,
        // namely the whole string if there aren't any matches.
        // So we can unwrap here.
        let value = split.next().unwrap().trim();
        // You may want to decide to do something else here.
        // The only way `split.next()` will be `None` is if
        // The input string has incorrect syntax.
        slice = split.next().unwrap_or("");

        let mut split = slice.rsplitn(2, ',');
        // similar reasoning here
        let key = split.next().unwrap().trim();
        slice = split.next().unwrap_or("");

        kv.push((key, value));
    }
    println!("{:?}", kv);
}

(playground)

Just one note. The code above counts any trailing commas as part of the last value. If you don't want that, you can do a check. Don't forget to trim the string first (and this may be worth doing anyway)! In the future, strip_suffix would work well here. For now, str::ends_with should do fine.

Upvotes: 2

Related Questions