Scott

Reputation: 1247

Advanced regex: Split string based on multiple variations of different names, retain delimiters in their own array item

I'm trying to build a JavaScript program that swaps multiple variations of two names with each other.

For example, if I had a string:

let string = "This is Donald Trump and I am Donald J. Trump and I have replaced Barack Obama and Obama was before me."

I would want the output to be:

newString = "This is Barack Obama and I am Barack H. Obama and I have replaced Donald Trump and Trump was before me."

My strategy was to use

 let arr = string.split(regex)

in such a way that each chunk of text before and after a regex match is its own index, and each regex match is its own index too. For example:

["This is ", "Donald Trump", " and I am ", "Donald J. Trump", " and I have replaced ", "Barack Obama", " and ", "Obama", " was before me."];

Then check each item of the array to see if it needs to be "switched." For example:

for (let i = 0; i < arr.length; i++) {
  // if arr[i] == Donald J. Trump, Donald Trump, or Trump, arr[i] = equivalent Obama variation
  // else if arr[i] == Barack H. Obama, Barack Obama, or Obama, arr[i] = equivalent Trump variation
  // else arr[i] = arr[i]
}
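// join with an empty string: the chunks already contain their own spaces
let newString = arr.join("");
// innerHTML is a property, not a method
htmlElement.innerHTML = newString;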

Here's my regex

let regex = /\b(Barack\s)?(H\.\s)?Obama|\b(Donald\s)?(J\.\s)?Trump/;

The regex seems to correctly match all variations of the names.

However, when I write

arr = string.split(regex)

my arr looks like this:

["This is ", undefined, undefined, "Donald ", undefined, " and I am ", undefined, undefined, "Donald ", "J. ", " and I have replaced ", "Barack ", undefined, undefined, undefined, " and ", undefined, undefined, undefined, undefined, " was before me."];

Is there a way to split the string by the multiple variations of the delimiter, but also retain the delimiter in its own array item?

Upvotes: 0

Views: 150

Answers (2)

ctwheels

Reputation: 22817

Code

I took a different approach to your problem. Instead of searching for specific names, I created a regex that captures full names (assuming each part of a name begins with a capital letter and either has more than 1 character or is immediately followed by a dot). I then split each full name on spaces and cross-reference every part against a nameEquivalents object to find the proper replacement.

I am aware that the regex will not catch special cases such as names with two-letter abbreviations, apostrophes, hyphens, or a lowercase first letter, but the OP didn't ask for those (and frankly I'm not too worried about it, since my regex can capture more than the OP's original regex, which has the names written directly into it).

Also, note that the getKeyByValue function is taken from this answer on this question.

let string = "This is Donald Trump and I am Donald J. Trump and I have replaced Barack Obama and Obama was before me."
let regex = /(?: ?\b[A-Z](?:[a-zA-Z]+\b|\.))+/g
let nameEquivalents = {
  "Obama": "Trump",
  "Barack": "Donald",
  "H.": "J."
}

function getKeyByValue(object, value) {
  return Object.keys(object).find(key => object[key] === value);
}

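let newString = string.replace(regex, function(match) {
  // Split the match into words; a leading space yields an empty first item,
  // which passes through unchanged so the original spacing is preserved
  return match.split(" ").map(function(m) {
    if (nameEquivalents.hasOwnProperty(m)) {
      // m is a key (e.g. "Obama"): swap it for its value
      return nameEquivalents[m]
    }
    // m may be a value (e.g. "Trump"): swap it for its key, else keep it
    return getKeyByValue(nameEquivalents, m) || m
  }).join(" ")
})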

console.log(newString)


Explanation

  • (?: ?\b[A-Z](?:[a-zA-Z]+\b|\.))+ Match the following one or more times
    •  ? Optionally match a space character (the token is a literal space followed by ?; the space is easy to miss when the pattern is displayed)
    • \b Assert position as a word boundary
    • [A-Z] Match an uppercase letter
    • (?:[a-zA-Z]+\b|\.) Match either of the following
      • [a-zA-Z]+\b Match any letter one or more times ensuring it's followed by a word boundary
      • \. Match a literal dot
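
To see what the pattern actually picks up, here is a small sketch that lists every match in the sample sentence from the question. The names sample, namePattern and found are only for illustration and are not part of the code above.

```javascript
// Sample sentence from the question
const sample = "This is Donald Trump and I am Donald J. Trump and I have replaced Barack Obama and Obama was before me.";

// Same pattern as in the answer: runs of capitalized words or initials
const namePattern = /(?: ?\b[A-Z](?:[a-zA-Z]+\b|\.))+/g;

// With the g flag, match() returns every full match (leading space included)
const found = sample.match(namePattern);

console.log(found);
// ["This", " Donald Trump", " Donald J. Trump", " Barack Obama", " Obama"]
```

Note that "This" is matched as well, because any capitalized word qualifies, and that the lone "I" is skipped, because a single capital letter must be followed by a dot or more letters.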

Upvotes: 1

Tom Elmore

Reputation: 1980

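I think the parentheses in the regex are being interpreted as capture groups, so in matches that don't fulfill every group you get an undefined entry for each group that did not take part.

Try making the optional groups non-capturing with (?:...) and wrapping the whole lot in a single capture. That keeps every name variation optional while split() adds only the full match to the array.

 /\b((?:Barack\s)?(?:H\.\s)?Obama|(?:Donald\s)?(?:J\.\s)?Trump)/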
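
As a rough sketch of the single-capture idea, assuming the optional name parts are written as non-capturing groups so that every variation still matches: split() then returns each full name as its own item, and a lookup table can do the swap. The swaps table and the variable names are hypothetical, made up for this example.

```javascript
const text = "This is Donald Trump and I am Donald J. Trump and I have replaced Barack Obama and Obama was before me.";

// One outer capture group; the optional parts are non-capturing (?:...)
const nameRegex = /\b((?:Barack\s)?(?:H\.\s)?Obama|(?:Donald\s)?(?:J\.\s)?Trump)/;

// split() inserts the text of the single capture group as its own item
const parts = text.split(nameRegex);

// Hypothetical lookup table: each variation maps to its counterpart
const swaps = {
  "Donald J. Trump": "Barack H. Obama",
  "Donald Trump": "Barack Obama",
  "Trump": "Obama",
  "Barack H. Obama": "Donald J. Trump",
  "Barack Obama": "Donald Trump",
  "Obama": "Trump"
};

// Swap the name items, leave the other chunks alone, join without a separator
const swapped = parts
  .map(part => Object.prototype.hasOwnProperty.call(swaps, part) ? swaps[part] : part)
  .join("");

console.log(parts);
console.log(swapped);
// "This is Barack Obama and I am Barack H. Obama and I have replaced Donald Trump and Trump was before me."
```

Joining with an empty string matters here, because the chunks between the names already carry their own spaces.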

Upvotes: 0
