Vas

Reputation: 343

Regex to match a specific sequence of strings

Assume I have two arrays of strings: position1 = ['word1', 'word2', 'word3'] and position2 = ['word4', 'word1'].

Within a text string, I want to check whether the substring #{target} is followed by one of the words in position1, preceded by one of the words in position2, or both at the same time. In other words, I want to look to the left and right of #{target}.

For example, in the sentence "Writing reports and inputting data onto internal systems, with regards to enforcement and immigration papers", if the target word is data, I would like to check whether the word on the left (inputting) and the word on the right (onto) are included in the arrays, or whether any of the words in the arrays produces a regex match. Any suggestions? I am using Ruby and have tried some regexes, but I can't make it work yet. I also have to ignore any special characters in between.

One of my attempts:

/^.*\b(#{joined_position1})\b.*$[\s,.:-_]*\b#{target}\b[\s,.:-_\\\/]*^.*\b(#{joined_position2})\b.*$/i

Edit:

I found a regex that captures the word to the left and the word to the right:

(\S+)\s*#{target}\s*(\S+)

However, what should I change if I want to capture more than one word on the left and right?

Upvotes: 0

Views: 251

Answers (1)

Sebastian Lenartowicz

Reputation: 4874

If you have two arrays of strings, what you can do is something like this:

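# Capture the word on each side of the target; Regexp.escape guards
# against regex metacharacters in the target itself.
matches = /^.+ (\S+) #{Regexp.escape(target)} (\S+) .+$/.match(text)
if matches && (position1.include?(matches[1]) || position2.include?(matches[2]))
  do_something()
end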

What this regex does is match the target word in your text and extract the words next to it using capture groups. The code then compares those words against your arrays, and does something if they're in the right places. A more general version of this might look like:

def checkWords(target, text, leftArray, rightArray, numLeft = 1, numRight = 1)
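    # Build the regex. The capture groups use single quotes because a
    # double-quoted "\S" is reduced to a plain "S" by Ruby.
    regex = "^.+"
    regex += ' (\S+)' * numLeft
    regex += " #{Regexp.escape(target)}"
    regex += ' (\S+)' * numRight
    regex += " .+$"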

    pattern = Regexp.new(regex)
    matches = pattern.match(text)

    return false if !matches

    for i in 1..numLeft
        return false if (!leftArray.include?(matches[i]))
    end

    for i in 1..numRight
        return false if (!rightArray.include?(matches[numLeft + i]))
    end

    return true
end

Which can then be invoked like this:

do_something() if checkWords("data", text, position1, position2, 2, 2)

I'm pretty sure it's not terribly idiomatic, but it gives you a general sense of how you could do what you want in a more general way.
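Since the question also asks to ignore special characters between the words, one alternative is to skip the big regex and tokenise the text instead. The following is only a rough sketch under a few assumptions: the helper name neighbours_match? and its window: keyword are made up for illustration, matching is case-insensitive, only the first occurrence of the target is checked, and a hit on either side is enough (swap || for && if both sides must match).

```ruby
# Hypothetical helper, not part of the code above: true when any of the
# `window` words on either side of `target` appears in the matching list.
def neighbours_match?(text, target, left_words, right_words, window: 1)
  # Tokenise on word characters so commas, dots, dashes etc. are ignored.
  tokens = text.downcase.scan(/\w+/)

  # Assumption: only the first occurrence of the target is considered.
  index = tokens.index(target.downcase)
  return false if index.nil?

  # Words immediately before and after the target, clipped at the text edges.
  left  = tokens[[index - window, 0].max...index]
  right = tokens[(index + 1)..(index + window)] || []

  # Assumption: a hit on either side is enough.
  (left & left_words.map(&:downcase)).any? ||
    (right & right_words.map(&:downcase)).any?
end
```

With the sentence from the question, calling neighbours_match?(text, "data", position2, position1) would compare "inputting" against position2 and "onto" against position1; passing window: 2 widens the check to two words on each side.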

Upvotes: 1
