Christian

Reputation: 3393

Regex on a list of words as input

I have sentences in form of list of words, for example

sentence = ['if', 'it', 'will', 'rain', ',', 'I', 'will', 'stay', 'at', 'home']

Now I would like to find the conditional clause ['if', 'it', 'will', 'rain']. In principle, I can create a string from the sentence, e.g. s = ' '.join(sentence), and then use regular expressions:

import re
s = ' '.join(sentence)
p = re.compile(r'(\bif\b[a-zA-Z0-9\'\s]+)\s*(,*)\s*(then|,)')
for m in p.finditer(s):
    print(m.start(1), m.end(1), '[' + s[m.start(1):m.end(1)] + ']')

No need to judge the regex, it's just a quickly sketched one :). This gives me the output: 0 16 [if it will rain ]

So far so good. But now I'm kind of missing the connection to my original list. The regex gives me character positions and not word/token positions. Ideally, I would get 0 and 3, so I would know that the conditional clause is sentence[0:4] (words 0 through 3). I'm sure I can write a method that maps a character position to the corresponding list index, but I'm sure there's a better way to do all this.

Of course, I can ignore regular expressions, loop over the list and come up with the right start and stop conditions. But regular expressions currently seem rather neat since they "hide" the need to make the required conditions explicit. They also simplify the case when the conditional clause is indicated by other words or phrases, e.g.:

sentence = ['as', 'long', 'as', 'it', 'will', 'rain', ',', 'I', 'will', 'stay', 'at', 'home']

This is easy to reflect with a regex, but a bit more annoying with a loop, I assume.

EDIT: Seeing that there's not really a very simple solution, I went ahead with my idea of creating a mapping between the sentence as a string for the regex and the original word list:

def join(word_list, separator=' '):
    mapping = []
    string = separator.join(word_list)
    for idx, word in enumerate(word_list):
        # every character of the word maps to the word's list index
        for character in word:
            mapping.append(idx)
        # the separator that follows the word maps to the same index
        for character in separator:
            mapping.append(idx)
    return string, mapping

Applying this method to my input with string, mapping = join(sentence) results in:

mapping = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 5, 5, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 8, 8, 8, 9, 9, 9, 9, 9]

Now, if the regex gives me 0 and 16 as the range of the match, I can look up the indexes in the original sentence list with mapping[0] = 0 and mapping[16] = 4, i.e. the clause is sentence[0:4]. So far, this seems to work fairly well. And since I use a regex on a string to make the match, I can easily support alternative formulations for the conditional clause, e.g.:

CONDITIONAL_PHRASES = ['if', 'as long as', 'even if']
...
p = re.compile(r"((%s)\s+[a-zA-Z0-9'\s]+)\s*(then|,)" % '|'.join(CONDITIONAL_PHRASES))

Again, I'm not saying that the regex is already perfect, but it supports multiple sentences at once with different indicator words for conditional clauses.
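
For illustration, the whole round trip (join, match, map back to list indexes) could be sketched as follows. This is only a sketch under assumptions: the helper names token_starts, char_to_token and find_clauses are made up for this example, and instead of one mapping entry per character it stores only the start offset of each word and looks positions up with bisect. The regex is the one from above with a leading \b added.

```python
import re
from bisect import bisect_right

def token_starts(words, separator=' '):
    """Return the joined string and the character offset where each word starts."""
    starts, pos = [], 0
    for word in words:
        starts.append(pos)
        pos += len(word) + len(separator)
    return separator.join(words), starts

def char_to_token(starts, char_pos):
    """Index of the word whose span (word plus trailing separator) contains char_pos."""
    return bisect_right(starts, char_pos) - 1

def find_clauses(words, indicators=('if', 'as long as', 'even if')):
    """Return (start, stop) slice bounds into words for every conditional clause."""
    s, starts = token_starts(words)
    alternatives = '|'.join(re.escape(i) for i in indicators)
    pattern = re.compile(r"(\b(?:%s)\s+[a-zA-Z0-9'\s]+)\s*(then|,)" % alternatives)
    spans = []
    for m in pattern.finditer(s):
        first = char_to_token(starts, m.start(1))
        # m.end(1) is exclusive, so look up the last matched character instead
        last = char_to_token(starts, m.end(1) - 1)
        spans.append((first, last + 1))
    return spans

sentence = ['if', 'it', 'will', 'rain', ',', 'I', 'will', 'stay', 'at', 'home']
for start, stop in find_clauses(sentence):
    print(start, stop, sentence[start:stop])
```

The returned pairs are ordinary slice bounds, so sentence[start:stop] is the clause itself.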

Upvotes: 2

Views: 697

Answers (2)

Jongware

Reputation: 22457

Switching to and from a regular expression makes it problematic, because you also have to switch your input to and from a string – and keep them synchronized.

How about a list comparison function in which you have a kind of an OR:

sentence = ['if', 'it', 'will', 'rain', ',', 'I', 'will', 'stay', 'at', 'home']
phrase = ['if', [',', 'then']]

def findPhrase(phrase, full):
  currentpos = 0
  result = []
  for part in phrase:
    if isinstance(part, list):
      # a list is an OR: jump to whichever alternative occurs first
      offsets = [full[currentpos:].index(subpart)
                 for subpart in part if subpart in full[currentpos:]]
      if not offsets:
        return []
      currentpos += min(offsets)
    else:
      if part not in full[currentpos:]:
        return []
      currentpos += full[currentpos:].index(part)
    if not result:
      # first matched part: start and end coincide, so that
      # a single word match still returns a range
      result = [currentpos, currentpos]
    else:
      result[-1] = currentpos
  return result

res = findPhrase(phrase, sentence)
if res == []:
  print('not found')
else:
  print(res)
  print(sentence[res[0]:res[1]+1])

This compares the 'phrase' against the 'sentence', one word at a time, and returns [] if there is no match, and a start:end range if there is.

The output of this is

[0, 4]
['if', 'it', 'will', 'rain', ',']

It is possible to extend the findPhrase function with items such as 'optional' and 'any match', but then you'd have to extend the simple array based syntax to something like a dictionary.

Currently, the code skips from one found word to the next, ignoring anything in between. If you want to add an explicit '*' 'phrase' item, meaning "skip any number of words", you need to (1) test if it's the last item in the match phrase (if so, you can emit the last item of sentence), and/or (2) implement a separate lookahead-like function to check if the next item in phrase is present in sentence. (This comes pretty close to mimicking a regex parser.)
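
As a rough illustration of the '*' idea, here is a minimal sketch, not a drop-in replacement for findPhrase. The names match_at and find_phrase_strict are invented for this example. Unlike findPhrase, it assumes that neighbouring phrase items must match neighbouring words unless a '*' sits between them; a trailing '*' swallows the rest of the sentence, and the lookahead is done by simply trying the rest of the phrase at each later position.

```python
def match_at(pattern, tokens, pos):
    """Return the index just past a match of pattern starting at tokens[pos], or None."""
    if not pattern:
        return pos
    head, rest = pattern[0], pattern[1:]
    if head == '*':
        if not rest:
            # trailing wildcard: everything up to the end of the sentence matches
            return len(tokens)
        # lookahead: try the shortest skip first, then progressively longer ones
        for skip_to in range(pos, len(tokens) + 1):
            end = match_at(rest, tokens, skip_to)
            if end is not None:
                return end
        return None
    # a plain string is treated as a one-item list of alternatives
    alternatives = head if isinstance(head, list) else [head]
    if pos < len(tokens) and tokens[pos] in alternatives:
        return match_at(rest, tokens, pos + 1)
    return None

def find_phrase_strict(pattern, tokens):
    """Return [start, end] (inclusive) of the first match, or [] if there is none."""
    if not pattern:
        return []
    for start in range(len(tokens)):
        end = match_at(pattern, tokens, start)
        if end is not None:
            return [start, end - 1]
    return []

sentence = ['if', 'it', 'will', 'rain', ',', 'I', 'will', 'stay', 'at', 'home']
print(find_phrase_strict(['if', '*', [',', 'then']], sentence))  # [0, 4]
```

Because adjacency is enforced, a multi-word indicator such as ['as', 'long', 'as', '*', [',', 'then']] only matches when the three words really are consecutive.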

Upvotes: 1

rock321987

Reputation: 11032

NOTE: The following assumes that there is only one occurrence of if and of , or then in the sentence.

I have modified your regex a little bit to include one more capturing group

p = re.compile("((\\bif\\b)[a-zA-Z0-9\\'\\s]+)\\s*(,*)\\s*(then|,)")

You can use re.findall for this as

arr = re.findall(p, s)

arr[0][1] contains the second capturing group (the string if) and arr[0][3] contains the fourth capturing group (the string then or ,). You can use index to find the positions of these two in the list:

start = sentence.index(arr[0][1])
end = sentence.index(arr[0][3])

Now, you can form the string using

stri = ' '.join(sentence[start: end])

NOTE 1: If there is more than one (non-overlapping) occurrence of if and of , or then in the sentence, you will have to iterate over all the tuples:

arr = re.findall(p, s)
pos = 0  # list index just past the end of the previous match
for x in arr:
    start = sentence.index(x[1], pos)
    end = sentence.index(x[3], start)  # look for the delimiter after the clause start
    stri = ' '.join(sentence[start:end])
    print(stri)
    pos = end + 1

Ideone Demo

NOTE 2: Keep in mind that list.index raises a ValueError if the item is not found. Handle that before doing the above.
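
One possible way to handle it is sketched below. The function name clause_slices is made up for this example, and skipping a match whose words cannot be located is just one choice among several. A lookup can fail, for instance, when the comma is glued to a word ('rain,') and so never appears as a list item of its own.

```python
import re

p = re.compile(r"((\bif\b)[a-zA-Z0-9'\s]+)\s*(,*)\s*(then|,)")

def clause_slices(sentence):
    """Return (start, end) slice bounds for each clause whose words can be located."""
    s = ' '.join(sentence)
    slices = []
    pos = 0  # list index just past the end of the previous clause
    for groups in p.findall(s):
        try:
            start = sentence.index(groups[1], pos)
            end = sentence.index(groups[3], start)
        except ValueError:
            # the matched text is not a list item of its own; skip this match
            continue
        slices.append((start, end))
        pos = end + 1
    return slices

sentence = ['if', 'it', 'will', 'rain', ',', 'I', 'will', 'stay', 'at', 'home']
for start, end in clause_slices(sentence):
    print(' '.join(sentence[start:end]))
```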

Upvotes: 1
