Reputation: 3393
I have sentences in the form of lists of words, for example
sentence = ['if', 'it', 'will', 'rain', ',', 'I', 'will', 'stay', 'at', 'home']
Now I would like to find the conditional clause ['if', 'it', 'will', 'rain']. In principle, I can create a string from the sentence, e.g. s = ' '.join(sentence), and then use a regular expression:
import re
p = re.compile(r"(\bif\b[a-zA-Z0-9'\s]+)\s*(,*)\s*(then|,)")
for m in p.finditer(s):
    print(m.start(1), m.end(1), '[' + s[m.start(1):m.end(1)] + ']')
No need to judge the regex, it's just a quick sketch :). This gives me the output: 0 16 [if it will rain ]
So far so good. But now I'm missing the connection to my original list. The regex gives me character positions, not word/token positions. Ideally, I would get 0 and 3, so I would know that the conditional clause is sentence[0:4] (tokens 0 through 3). I'm sure I can write a method that maps a character position to the corresponding list index, but I'm also sure there's a better way to do all this.
Of course, I can ignore regular expressions, loop over the list, and come up with the right start and stop conditions. But regular expressions currently seem rather neat since they spare me from making the required conditions explicit. They also simplify the case when the conditional clause is indicated by other words or phrases, e.g.:
sentence = ['as', 'long', 'as', 'it', 'will', 'rain', ',', 'I', 'will', 'stay', 'at', 'home']
Easy to reflect this with a regex, a bit more annoying with a loop, I assume.
EDIT: Seeing that there's not really a very simple solution, I went ahead with my idea of creating a mapping between the sentence as a string for the regex and the original word list:
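def join(word_list, separator=' '):
    # mapping[i] is the index in word_list of the word that owns character i
    mapping = []
    string = separator.join(word_list)
    for idx, word in enumerate(word_list):
        # every character of the word points back to its list index
        for character in word:
            mapping.append(idx)
        # the separator that follows is attributed to the same word
        for character in separator:
            mapping.append(idx)
    return string, mapping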
Applying this method to my input via string, mapping = join(sentence) results in:
mapping = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 5, 5, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 8, 8, 8, 9, 9, 9, 9, 9]
Now, if the regex gives me 0 and 16 as the range of the match, I can look up the indexes in the original sentence list with mapping[0] = 0 and mapping[16] = 4. So far, this seems to work fairly well. And since I use a regex on a string to make the match, I can easily support alternative formulations for the conditional clause, e.g.:
CONDITIONAL_PHRASES = ['if', 'as long as', 'even if']
...
p = re.compile(r"((%s)\s+[a-zA-Z0-9'\s]+)\s*(then|,)" % '|'.join(CONDITIONAL_PHRASES))
Again, I'm not saying that the regex is already perfect, but it supports multiple sentences at once with different indicator words for conditional clauses.
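A minimal end-to-end sketch of this idea, assuming Python 3. Instead of one mapping entry per character it stores one start offset per word and looks positions up with bisect; that is a swapped-in variant of the mapping above, not the original code. The names token_starts, char_to_token and find_clauses are hypothetical, and the pattern is the regex from the question with \b and non-capturing groups added.

```python
import re
from bisect import bisect_right

CONDITIONAL_PHRASES = ['if', 'as long as', 'even if']

# Group 1 is the clause itself; the closing ',' or 'then' is left outside it.
CLAUSE = re.compile(
    r"(\b(?:%s)\s+[a-zA-Z0-9'\s]+)\s*(?:then|,)" % '|'.join(CONDITIONAL_PHRASES)
)


def token_starts(words, separator=' '):
    """Return the character offset at which each word starts in the joined string."""
    starts, offset = [], 0
    for word in words:
        starts.append(offset)
        offset += len(word) + len(separator)
    return starts


def char_to_token(starts, char_pos):
    """Map a character position to the index of the word that owns it.

    A separator character is attributed to the word before it.
    """
    return bisect_right(starts, char_pos) - 1


def find_clauses(words):
    """Return (start, stop) word-index pairs, usable as words[start:stop]."""
    s = ' '.join(words)
    starts = token_starts(words)
    spans = []
    for m in CLAUSE.finditer(s):
        first = char_to_token(starts, m.start(1))
        # m.end(1) is exclusive, so look up the last character of the group.
        last = char_to_token(starts, m.end(1) - 1)
        spans.append((first, last + 1))
    return spans


sentence = ['if', 'it', 'will', 'rain', ',', 'I', 'will', 'stay', 'at', 'home']
for start, stop in find_clauses(sentence):
    print(start, stop, sentence[start:stop])
```

Looking up m.end(1) - 1 rather than m.end(1) keeps the lookup inside the string even when the match ends at the last character, and the trailing whitespace that the character class swallows still maps to the last word of the clause.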
Upvotes: 2
Views: 697
Reputation: 22457
Switching to and from a regular expression makes it problematic, because you also have to switch your input to and from a string – and keep them synchronized.
How about a list comparison function in which you have a kind of an OR:
sentence = ['if', 'it', 'will', 'rain', ',', 'I', 'will', 'stay', 'at', 'home']
phrase = ['if', [',', 'then']]

def findPhrase(phrase, full):
    currentpos = 0
    result = []
    for part in phrase:
        if isinstance(part, list):
            # a list means OR: use whichever alternative occurs first
            partOffset = 999
            for subpart in part:
                if subpart in full[currentpos:]:
                    if full[currentpos:].index(subpart) < partOffset:
                        partOffset = full[currentpos:].index(subpart)
            if partOffset == 999:
                return []
            currentpos += partOffset
        else:
            if part not in full[currentpos:]:
                return []
            currentpos = currentpos + full[currentpos:].index(part)
        # the first hit is the start, every later hit moves the end
        if len(result) < 2:
            result.append(currentpos)
        else:
            result[-1] = currentpos
    # check for a single word match; should still return a range
    # .. just duplicate last item
    if len(result) == 1:
        result.append(result[0])
    return result

res = findPhrase(phrase, sentence)
if res == []:
    print('not found')
else:
    print(res)
    print(sentence[res[0]:res[1] + 1])
This compares the 'phrase' against the 'sentence', one word at a time, and returns [] if there is no match, and a [start, end] range (both inclusive) if there is.
The output of this is
[0, 4]
['if', 'it', 'will', 'rain', ',']
It is possible to extend the findPhrase function with items such as 'optional' and 'any match', but then you'd have to extend the simple array-based syntax to something like a dictionary.
Currently, the code skips from one found word to the next, ignoring anything in between. If you want to add an explicit '*' 'phrase' item, meaning "skip any number of words", you need to (1) test if it's the last item in the match phrase (if so, you can emit the last item of sentence), and/or (2) implement a separate lookahead-like function to check if the next item in phrase is present in sentence. (This comes pretty close to mimicking a regex parser.)
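The '*' idea can be sketched roughly as follows, assuming Python 3. This is not an extension of findPhrase itself: in this hypothetical matcher (match_at and find_phrase are made-up names) neighbouring pattern items must match neighbouring words, and only an explicit '*' skips words. The two numbered cases above appear as comments.

```python
def match_at(pattern, tokens, pos):
    """Try to match pattern against tokens starting exactly at pos.

    Returns the exclusive end index of the match, or None if it fails.
    """
    if not pattern:
        return pos
    head, rest = pattern[0], pattern[1:]
    if head == '*':
        if not rest:
            # (1) '*' is the last item: the match runs to the end of the sentence
            return len(tokens)
        # (2) lookahead: skip as few words as possible until the rest matches
        for skip_to in range(pos, len(tokens) + 1):
            end = match_at(rest, tokens, skip_to)
            if end is not None:
                return end
        return None
    if pos >= len(tokens):
        return None
    # a list is an OR over alternatives, a plain string is a single option
    options = head if isinstance(head, list) else [head]
    if tokens[pos] in options:
        return match_at(rest, tokens, pos + 1)
    return None


def find_phrase(pattern, tokens):
    """Return [start, end] (both inclusive) of the first match, or []."""
    if not pattern:
        return []
    for start in range(len(tokens)):
        end = match_at(pattern, tokens, start)
        if end is not None:
            return [start, end - 1]
    return []


sentence = ['if', 'it', 'will', 'rain', ',', 'I', 'will', 'stay', 'at', 'home']
print(find_phrase(['if', '*', [',', 'then']], sentence))
```

Because adjacency is required here, a multi-word indicator such as ['as', 'long', 'as', '*', [',', 'then']] can be written directly as consecutive items. One limitation of the sketch is that a literal '*' word in the sentence cannot be matched.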
Upvotes: 1
Reputation: 11032
NOTE: If there is only one occurrence of if and of , or then in sentence:
I have modified your regex a little bit to include one more capturing group
p = re.compile(r"((\bif\b)[a-zA-Z0-9'\s]+)\s*(,*)\s*(then|,)")
You can use re.findall for this as
arr = re.findall(p, s)
arr[0][1] contains the second capturing group (the string if) and arr[0][3] contains the fourth capturing group (the string then or ,). You can use index to find the positions of these two in the list as
start = sentence.index(arr[0][1])
end = sentence.index(arr[0][3])
Now, you can form the string using
stri = ' '.join(sentence[start: end])
NOTE 1: If there is more than one (non-overlapping) occurrence of if and of , or then in sentence, you will have to iterate over all tuples
arr = re.findall(p, s)
pos = 0  # index just past the previously matched closing token
for x in arr:
    start = sentence.index(x[1], pos)
    # look for the closing token only after the start of this clause
    end = sentence.index(x[3], start + 1)
    stri = ' '.join(sentence[start:end])
    print(stri)
    pos = end + 1
NOTE 2: Keep in mind that index raises a ValueError if the item is not found. Handle that case before doing the above.
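One way the ValueError could be handled, as a minimal sketch assuming Python 3: wrap both index calls in try/except and skip any match whose words cannot be located in the list. The function name find_clause_spans is hypothetical.

```python
import re

PATTERN = re.compile(r"((\bif\b)[a-zA-Z0-9'\s]+)\s*(,*)\s*(then|,)")


def find_clause_spans(sentence, pattern=PATTERN):
    """Return (start, end) word-index pairs, usable as sentence[start:end]."""
    s = ' '.join(sentence)
    spans = []
    pos = 0  # index just past the previously matched closing token
    for groups in pattern.findall(s):
        try:
            start = sentence.index(groups[1], pos)
            end = sentence.index(groups[3], start + 1)
        except ValueError:
            # The regex matched inside a longer token (e.g. 'what-if'),
            # so the word is not a list item of its own: skip this match.
            continue
        spans.append((start, end))
        pos = end + 1
    return spans


sentence = ['if', 'it', 'will', 'rain', ',', 'I', 'will', 'stay', 'at', 'home']
for start, end in find_clause_spans(sentence):
    print(' '.join(sentence[start:end]))
```

Skipping silently is only one possible policy; depending on the data it may be preferable to log or re-raise.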
Upvotes: 1