Reputation: 264
I have a list of multi-word strings, for example:
['President Barack', 'Barack Obama', 'New York', 'York City', 'United States', 'States of America', 'This is not overlapping']
I want to merge the overlapping strings to obtain something like this:
['President Barack Obama', 'New York City', 'United States of America', 'This is not overlapping']
I have tried this code taken from another similar question:
strFrag = ['President Barack', 'Barack Obama', 'New York', 'York City', 'United States', 'States of America', 'This is not overlapping']
for repeat in range(0, len(strFrag)-1):
    bestMatch = [2, '', ''] #overlap score (minimum value 3), otherStr index, assembled str portion
    for otherStr in strFrag[1:]:
        for x in range(0,len(otherStr)):
            if otherStr[x:] == strFrag[0][:len(otherStr[x:])]:
                if len(otherStr)-x > bestMatch[0]:
                    bestMatch = [len(otherStr)-x, strFrag.index(otherStr), otherStr[:x]+strFrag[0]]
            if otherStr[:-x] == strFrag[0][-len(otherStr[x:]):]:
                if x > bestMatch[0]:
                    bestMatch = [x, strFrag.index(otherStr), strFrag[0]+otherStr[-x:]]
    if bestMatch[0] > 2:
        strFrag[0] = bestMatch[2]
        strFrag = strFrag[:bestMatch[1]]+strFrag[bestMatch[1]+1:]
But it only works for the first item of the list, giving me this result:
['President Barack Obama', 'New York', 'York City', 'United States', 'States of America', 'This is not overlapping']
My question is: how would you solve it?
Thank you for your time!
EDIT: I only want adjacent items to be merged, so if I have ['President Barack', 'Some word', 'Other word', 'Barack Obama'], 'President Barack' and 'Barack Obama' should not be merged.
UPDATE: maybe something like this is correct, but if possible I would like your opinions on it:
strFrag = ['President Barack', 'Barack Obama', 'Obama of the USA', 'New York', 'York City', 'Test', 'Hello how', 'how you doin?']
for i in range(len(strFrag)):
    strFrag[i] = strFrag[i].split()
for i in range(len(strFrag)-1,-1,-1):
    if (strFrag[i][0] == strFrag[i-1][-1]):
        strFrag[i-1].remove(strFrag[i-1][-1])
        strFrag[i] = strFrag[i-1] + strFrag[i]
        strFrag.remove(strFrag[i-1])
for i in range(len(strFrag)):
    strFrag[i] = ' '.join(strFrag[i])
It gives me:
['President Barack Obama of the USA',
'New York City',
'Test',
'Hello how you doin?']
Upvotes: 1
Views: 108
Reputation: 49
I think I would solve it like this:
If you want to match more than one word at a time, e.g. 'the president barack' and 'president barack obama' becoming 'the president barack obama', I would use a loop that considers the next or previous word of each string if the first or last word already matches.
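A minimal sketch of that idea, not a definitive implementation: it assumes fragments are split on whitespace, that only neighbouring items may merge (as the EDIT in the question asks), and that the longest word overlap wins. The names `overlap_len` and `merge_adjacent` are made up for this illustration.

```python
def overlap_len(left, right):
    """Return how many trailing words of `left` equal the leading words of `right`."""
    # Try the longest possible overlap first, then shrink one word at a time.
    for k in range(min(len(left), len(right)), 0, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def merge_adjacent(fragments):
    """Merge each fragment into the previous result when their words overlap."""
    merged = []  # list of word lists built so far
    for fragment in fragments:
        words = fragment.split()
        if merged:
            k = overlap_len(merged[-1], words)
            if k:
                # Append only the words that are not already present.
                merged[-1].extend(words[k:])
                continue
        merged.append(words)
    return [' '.join(words) for words in merged]


fragments = ['President Barack', 'Barack Obama', 'New York', 'York City',
             'United States', 'States of America', 'This is not overlapping']
print(merge_adjacent(fragments))
```

Because each fragment is only compared with the item just before it, 'President Barack' and 'Barack Obama' stay separate when other items sit between them, and a chain such as 'President Barack', 'Barack Obama', 'Obama of the USA' collapses into one string.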
Hope I was clear enough and helped to solve your problem.
Upvotes: 1