Radix

Reputation: 264

Merge overlapping multiword strings in Python

I have a list of multiword strings, for example:

['President Barack', 'Barack Obama', 'New York', 'York City', 'United States', 'States of America', 'This is not overlapping']

I want to merge the overlapping entries to obtain something like this:

['President Barack Obama', 'New York City', 'United States of America', 'This is not overlapping']

I have tried this code taken from another similar question:

strFrag = ['President Barack', 'Barack Obama', 'New York', 'York City', 'United States', 'States of America', 'This is not overlapping']

for repeat in range(0, len(strFrag)-1):
    bestMatch = [2, '', ''] #overlap score (minimum value 3), otherStr index, assembled str portion
    for otherStr in strFrag[1:]:
        for x in range(0,len(otherStr)):
            if otherStr[x:] == strFrag[0][:len(otherStr[x:])]:
                if len(otherStr)-x > bestMatch[0]:
                    bestMatch = [len(otherStr)-x, strFrag.index(otherStr), otherStr[:x]+strFrag[0]]
            if otherStr[:-x] == strFrag[0][-len(otherStr[x:]):]:
                if x > bestMatch[0]:
                    bestMatch = [x, strFrag.index(otherStr), strFrag[0]+otherStr[-x:]]
    if bestMatch[0] > 2:
        strFrag[0] = bestMatch[2]
        strFrag = strFrag[:bestMatch[1]]+strFrag[bestMatch[1]+1:]

But it only works for the first element of the list, giving me this result:

['President Barack Obama', 'New York', 'York City', 'United States', 'States of America', 'This is not overlapping']

My question is: how would you solve it?

Thank you for your time!

EDIT: I only want adjacent entries to be merged, so if I have ['President Barack', 'Some word', 'Other word', 'Barack Obama'], 'President Barack' and 'Barack Obama' should not be merged.

UPDATE: Maybe something like this is correct, but if possible I would like your opinions on it:

strFrag = ['President Barack', 'Barack Obama', 'Obama of the USA', 'New York', 'York City', 'Test', 'Hello how', 'how you doin?']

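# Split each phrase into a list of words
strFrag = [s.split() for s in strFrag]

# Walk backwards and stop at index 1, so that i-1 never wraps around to the last element
for i in range(len(strFrag)-1, 0, -1):
    if strFrag[i][0] == strFrag[i-1][-1]:
        # Drop the shared word from the end of the previous phrase, then merge by position
        strFrag[i] = strFrag[i-1][:-1] + strFrag[i]
        del strFrag[i-1]

# Join the words back into strings
strFrag = [' '.join(words) for words in strFrag]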

It gives me:

['President Barack Obama of the USA',
 'New York City',
 'Test',
 'Hello how you doin?']

Upvotes: 1

Views: 108

Answers (1)

Marthattack

Reputation: 49

I think I would solve it like this:

  1. Split each string on spaces, e.g. 'York City' => ['York', 'City'].
  2. Check whether the first or last word of a string matches the last or first word (respectively) of another string.
  3. Merge the two strings if they match.
  4. Repeat until no more modifications are made.
  5. Rejoin the resulting word lists into strings with the join method.
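The steps above can be sketched roughly as follows. This is only one possible reading: it assumes, as in the question's EDIT, that only neighbouring entries may merge, and it only checks whether the last word of the previous entry equals the first word of the next one. The function name `merge_adjacent` is made up for this illustration. Because each new entry is compared against the already merged previous one, a single left-to-right pass also covers the "repeat" step for chains.

```python
def merge_adjacent(phrases):
    """Merge neighbouring phrases whose boundary words overlap.

    Assumption: only the last word of the previous phrase is compared
    with the first word of the next phrase (adjacent entries only).
    """
    merged = []  # list of word lists built so far
    for phrase in phrases:
        words = phrase.split()
        # Compare with the previous (possibly already merged) entry
        if merged and merged[-1] and words and merged[-1][-1] == words[0]:
            # Skip the shared word and append the rest
            merged[-1].extend(words[1:])
        else:
            merged.append(words)
    return [" ".join(words) for words in merged]


phrases = ['President Barack', 'Barack Obama', 'New York', 'York City',
           'United States', 'States of America', 'This is not overlapping']
print(merge_adjacent(phrases))
```

Building a new list instead of deleting from the input while looping avoids index shifts, and the input list is left untouched.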

If you want to match more than one word at a time, e.g. 'the president barack' and 'president barack obama' becoming 'the president barack obama', I would use a loop that considers the next or previous word of each string once the first or last word already matches.

I hope this was clear enough and helps you solve your problem.
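A minimal sketch of the multi-word case, under the same assumption that only neighbouring entries merge. Instead of growing the match one word at a time, it tries the longest possible overlap first and shrinks it, which is a slightly different loop from the one described above. The names `overlap_len` and `merge_adjacent_multi` are hypothetical.

```python
def overlap_len(left, right):
    """Return the largest k such that the last k words of `left`
    equal the first k words of `right` (0 if there is no overlap)."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def merge_adjacent_multi(phrases):
    """Merge neighbouring phrases that share one or more boundary words."""
    merged = []  # list of word lists built so far
    for phrase in phrases:
        words = phrase.split()
        k = overlap_len(merged[-1], words) if merged else 0
        if k:
            # Append only the words that are not part of the overlap
            merged[-1].extend(words[k:])
        else:
            merged.append(words)
    return [" ".join(words) for words in merged]


print(merge_adjacent_multi(['the president barack', 'president barack obama']))
```

Note that if the next phrase is entirely contained in the tail of the previous one, it is absorbed without adding any words.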

Upvotes: 1
