Jack Daniels
Jack Daniels

Reputation: 123

Merge Similar Strings Python

I'm having two strings

string1 = "apple banna kiwi mango"
string2 = "aple banana mango lemon"

I want the resultant of addition of these two strings (not concatenation) i.e. result should look like

result = "apple banana kiwi mango lemon"

My current approach is rather simple.

  1. Tokenize the multiline string (the above strings are after tokenization), remove any noises (special/ newline characters/ empty strings)
  2. The next step is to identify the cosine similarity of the strings, if it is above 0.9, then I add one of the string to final result

Now, here is the problem. It doesn't cover the part where one string contains one half of a word and other contains the other half (or correct word in some cases) of word. I have also added this function in my script. But again the problem remains. Any help on how to move forward with this is appreciated.

def text_to_vector(text):
     words = WORD.findall(text)
     return Counter(words)

def get_cosine(vec1, vec2):
     intersection = set(vec1.keys()) & set(vec2.keys())
     numerator = sum([vec1[x] * vec2[x] for x in intersection])

     sum1 = sum([vec1[x]**2 for x in vec1.keys()])
     sum2 = sum([vec2[x]**2 for x in vec2.keys()])
     denominator = math.sqrt(sum1) * math.sqrt(sum2)

     if not denominator:
        return 0.0
     else:
        return float(numerator) / denominator


def merge_string(string1, string2):
    i = 0
    while not string2.startswith(string1[i:]):
        i += 1

    sFinal = string1[:i] + string2
    return sFinal

for item in c:
for j in d:
    vec1 = text_to_vector(item)
    vec2 = text_to_vector(j)
    r = get_cosine(vec1, vec2)
    if r > 0.5:
        if r > 0.85:
            final.append(item)
            break
        else:
            sFinal = merge_string(item, j)
            #print("1.", len(sFinal), len(item), len(j))
            if len(sFinal) >= len(item) + len(j) -8:
                sFinal = merge_string(j, item)
                final.append(sFinal)
                #print("2.", len(sFinal), len(item), len(j))
                temp.append([item, j])
                break

Upvotes: 2

Views: 2849

Answers (2)

LetzerWille
LetzerWille

Reputation: 5658

The difficult part is to check if the word is a valid English word.

For this either you have to have a dictionary to check the word against, or use nltk.

     pip install nltk  

     from nltk.corpus import wordnet  

     set([w for w in (string1 + string2).split() if  wordnet.synsets(w)]) 

     Out[41]: {'apple', 'banana', 'kiwi', 'lemon', 'mango'}

To catch digits, if present, add isdigit().

st1 = 'Includes Og Added Sugars'

st2 = 'Includes 09 Added Sugars 09'


set([w for w in (st1 + st2).split() if  (wordnet.synsets(w) or w.isdigit())])

Out[30]: {'09', 'Added', 'Includes', 'Sugars'}

To catch abbreviations like g, mg add re.match().

set([w for w in (st1 + st2).split() if  (wordnet.synsets(w) or w.isdigit() or re.match(r'\d+g|mg',w))])

Out[40]: {'09', '0g', 'Added', 'Includes', 'Sugars'}

Upvotes: 4

Kurns
Kurns

Reputation: 157

Have you ever heard of Levenshtein's distance? I suggest the following algorithm:

  1. Split the lists into elements (string1.split(" "))

  2. Loop through list(string1). Inside it loop through list(string2) and if Levenshtein's distance for the two elements is say, less than 3, push the element to the result array.

  3. Return result.

for i in list(string1): for k in list(string2): if levenshtein(i,k) < 3: res.append(i)

Upvotes: 1

Related Questions