EmJ

Reputation: 4608

How to extract longest possible words from a string in python

I have a Python program as follows.

output = []
for sentence in sentences:
    sentence_tokens = []
    for item in selected_concepts:
        index = sentence.find(item)
        if index >= 0:
             sentence_tokens.append((index, item))
    sentence_tokens = [e[1] for e in sorted(sentence_tokens, key=lambda x: x[0])]
    output.append(sentence_tokens)

Given sentences and selected_concepts, I need to extract the longest word-level concept matches from selected_concepts, in the order in which they appear in each sentence. Both sentences and selected_concepts are preprocessed, so they do not contain any punctuation marks.

For example;

sentences = ["i love data mining and machine learning", "python is the best programming language", "the learning process of data mining and python are awesome"]

selected_concepts = ["learning", "python", "programming language", "d", "dat", "data", "data mining", "a m", "machine learning", "l"]

Current Output:

[['l', 'd', 'dat', 'data', 'data mining', 'a m', 'machine learning', 'learning'], ['python', 'programming language', 'l'], ['learning', 'l', 'd', 'dat', 'data', 'data mining', 'a m', 'python']]

I want the output to be;

[["data mining", "machine learning"], ["python", "programming language"], ["learning", "data mining", "python"]]

The issue with my current program is that it fails to distinguish between overlapping concepts such as d, dat, data, and data mining, and to return only data mining in the output.

I am not interested in using regex patterns, as they slowed down the process.

Please let me know if any further details are needed.

Upvotes: 2

Views: 220

Answers (2)

Grismar

Reputation: 31319

If I understand your problem correctly, you don't want to include any 'concepts' that are also part of a longer 'concept' you're already finding?

Regex can actually be very efficient and may prove faster than the solution you wrote. However, the solution as you shared it can be fixed by simply adding this line:

output = [[w1 for w1 in l if not any([w2 != w1 and w2.find(w1) >= 0 for w2 in l])] for l in output]
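As a quick sanity check (not part of the answer's code), applying that filter to the question's current output reproduces the desired result:

```python
# The question's current (unfiltered) output:
output = [['l', 'd', 'dat', 'data', 'data mining', 'a m', 'machine learning', 'learning'],
          ['python', 'programming language', 'l'],
          ['learning', 'l', 'd', 'dat', 'data', 'data mining', 'a m', 'python']]

# Keep only concepts that are not a substring of another concept in the same list:
output = [[w1 for w1 in l if not any([w2 != w1 and w2.find(w1) >= 0 for w2 in l])] for l in output]

print(output)
# [['data mining', 'machine learning'], ['python', 'programming language'], ['learning', 'data mining', 'python']]
```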

That's not very efficient though, since it will still find all the solutions and then run a fairly costly operation to remove all of the duplicates that were included in longer results.

Only sorting the list by length (using regex or otherwise) won't work, since sub-strings may be part of multiple longer strings and should still be found when they occur outside those longer strings. For example, if selected_concepts is something like ["big example", "small example", "example", "small", "big"], then running it on the sentence "this big example has a small solution for example" should still find ["big example", "small", "example"].

However, your code has some more problems, since it ignores your requirement that you're only looking for whole-word concepts. In your example, if you add "v" as a concept, it would be found (in love) and it would not be eliminated as part of another concept. Also, concepts that appear both by themselves and as part of a larger concept are eliminated by the line I provided.
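The whole-word problem is easy to reproduce: str.find does substring matching, not word matching, so a single-letter concept like "v" (a hypothetical extra concept, used only to illustrate) is "found" inside love:

```python
sentence = "i love data mining and machine learning"

# str.find matches inside words, so "v" is found in "love":
print(sentence.find("v"))        # 4 -- matched mid-word, not as a word
print("v" in sentence.split())   # False -- there is no whole word "v"
```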

A somewhat better, more complete solution (still without regex):

sentences = ["i love data mining and machine learning", "python is the best programming language",
             "the learning process of data mining and python are awesome"]
selected_concepts = ["learning", "python", "programming language", "d", "dat", "data", "data mining", "a m",
                     "machine learning", "l"]

split_sentences = [s.split() for s in sentences]
split_selected_concepts = [s.split() for s in sorted(selected_concepts, key=len, reverse=True)]

sentence_concepts = []
for s in split_sentences:
    concepts = []
    for c in split_selected_concepts:
        new_s = []
        i = 0
        while i < len(s):
            # if concept is found
            if s[i:i + len(c)] == c:
                # save it and skip it, so it isn't found again
                concepts.append((i, c))
                # keep blanks in new_s to ensure correct index for further results
                new_s.extend(len(c) * [None])
                i += len(c)
            else:
                # if the current word doesn't start this concept, keep it
                new_s.append(s[i])
                i += 1
        s = new_s
    # reorder the found concepts and turn the lists back into strings
    sentence_concepts.append([' '.join(x[1]) for x in sorted(concepts, key=lambda x: x[0])])

print(sentence_concepts)
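For reuse, the same greedy longest-first token matching can be wrapped in a function (a sketch; the name extract_concepts is this example's own, not from the question):

```python
def extract_concepts(sentence, concepts):
    # Longest concepts first, so e.g. "data mining" wins over "data".
    ordered = [c.split() for c in sorted(concepts, key=len, reverse=True)]
    s = sentence.split()
    found = []
    for c in ordered:
        new_s = []
        i = 0
        while i < len(s):
            if s[i:i + len(c)] == c:
                found.append((i, c))
                new_s.extend(len(c) * [None])  # blank out consumed words
                i += len(c)
            else:
                new_s.append(s[i])
                i += 1
        s = new_s
    # Restore sentence order and rejoin the word lists into strings.
    return [' '.join(c) for _, c in sorted(found, key=lambda x: x[0])]

print(extract_concepts("i love data mining and machine learning",
                       ["learning", "python", "programming language", "d", "dat",
                        "data", "data mining", "a m", "machine learning", "l"]))
# ['data mining', 'machine learning']
```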

Upvotes: 1

user3483203

Reputation: 51165

Regex works here. First, sort your list of concepts descending by length before turning it into a regular expression, since the re module doesn't support overlapping matches. Then when you use re.findall, the longest words will always be matched first.


import re

r = sorted(selected_concepts, key=len, reverse=True)
rgx = '|'.join([fr'\b{word}\b' for word in r])

[re.findall(rgx, sentence) for sentence in sentences]

[['data mining', 'machine learning'],
 ['python', 'programming language'],
 ['learning', 'data mining', 'python']]
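One caveat worth noting: the concepts here are plain words, but if a concept could ever contain a regex metacharacter (a . or +, say), escaping each one with re.escape keeps the pattern literal. A variant of the same approach:

```python
import re

sentences = ["i love data mining and machine learning"]
selected_concepts = ["learning", "python", "programming language", "d", "dat",
                     "data", "data mining", "a m", "machine learning", "l"]

r = sorted(selected_concepts, key=len, reverse=True)
# re.escape makes each concept safe to embed in the alternation
rgx = '|'.join(fr'\b{re.escape(word)}\b' for word in r)

print([re.findall(rgx, sentence) for sentence in sentences])
# [['data mining', 'machine learning']]
```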

Upvotes: 3
