Reputation: 4608
I have a Python program as follows:
output = []
for sentence in sentences:
    sentence_tokens = []
    for item in selected_concepts:
        index = sentence.find(item)
        if index >= 0:
            sentence_tokens.append((index, item))
    sentence_tokens = [e[1] for e in sorted(sentence_tokens, key=lambda x: x[0])]
    output.append(sentence_tokens)
Given the sentences and selected_concepts, I need to extract word-level longest concept matches from selected_concepts in the order in which they appear in the sentence. Both sentences and selected_concepts are preprocessed, so they do not contain any punctuation marks or other extraneous characters.
For example:
sentences = ["i love data mining and machine learning", "python is the best programming language", "the learning process of data mining and python are awesome"]
selected_concepts = ["learning", "python", "programming language", "d", "dat", "data", "data mining", "a m", "machine learning", "l"]
Current Output:
[['l', 'd', 'dat', 'data', 'data mining', 'a m', 'machine learning', 'learning'], ['python', 'programming language', 'l'], ['learning', 'l', 'd', 'dat', 'data', 'data mining', 'a m', 'python']]
I want the output to be;
[["data mining", "machine learning"], ["python", "programming language"], ["learning", "data mining", "python"]]
The issue in my current program is that it fails to distinguish between overlapping concepts such as d, dat, data, and data mining, and to obtain only data mining in the output.
I am not interested in using regex patterns, as they slowed down the process.
Please let me know if any further details are needed.
Upvotes: 2
Views: 220
Reputation: 31319
If I understand your problem correctly, you don't want to include any 'concepts' that are also part of a longer 'concept' you're already finding?
Regex can actually be very efficient and may prove faster than the solution you wrote. However, the solution you shared can be fixed by simply adding this line:
output = [[w1 for w1 in l if not any([w2 != w1 and w2.find(w1) >= 0 for w2 in l])] for l in output]
That's not very efficient though, since it will still find all the solutions and then run a fairly costly operation to remove all of the duplicates that were included in longer results.
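For instance, applying that filter to the first list in the question's current output (variable names here match the question's) drops every concept that is a substring of another concept in the same list:

```python
l = ['l', 'd', 'dat', 'data', 'data mining', 'a m',
     'machine learning', 'learning']

# keep w1 only if it is not a substring of any other concept in the list
filtered = [w1 for w1 in l
            if not any(w2 != w1 and w2.find(w1) >= 0 for w2 in l)]
print(filtered)  # ['data mining', 'machine learning']
```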
Only sorting the list by length (using regex or otherwise) won't work, since sub-strings may be part of multiple longer strings and should still be found if they appear outside those longer strings. For example, if selected_concepts is something like ["big example", "small example", "example", "small", "big"], then running it for the sentence "this big example has a small solution for example" should still find ["big example", "small", "example"].
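A quick sketch of that failure mode: on this example, the substring filter from above would incorrectly discard example even though it also occurs as a standalone word:

```python
# concepts actually present in
# "this big example has a small solution for example"
found = ['big example', 'small', 'example']

filtered = [w1 for w1 in found
            if not any(w2 != w1 and w2.find(w1) >= 0 for w2 in found)]
print(filtered)  # ['big example', 'small'] -- the standalone 'example' is lost
```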
However, your code has some more problems, since it ignores your requirement that you're only looking for whole-word concepts. In your example, if you add "v" as a concept, it would be found (in love) and it would not be eliminated as part of another concept. Also, concepts that appear both by themselves and as part of a larger concept are eliminated by the line I provided.
A somewhat better, more complete solution (still without regex):
sentences = ["i love data mining and machine learning",
             "python is the best programming language",
             "the learning process of data mining and python are awesome"]
selected_concepts = ["learning", "python", "programming language", "d", "dat", "data",
                     "data mining", "a m", "machine learning", "l"]

split_sentences = [s.split() for s in sentences]
# sort longest-first, so longer concepts are matched (and consumed) before their sub-strings
split_selected_concepts = [s.split() for s in sorted(selected_concepts, key=len, reverse=True)]

sentence_concepts = []
for s in split_sentences:
    concepts = []
    for c in split_selected_concepts:
        new_s = []
        i = 0
        while i < len(s):
            # if concept is found
            if s[i:i + len(c)] == c:
                # save it and skip it, so it isn't found again
                concepts.append((i, c))
                # keep blanks in new_s to ensure correct index for further results
                new_s.extend(len(c) * [None])
                i += len(c)
            else:
                # if the current word doesn't start this concept, keep it
                new_s.append(s[i])
                i += 1
        s = new_s
    # reorder the found concepts and turn the lists back into strings
    sentence_concepts.append([' '.join(x[1]) for x in sorted(concepts, key=lambda x: x[0])])

print(sentence_concepts)
# [['data mining', 'machine learning'], ['python', 'programming language'],
#  ['learning', 'data mining', 'python']]
Upvotes: 1
Reputation: 51165
Regex works here. First, sort your list of concepts descending by length before turning it into a regular expression, since the re module doesn't support overlapping matches. Then, when you use re.findall, the longest words will always be matched first.
import re
r = sorted(selected_concepts, key=len, reverse=True)
rgx = '|'.join([fr'\b{word}\b' for word in r])
[re.findall(rgx, sentence) for sentence in sentences]
[['data mining', 'machine learning'],
['python', 'programming language'],
['learning', 'data mining', 'python']]
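One caveat worth noting (my addition, not part of the original answer): the pattern above interpolates each concept verbatim, so a concept containing a regex metacharacter would break or distort the pattern. Wrapping each concept in re.escape keeps the approach safe; a minimal sketch using the question's data:

```python
import re

sentences = ["i love data mining and machine learning",
             "python is the best programming language",
             "the learning process of data mining and python are awesome"]
selected_concepts = ["learning", "python", "programming language", "d", "dat",
                     "data", "data mining", "a m", "machine learning", "l"]

# longest-first, so the alternation prefers the longest match at each position
r = sorted(selected_concepts, key=len, reverse=True)
# re.escape guards against concepts that contain regex metacharacters
rgx = '|'.join(fr'\b{re.escape(word)}\b' for word in r)
result = [re.findall(rgx, sentence) for sentence in sentences]
print(result)
# [['data mining', 'machine learning'], ['python', 'programming language'],
#  ['learning', 'data mining', 'python']]
```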
Upvotes: 3