EmJ

Reputation: 4608

How to identify substrings in the order of the string?

I have a list of sentences as below.

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use','data mining is the analysis step of the knowledge discovery in databases process or kdd']

I also have a set of selected concepts.

selected_concepts = ['machine learning','patterns','data mining','methods','database systems','interdisciplinary subfield','knowledege discovery','databases process','information','process']

Now I want to extract the concepts in selected_concepts from each sentence, in the order in which they appear in that sentence.

i.e. my output should be as follows.

output = [['data mining','process','patterns','methods','machine learning','database systems'],['data mining','interdisciplinary subfield','information'],['data mining','knowledge discovery','databases process']]

I could extract the concepts in the sentences as follows.

output = []
for sentence in sentences:
    sentence_tokens = []
    for item in selected_concepts:
        if item in sentence:
            sentence_tokens.append(item)
    output.append(sentence_tokens)

However, I am having trouble ordering the extracted concepts according to where they appear in the sentence. Is there an easy way to do this in Python?

Upvotes: 1

Views: 88

Answers (5)

himabindu

Reputation: 336

Here I used a simple re.findall call. If the pattern is matched in the string, re.findall returns a list of the matches; otherwise it returns an empty list. I wrote this code based on that:

import re

selected_concepts = ['machine learning','patterns','data mining','methods','database systems','interdisciplinary subfield','knowledege discovery','databases process','information','process']

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use','data mining is the analysis step of the knowledge discovery in databases process or kdd']

output = []

for sentence in sentences:
    matched_concepts = []
    for selected_concept in selected_concepts:
        if re.findall(selected_concept, sentence):
            matched_concepts.append(selected_concept)
    output.append(matched_concepts)
print(output)

Output:

[['machine learning', 'patterns', 'data mining', 'methods', 'database systems', 'process'], ['data mining', 'interdisciplinary subfield', 'information'], ['data mining', 'databases process', 'process']]
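The truthiness check in the loop relies on re.findall returning a list that is empty when nothing matches. A minimal sketch of that behaviour, using a shortened sample sentence; wrapping the concept in re.escape is an extra precaution assumed here, in case a concept ever contains regex metacharacters:

```python
import re

sentence = 'data mining is the process of discovering patterns'

# A concept that occurs in the sentence gives a non-empty (truthy) list.
hit = re.findall(re.escape('patterns'), sentence)
print(hit)   # ['patterns']

# A concept that does not occur gives an empty (falsy) list.
miss = re.findall(re.escape('methods'), sentence)
print(miss)  # []
```

Note that the loop still appends concepts in the order of selected_concepts, not in the order they occur in the sentence.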

Upvotes: 1

Amadan

Reputation: 198314

You can use the fact that regular expressions search text in order, left to right, and disallow overlaps:

import re
concept_re = re.compile(r'\b(?:' +
    '|'.join(re.escape(concept) for concept in selected_concepts) + r')\b')
output = [match
        for sentence in sentences for match in concept_re.findall(sentence)]

output
# => ['data mining', 'process', 'patterns', 'methods', 'machine learning', 'database systems', 'data mining', 'interdisciplinary subfield', 'information', 'information', 'data mining', 'databases process']

This should also be faster than searching for concepts individually, since the algorithm regexps use is more efficient for this, as well as being completely implemented in low-level code.

There is one difference though - if a concept repeats itself within one sentence, your code will only give one appearance per sentence, while this code outputs them all. If this is a meaningful difference, it is rather easy to dedupe a list.
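As a rough sketch of that last point, the same regex can be applied per sentence to get one list per sentence, with repeats dropped while keeping first-seen order. This assumes Python 3.7+ (where dict preserves insertion order) and uses shortened sample data rather than the full lists from the question:

```python
import re

# Shortened sample data, for illustration only.
sentences = [
    'data mining is the process of discovering patterns in large data sets',
    'extract information from a data set and transform the information',
]
selected_concepts = ['patterns', 'data mining', 'information', 'process']

concept_re = re.compile(
    r'\b(?:' + '|'.join(re.escape(c) for c in selected_concepts) + r')\b')

# One list per sentence; dict.fromkeys removes duplicates but keeps order.
output = [list(dict.fromkeys(concept_re.findall(sentence)))
          for sentence in sentences]

print(output)
# [['data mining', 'process', 'patterns'], ['information']]
```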

Upvotes: 1

Kevin Agusto

Reputation: 66

There is a built-in operator called "in". It checks whether one string occurs inside another string.

sentences = [
'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
'data mining is the analysis step of the knowledge discovery in databases process or kdd'
]

selected_concepts = [
 'machine learning',
 'patterns',
 'data mining',
 'methods','database systems',
 'interdisciplinary subfield','knowledege discovery',
 'databases process',
 'information',
 'process'
 ]

output = []  # prepare the output
for s in sentences:  # check each sentence
    output.append(list())  # add a new list to output, making it a nested list
    for c in selected_concepts:  # check every selected concept
        if c in s:  # if the concept occurs in the sentence
            output[-1].append(c)  # add it to the last list in output

print(output)

Upvotes: 1

Jonhasacat

Reputation: 140

You could use .find() and .insert() instead. Something like:

output = []
for sentence in sentences:
    sentence_tokens = []
    for item in selected_concepts:
        pos = sentence.find(item)
        if pos != -1:
            sentence_tokens.insert(pos, item)
    output.append(sentence_tokens)

The only problem would be overlap in selected_concepts, for example 'databases process' and 'process'. In this case, they would end up in the opposite order from the one they have in selected_concepts. You could potentially fix this with the following:

output = []
selected_concepts_multiplier = len(selected_concepts)
for sentence in sentences:
    sentence_tokens = []
    for k, item in enumerate(selected_concepts):
        pos = sentence.find(item)
        if pos != -1:
            sentence_tokens.insert((selected_concepts_multiplier * pos) + k, item)
    output.append(sentence_tokens)
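One caveat worth keeping in mind: list.insert() with an index past the end of the list simply appends, so inserting at a character position does not by itself guarantee sentence order. A minimal sketch of an alternative, assuming it is acceptable to collect (position, concept index, concept) tuples and sort them; the function name concepts_in_order is hypothetical:

```python
def concepts_in_order(sentence, concepts):
    """Return the concepts found in sentence, ordered by first position.

    Ties on position are broken by the concept's index in `concepts`.
    """
    found = []
    for k, item in enumerate(concepts):
        pos = sentence.find(item)
        if pos != -1:
            found.append((pos, k, item))
    found.sort()  # sorts by position first, then by concept index
    return [item for _, _, item in found]


sentence = ('data mining is the analysis step of the knowledge discovery '
            'in databases process or kdd')
result = concepts_in_order(
    sentence, ['process', 'databases process', 'data mining', 'data'])
print(result)
# ['data mining', 'data', 'databases process', 'process']
```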

Upvotes: 1

Selcuk

Reputation: 59184

One way to do it is to use the .find() method to get the position of each substring and then sort by that value. For example:

output = []
for sentence in sentences:
    sentence_tokens = []
    for item in selected_concepts:
        index = sentence.find(item)
        if index >= 0:
            sentence_tokens.append((index, item))
    sentence_tokens = [e[1] for e in sorted(sentence_tokens, key=lambda x: x[0])]
    output.append(sentence_tokens)
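Since .find() matches plain substrings, a concept such as 'process' would also be found inside 'processing'. If whole-word matches are wanted, a variant of the same sort-by-position idea could use re.search with word boundaries. This is only a sketch under that assumption, and the function name ordered_concepts is hypothetical:

```python
import re


def ordered_concepts(sentence, concepts):
    """Return whole-word concept matches, sorted by first position."""
    hits = []
    for concept in concepts:
        match = re.search(r'\b' + re.escape(concept) + r'\b', sentence)
        if match:
            hits.append((match.start(), concept))
    return [concept for _, concept in sorted(hits)]


result = ordered_concepts('processing data is not a process',
                          ['process', 'data'])
print(result)
# ['data', 'process']
```

With plain .find(), 'process' would be located at index 0 (inside 'processing') and would sort before 'data'.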

Upvotes: 1
