Reputation: 342
Scenario: I have tasks that are each performed for a "Section Header" (stored as a string), and the result of each task has to be saved against the matching "Existing Section Header" (also stored as a string).
While mapping, if a task's "Section Header" is one of the "Existing Section Headers", the task results are added to it. If not, the new Section Header is appended to the Existing Section Header list.
The Existing Section Header list looks like this:
[ "Activity (Last 3 Days)", "Activity (Last 7 days)", "Executable running from disk", "Actions from File"]
For the strings below, the expected behaviour is as follows:
"Activity (Last 30 Days)" - a new section should be added.
"Executables running from disk" - the existing "Executable running from disk" should be referred to [treating "Executables", with its extra "s", the same as "Executable"].
"Actions from a file" - the existing "Actions from File" should be referred to [ignoring the extra article "a"].
Is there any built-in function in Python that may help implement this logic? Any suggestion regarding an algorithm for this is highly appreciated.
Upvotes: 1
Views: 658
Reputation: 8709
Since you want to compare only the stem or "root word" of each word, I suggest using a stemming algorithm. Stemming algorithms attempt to automatically remove suffixes (and in some cases prefixes) in order to find the "root word" or stem of a given word. This is useful in various natural language processing scenarios, such as search. Luckily, there is a Python package for this, stemming. You can download it from here.
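To give a rough idea of what stemming does, here is a minimal sketch using only the standard library. The crude_stem function is a hypothetical helper written for illustration; it strips just a couple of plural suffixes and is not the Porter algorithm that the stemming package implements.

```python
def crude_stem(word):
    """Very rough stand-in for a stemmer (assumption: plural suffixes only)."""
    word = word.lower()
    # "activities" -> "activity"
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    # "executables" -> "executable", but leave words like "class" alone
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


print(crude_stem("Executables"))  # executable
print(crude_stem("Executable"))   # executable
print(crude_stem("Days"))         # day
```

Both "Executables" and "Executable" reduce to the same stem, which is what makes the two headers comparable. A real stemmer covers many more suffixes than this sketch.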
Next, you want to compare the strings without stop words (a, an, the, from, etc.), so you need to filter these words out before comparing. You can get a list of stop words from the internet, or you can use the nltk package to import a stop-word list. You can get nltk from here.
If there is any issue with nltk, here is the list of stop words:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself',
'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which',
'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be',
'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an',
'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for',
'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',
'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under',
'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all',
'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not',
'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don',
'should', 'now']
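As a small illustration of the filtering step, the sketch below uses a hand-picked subset of the list above instead of the full nltk list. The names STOP_WORDS and drop_stop_words are made up for this example.

```python
# Illustrative subset of the stop-word list above (assumption: enough for the demo).
STOP_WORDS = {"a", "an", "the", "from", "of", "in", "on"}


def drop_stop_words(text):
    """Return the lowercased words of `text` that are not stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


print(drop_stop_words("Actions from a file"))  # ['actions', 'file']
print(drop_stop_words("Actions from File"))    # ['actions', 'file']
```

With the stop words removed, the extra article "a" no longer makes the two headers differ.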
Now use this simple code to get your desired output:
from stemming.porter2 import stem
from nltk.corpus import stopwords
stopwords_ = stopwords.words('english')
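def addString(x):
    # `section` is the global list of existing section headers
    flag = True
    # stem every non-stop-word of the new header
    y = [stem(j).lower() for j in x.split() if j.lower() not in stopwords_]
    for i in section:
        i = [stem(j).lower() for j in i.split() if j.lower() not in stopwords_]
        if y==i:
            # an equivalent header already exists
            flag = False
            break
    if flag:
        section.append(x)
        print("\tNew Section Added")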
Demo:
>>> from stemming.porter2 import stem
>>> from nltk.corpus import stopwords
>>> stopwords_ = stopwords.words('english')
>>>
>>> def addString(x):
...     flag = True
...     y = [stem(j).lower() for j in x.split() if j.lower() not in stopwords_]
...     for i in section:
...         i = [stem(j).lower() for j in i.split() if j.lower() not in stopwords_]
...         if y==i:
...             flag = False
...             break
...     if flag:
...         section.append(x)
...         print("\tNew Section Added")
...
>>> section = [ "Activity (Last 3 Days)", "Activity (Last 7 days)", "Executable running from disk", "Actions from File"] # initial Section list
>>> addString("Activity (Last 30 Days)")
New Section Added
>>> addString("Executables running from disk")
>>> addString("Actions from a file")
>>> section
['Activity (Last 3 Days)', 'Activity (Last 7 days)', 'Executable running from disk', 'Actions from File', 'Activity (Last 30 Days)'] # Final section list
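Since the question asks for the existing header to be referred to, a variant that returns the matched header may be more convenient. The following is only a sketch under stated assumptions: it runs on Python 3 with the standard library alone, header_key and find_or_add are hypothetical names, the stop-word set is a small subset, and the plural stripping is a crude stand-in for a real stemmer.

```python
import re

STOP_WORDS = {"a", "an", "the", "from"}  # small illustrative subset


def header_key(header):
    """Normalize a header into a tuple of crude word stems."""
    stems = []
    for w in re.findall(r"[a-z0-9]+", header.lower()):
        if w in STOP_WORDS:
            continue
        # Crude plural stripping; a real stemmer would handle more cases.
        if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
            w = w[:-1]
        stems.append(w)
    return tuple(stems)


def find_or_add(sections, header):
    """Return the matching existing header, appending `header` if none matches."""
    key = header_key(header)
    for existing in sections:
        if header_key(existing) == key:
            return existing
    sections.append(header)
    return header


sections = ["Activity (Last 3 Days)", "Activity (Last 7 days)",
            "Executable running from disk", "Actions from File"]
print(find_or_add(sections, "Executables running from disk"))  # Executable running from disk
print(find_or_add(sections, "Actions from a file"))            # Actions from File
print(find_or_add(sections, "Activity (Last 30 Days)"))        # Activity (Last 30 Days)
```

The caller can then store the task result under whatever header find_or_add returns.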
Upvotes: 0
Reputation: 3113
This is a case where you may find regular expressions helpful. You can use re.sub() to find specific substrings and replace them. It searches for non-overlapping matches of a regular expression and replaces them with the specified string.
import re #this will allow you to use regular expressions
def modifyHeader(header):
    #change the number of days to 30 (parentheses are escaped so they match literally)
    modifiedHeader = re.sub(r"Activity \(Last \d+ Days?\)", "Activity (Last 30 Days)", header)
    #add an s to "executable"
    modifiedHeader = re.sub(r"Executable running from disk", "Executables running from disk", modifiedHeader)
    #add "a"
    modifiedHeader = re.sub(r"Actions from File", "Actions from a file", modifiedHeader)
    return modifiedHeader
The r"" prefix marks raw strings, which make it a bit easier to deal with the \ characters needed for regular expressions; \d matches any digit character, and + means "1 or more". Read the page I linked above for more information.
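One detail worth noting: an unescaped ( in a pattern opens a group rather than matching a literal parenthesis, so headers like "Activity (Last 3 Days)" need \( and \) in the pattern. The sketch below shows \d+ with escaped parentheses; day_count is a hypothetical helper, and the re.IGNORECASE flag is an assumption added so that "days" and "Days" both match.

```python
import re

# \( and \) match literal parentheses; (\d+) captures one or more digits.
PATTERN = re.compile(r"Activity \(Last (\d+) Days?\)", re.IGNORECASE)


def day_count(header):
    """Return the number of days in an "Activity (Last N Days)" header, else None."""
    m = PATTERN.fullmatch(header)
    return int(m.group(1)) if m else None


print(day_count("Activity (Last 30 Days)"))  # 30
print(day_count("Activity (Last 7 days)"))   # 7
print(day_count("Actions from File"))        # None

# re.sub replaces every non-overlapping match of the pattern.
print(re.sub(r"\d+", "30", "Activity (Last 3 Days)"))  # Activity (Last 30 Days)
```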
Upvotes: 1