Reputation: 2853
I am given a set of substrings. I need to find the count of occurrence of all those substrings in a particular column in a dataframe. The relevant datframe would look like this
training['concat']
0 svAxu$paxArWAn
1 xvAxaSa$varRANi
2 AxAna$xurbale
3 go$BakwAH
4 viXi$Bexena
5 nIwi$kuSalaM
6 lafkA$upamam
7 yaSas$lipsoH
8 kaSa$AGAwam
9 hewumaw$uwwaram
10 varRa$pUgAn
My set of substrings is a dictionary, where the keys are the substrings and values are the probabilities with which they occur
reg = {'anuBavAn':0.35, 'a$piwra':0.2 ...... 'piwra':0.7, 'pa':0.03, 'a':0.0005}
#The length of dicitioanry is 2000
Particularly I need to find those substrings which occur more than twice
I have written the following code that performs the task. Is there a more elegant pythonic way or panda specific way to achieve the same as the current implementation is taking quite some time to execute.
elites = dict()
for reg_pat in reg_:
count = 0
eliter = len(training[training['concat'].str.contains(reg_pat)]['concat'])
if eliter >=3:
elites[reg_pat] = reg_[reg_pat]
Upvotes: 3
Views: 1098
Reputation: 637
Hopefully I have interpreted your question correctly. I'm inclined to stay away from regex here (in fact, I've never used it in conjunction with pandas), but it's not wrong, strictly speaking. In any case, I find it hard to believe that any regex operations are faster than a simple in
check, but I could be wrong on that.
for substr in reg:
totalStringAppearances = training.apply((lambda string: substr in string))
totalStringAppearances = totalStringAppearances.sum()
if totalStringAppearances > 2:
reg[substr] = totalStringAppearances / len(training)
else:
# do what you want to with the very rare substrings
Some gotchas:
apply()
, I am using a potentially unreliable exploitation of booleans. See this question for more details.Post-edit: Jezrael's answer is more complete as it uses the same variable names. But, in a simple case, regarding regex vs. apply
and in
, I validate his claim, and my presumption:
Upvotes: 2
Reputation: 862681
You can use apply
instead str.contains
, it is faster:
reg_ = {'anuBavAn':0.35, 'a$piwra':0.2, 'piwra':0.7, 'pa':0.03, 'a':0.0005}
elites = dict()
for reg_pat in reg_:
if training['concat'].apply(lambda x: reg_pat in x).sum() >= 3:
elites[reg_pat] = reg_[reg_pat]
print (elites)
{'a': 0.0005}
Upvotes: 2