Amrith Krishna
Amrith Krishna

Reputation: 2853

Finding the count of a set of substrings in pandas dataframe

I am given a set of substrings. I need to find the count of occurrence of all those substrings in a particular column in a dataframe. The relevant datframe would look like this

  training['concat']

  0 svAxu$paxArWAn
  1 xvAxaSa$varRANi
  2 AxAna$xurbale
  3 go$BakwAH
  4 viXi$Bexena
  5 nIwi$kuSalaM
  6 lafkA$upamam
  7 yaSas$lipsoH
  8 kaSa$AGAwam
  9 hewumaw$uwwaram
  10 varRa$pUgAn

My set of substrings is a dictionary, where the keys are the substrings and values are the probabilities with which they occur

  reg = {'anuBavAn':0.35, 'a$piwra':0.2 ...... 'piwra':0.7, 'pa':0.03, 'a':0.0005}
  #The length of dicitioanry is 2000

Particularly I need to find those substrings which occur more than twice

I have written the following code that performs the task. Is there a more elegant pythonic way or panda specific way to achieve the same as the current implementation is taking quite some time to execute.

  elites = dict()
  for reg_pat in reg_:
  count = 0
  eliter = len(training[training['concat'].str.contains(reg_pat)]['concat'])
  if eliter >=3:
  elites[reg_pat] = reg_[reg_pat]

Upvotes: 3

Views: 1098

Answers (2)

StarFox
StarFox

Reputation: 637

Hopefully I have interpreted your question correctly. I'm inclined to stay away from regex here (in fact, I've never used it in conjunction with pandas), but it's not wrong, strictly speaking. In any case, I find it hard to believe that any regex operations are faster than a simple in check, but I could be wrong on that.

for substr in reg:
    totalStringAppearances = training.apply((lambda string: substr in string))
    totalStringAppearances = totalStringAppearances.sum()
    if totalStringAppearances > 2:
        reg[substr] = totalStringAppearances / len(training)
    else:
        # do what you want to with the very rare substrings

Some gotchas:

  • If you wanted something like a substring 'a' in 'abcdefa' to return 2, then this will not work. It merely checks for existence of the substring in each string.
  • Inside the apply(), I am using a potentially unreliable exploitation of booleans. See this question for more details.

Post-edit: Jezrael's answer is more complete as it uses the same variable names. But, in a simple case, regarding regex vs. apply and in, I validate his claim, and my presumption:

enter image description here

Upvotes: 2

jezrael
jezrael

Reputation: 862681

You can use apply instead str.contains, it is faster:

reg_ = {'anuBavAn':0.35, 'a$piwra':0.2, 'piwra':0.7, 'pa':0.03, 'a':0.0005}

elites = dict()
for reg_pat in reg_:
  if training['concat'].apply(lambda x: reg_pat in x).sum() >= 3:
      elites[reg_pat] = reg_[reg_pat]

print (elites)
{'a': 0.0005}

Upvotes: 2

Related Questions