Finding the count of a set of substrings in pandas dataframe

Question

I am given a set of substrings and need to count how often each of them occurs in a particular column of a dataframe. The relevant dataframe column looks like this:

  training['concat']

  0 svAxu$paxArWAn
  1 xvAxaSa$varRANi
  2 AxAna$xurbale
  3 go$BakwAH
  4 viXi$Bexena
  5 nIwi$kuSalaM
  6 lafkA$upamam
  7 yaSas$lipsoH
  8 kaSa$AGAwam
  9 hewumaw$uwwaram
  10 varRa$pUgAn
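
To make the snippets in this thread reproducible, the sample column above can be rebuilt as a small dataframe. This is only a sketch: it assumes the column holds plain Python strings and contains just the eleven rows shown, whereas the real `training` frame is presumably much larger.

```python
import pandas as pd

# The eleven sample values listed above; the real column is assumed to be longer.
training = pd.DataFrame({
    "concat": [
        "svAxu$paxArWAn",
        "xvAxaSa$varRANi",
        "AxAna$xurbale",
        "go$BakwAH",
        "viXi$Bexena",
        "nIwi$kuSalaM",
        "lafkA$upamam",
        "yaSas$lipsoH",
        "kaSa$AGAwam",
        "hewumaw$uwwaram",
        "varRa$pUgAn",
    ]
})

print(training["concat"])
```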

My set of substrings is a dictionary, where the keys are the substrings and the values are the probabilities with which they occur:

  reg_ = {'anuBavAn':0.35, 'a$piwra':0.2 ...... 'piwra':0.7, 'pa':0.03, 'a':0.0005}
  # The dictionary has 2000 entries

In particular, I need to find the substrings that occur more than twice.

I have written the following code, which performs the task. Is there a more elegant, Pythonic or pandas-specific way to achieve the same result? The current implementation takes quite some time to execute.

  elites = dict()
  for reg_pat in reg_:
      # Number of rows whose 'concat' value contains the pattern
      eliter = len(training[training['concat'].str.contains(reg_pat)]['concat'])
      if eliter >= 3:
          elites[reg_pat] = reg_[reg_pat]

jezrael · Accepted Answer

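You can use apply with the in operator instead of str.contains; it is faster. Note that in performs a literal substring test, whereas str.contains treats the pattern as a regular expression by default, so a pattern containing $ is matched differently by the two approaches: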

  reg_ = {'anuBavAn':0.35, 'a$piwra':0.2, 'piwra':0.7, 'pa':0.03, 'a':0.0005}

  elites = dict()
  for reg_pat in reg_:
      if training['concat'].apply(lambda x: reg_pat in x).sum() >= 3:
          elites[reg_pat] = reg_[reg_pat]

  print(elites)
  # {'a': 0.0005}
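
If you would rather stay with the string accessor, a possible variant is to pass `regex=False` to `str.contains`, so that `$` is matched literally, and to build the result with a dict comprehension. This is a sketch under assumptions, not a benchmarked replacement: it uses only the eleven sample rows and the five patterns written out above (the elided dictionary entries are not reconstructed), and `row_count` is a hypothetical helper name. Like the loop above, it counts the rows that contain a pattern, not the total number of occurrences.

```python
import pandas as pd

# Sample rows from the question; the real column is assumed to be much longer.
training = pd.DataFrame({
    "concat": [
        "svAxu$paxArWAn", "xvAxaSa$varRANi", "AxAna$xurbale", "go$BakwAH",
        "viXi$Bexena", "nIwi$kuSalaM", "lafkA$upamam", "yaSas$lipsoH",
        "kaSa$AGAwam", "hewumaw$uwwaram", "varRa$pUgAn",
    ]
})

# Only the patterns spelled out in the question.
reg_ = {'anuBavAn': 0.35, 'a$piwra': 0.2, 'piwra': 0.7, 'pa': 0.03, 'a': 0.0005}

col = training["concat"]


def row_count(pattern):
    """Number of rows containing `pattern` as a literal substring (hypothetical helper)."""
    # regex=False makes "$" an ordinary character instead of an end-of-string anchor.
    return int(col.str.contains(pattern, regex=False).sum())


# Keep every pattern found in at least three rows.
elites = {pat: prob for pat, prob in reg_.items() if row_count(pat) >= 3}

print(elites)
```

Which of the two variants is faster depends on the data, so it is worth timing both on the full column before choosing one.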
