Counting Words in a Column in a DataFrame

Question

I have a DataFrame like the one shown below:

DF1 =

 sID   token     A  B  C  D
  10    I am     a  f  g  h
  10    here     a  g  g  h
  10    whats    a  h  g  h
  10    going    a  o  g  h
  10    on       a  j  g  h
  10    .        a  f  g  h
  11    I am     a  f  g  h
  11    foo bar  a  f  g  h
  12    You are  a  f  g  h
  ...

The columns A-D don't matter for this task. Is there a way to add a counter column to the DataFrame that counts the words (delimited by whitespace) in token? The counter should run over the words within each sID, meaning it resets every time the value of sID changes.

Usually I would just use DF.groupby("sID").cumcount(), but this only counts the number of rows for each sID.

The result should look like this:

DF2 =

 sID   token     A  B  C  D   Counter
  10    I am     a  f  g  h    0 1
  10    here     a  g  g  h    2
  10    whats    a  h  g  h    3
  10    going    a  o  g  h    4
  10    on       a  j  g  h    5
  10    .        a  f  g  h    6
  11    I am     a  f  g  h    0 1
  11    foo bar  a  f  g  h    2 3
  12    You are  a  f  g  h    0 1
  ...

Ben.T · Accepted Answer

Before using groupby("sID").cumcount(), you need to split the tokens into individual words while keeping track of which row each word belongs to. You can then create the 'Counter' column like this:

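df['Counter'] = (df.set_index('sID', append=True)['token']  # keep sID next to the row index
                   .str.split(' ', expand=True).stack()     # one entry per word
                   .groupby('sID').cumcount()               # running word number within each sID
                   .groupby(level=0).apply(lambda x: ' '.join(str(i) for i in x)))  # rejoin per original row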

This gives the expected output.
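
For anyone who wants to try the idea end to end, here is a minimal, self-contained sketch of the same split, count, and rejoin approach. It is a variant, not the exact code above: it assumes a pandas version that has DataFrame.explode (0.25 or later), it uses str.split() with no argument so that any run of whitespace separates words, and it rebuilds only the sID and token columns from the question's sample. The names word, exploded, and pos are just illustrative.

```python
import pandas as pd

# Sample data from the question (columns A-D omitted, they play no role here)
df = pd.DataFrame({
    "sID": [10, 10, 10, 10, 10, 10, 11, 11, 12],
    "token": ["I am", "here", "whats", "going", "on", ".",
              "I am", "foo bar", "You are"],
})

# One row per word; explode repeats the original row index for each word
exploded = df.assign(word=df["token"].str.split()).explode("word")

# Running word number within each sID (restarts at 0 when sID changes)
pos = exploded.groupby("sID").cumcount()

# Collapse back to one string per original row and align on the index
df["Counter"] = pos.astype(str).groupby(level=0).agg(" ".join)

print(df)
```

Because the repeated row index survives the explode step, groupby(level=0) is enough to put the per-word numbers back on the row they came from.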
