Mi.

Reputation: 510

Counting Words in a Column in a DataFrame

I have a DataFrame like the one shown below:

DF1 =

 sID   token     A  B  C  D
  10    I am     a  f  g  h
  10    here     a  g  g  h
  10    whats    a  h  g  h
  10    going    a  o  g  h
  10    on       a  j  g  h
  10    .        a  f  g  h
  11    I am     a  f  g  h
  11    foo bar  a  f  g  h
  12    You are  a  f  g  h
  ...

Columns A-D don't matter for this task. Is there a way to add a counter column to the DataFrame that counts the words (delimited by whitespace)? The column should number the tokens within each sID, meaning it resets every time the value of sID changes.

Usually I would just use DF.groupby("sID").cumcount(), but this only counts the number of rows for each sID.
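
For illustration, here is a minimal sketch on a shortened version of the data above that shows the row-level behaviour. The name row_counter is only illustrative and not part of the original frame.

```python
import pandas as pd

# Shortened version of DF1 (columns A-D left out, since they don't matter here).
df = pd.DataFrame({'sID': [10, 10, 10, 11, 11, 12],
                   'token': ['I am', 'here', 'whats', 'I am', 'foo bar', 'You are']})

# cumcount numbers the rows within each sID, regardless of how many
# words each token cell contains.
row_counter = df.groupby('sID').cumcount()
print(row_counter.tolist())  # [0, 1, 2, 0, 1, 0]
```

Note that "I am" gets a single number (0) here, whereas the desired output would give it "0 1".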

The result should look like this:

DF2 =

 sID   token     A  B  C  D   Counter
  10    I am     a  f  g  h    0 1
  10    here     a  g  g  h    2
  10    whats    a  h  g  h    3
  10    going    a  o  g  h    4
  10    on       a  j  g  h    5
  10    .        a  f  g  h    6
  11    I am     a  f  g  h    0 1
  11    foo bar  a  f  g  h    2 3
  12    You are  a  f  g  h    0 1
  ...

Upvotes: 1

Views: 1523

Answers (2)

jpp

Reputation: 164653

Using groupby + itertools:

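from itertools import chain, count

import pandas as pd

df = pd.DataFrame({'sID': [10, 10, 10, 10, 10, 10, 11, 11, 12],
                   'token': ['I am', 'here', 'whats', 'going',
                             'on', '.', 'I am', 'foo bar', 'You are']})

def counter(df):
    # Yield one list of label strings per sID group.
    for k, g in df.groupby('sID')['token']:
        c = count()                        # fresh counter for each sID
        lens = g.str.split().map(len)      # number of words in each row
        # Take n consecutive numbers from the counter for a row with n words.
        yield [' '.join([str(next(c)) for _ in range(n)]) for n in lens]

# Flatten the per-group lists back into a single column.
# This relies on the rows already being ordered by sID, as they are here.
df['Counts'] = list(chain.from_iterable(counter(df)))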

Result

print(df)

   sID    token Counts
0   10     I am    0 1
1   10     here      2
2   10    whats      3
3   10    going      4
4   10       on      5
5   10        .      6
6   11     I am    0 1
7   11  foo bar    2 3
8   12  You are    0 1

Explanation

  • Initialise an itertools.count counter for each group.
  • Split by whitespace and count number of words via str.split and len.
  • Use a nested list comprehension for each group to recover counts.
  • Chain result using itertools.chain.
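
The first three steps can be sketched in isolation for a single group. This is only an illustration: label_group is a hypothetical helper name, and the three-row Series stands in for one sID group.

```python
from itertools import count

import pandas as pd


def label_group(tokens):
    """Return one space-joined label string per row of a single group."""
    c = count()                           # fresh counter, starts at 0
    lens = tokens.str.split().map(len)    # words per row
    # Draw n consecutive numbers from the counter for a row with n words.
    return [' '.join(str(next(c)) for _ in range(n)) for n in lens]


group = pd.Series(['I am', 'foo bar', 'here'])
labels = label_group(group)
print(labels)  # ['0 1', '2 3', '4']
```

Because the counter is created inside the function, each call starts again at 0, which is what makes the numbering reset per sID.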

Upvotes: 3

Ben.T

Reputation: 29635

Before using groupby("sID").cumcount(), you need to reshape the data so that each word, once split, keeps track of which row it belongs to. You can then create the 'Counter' column like this:

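# Split tokens into words, stack them into one row per word (keeping sID
# in the index), number the words within each sID, then join the numbers
# back together per original row (index level 0).
df['Counter'] = (df.set_index('sID', append=True)['token']
                   .str.split(' ', expand=True).stack()
                   .groupby('sID').cumcount()
                   .groupby(level=0).apply(lambda x: ' '.join([str(i) for i in x])))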

This gives the expected output.
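
As a possible variant of the same idea, the sketch below uses DataFrame.explode instead of expand=True plus stack. It assumes a pandas version that provides explode (0.25 or later); the names words and position are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'sID': [10, 10, 10, 10, 10, 10, 11, 11, 12],
                   'token': ['I am', 'here', 'whats', 'going',
                             'on', '.', 'I am', 'foo bar', 'You are']})

# One row per word; the original row index is repeated for multi-word tokens.
words = df.assign(word=df['token'].str.split()).explode('word')

# Number the words within each sID, as strings so they can be joined.
position = words.groupby('sID').cumcount().astype(str)

# Collapse back to one entry per original row (index level 0).
df['Counter'] = position.groupby(level=0).agg(' '.join)
print(df)
```

Since explode never creates padding cells for shorter rows, this version does not depend on missing values being dropped along the way.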

Upvotes: 3
