Counting occurrence of a word in a column of a tsv file using python

Question

Question from a python beginner! I have a tsv file looking like this:

WHI5    YOR083W CDC28   YBR160W physical interactions   19823668
WHI5    YOR083W CDC28   YBR160W physical interactions   21658602
WHI5    YOR083W CDC28   YBR160W physical interactions   24186061
WHI5    YOR083W RPD3    YNL330C physical interactions   19823668
WHI5    YOR083W SWI4    YER111C physical interactions   15210110
WHI5    YOR083W SWI4    YER111C physical interactions   15210111

I would like to count all the lines containing the same word in row[3], and only output the first one with the number of occurrence in a new column.

WHI5    YOR083W CDC28   YBR160W physical interactions   19823668    3
WHI5    YOR083W RPD3    YNL330C physical interactions   19823668    1
WHI5    YOR083W SWI4    YER111C physical interactions   15210110    2

So far I tried a combination of 'csv' and 'Counter' or 'pandas' and 'Counter' without success...

behzad.nouri · Accepted Answer

using pandas:

>>> import pandas as pd
>>> from io import BytesIO
>>> df = pd.read_table(BytesIO("""\
... col1 col2 col3 col4 col5 col6
... WHI5    YOR083W CDC28   YBR160W "physical interactions"   19823668
... WHI5    YOR083W CDC28   YBR160W "physical interactions"   21658602
... WHI5    YOR083W CDC28   YBR160W "physical interactions"   24186061
... WHI5    YOR083W RPD3    YNL330C "physical interactions"   19823668
... WHI5    YOR083W SWI4    YER111C "physical interactions"   15210110
... WHI5    YOR083W SWI4    YER111C "physical interactions"   15210111"""),
... delim_whitespace=True)

the pandas data-frame will look like:

>>> df
   col1     col2   col3     col4                   col5      col6
0  WHI5  YOR083W  CDC28  YBR160W  physical interactions  19823668
1  WHI5  YOR083W  CDC28  YBR160W  physical interactions  21658602
2  WHI5  YOR083W  CDC28  YBR160W  physical interactions  24186061
3  WHI5  YOR083W   RPD3  YNL330C  physical interactions  19823668
4  WHI5  YOR083W   SWI4  YER111C  physical interactions  15210110
5  WHI5  YOR083W   SWI4  YER111C  physical interactions  15210111

[6 rows x 6 columns]

to get the count, group by col3 and take the length of each group:

>>> df['cnt'] = df.groupby('col3')['col3'].transform(len)
>>> df
   col1     col2   col3     col4                   col5      col6 cnt
0  WHI5  YOR083W  CDC28  YBR160W  physical interactions  19823668   3
1  WHI5  YOR083W  CDC28  YBR160W  physical interactions  21658602   3
2  WHI5  YOR083W  CDC28  YBR160W  physical interactions  24186061   3
3  WHI5  YOR083W   RPD3  YNL330C  physical interactions  19823668   1
4  WHI5  YOR083W   SWI4  YER111C  physical interactions  15210110   2
5  WHI5  YOR083W   SWI4  YER111C  physical interactions  15210111   2

[6 rows x 7 columns]

to pick the first of each group:

>>> df.groupby('col3').apply(lambda obj: obj.head(n=1))
         col1     col2   col3     col4                   col5      col6 cnt
col3
CDC28 0  WHI5  YOR083W  CDC28  YBR160W  physical interactions  19823668   3
RPD3  3  WHI5  YOR083W   RPD3  YNL330C  physical interactions  19823668   1
SWI4  4  WHI5  YOR083W   SWI4  YER111C  physical interactions  15210110   2

[3 rows x 7 columns]

Counting occurrence of a word in a column of a tsv file using python

Answers (1)

Related Questions