Reputation: 13
Question from a python beginner! I have a tsv file looking like this:
WHI5 YOR083W CDC28 YBR160W physical interactions 19823668
WHI5 YOR083W CDC28 YBR160W physical interactions 21658602
WHI5 YOR083W CDC28 YBR160W physical interactions 24186061
WHI5 YOR083W RPD3 YNL330C physical interactions 19823668
WHI5 YOR083W SWI4 YER111C physical interactions 15210110
WHI5 YOR083W SWI4 YER111C physical interactions 15210111
I would like to count all the lines containing the same word in row[3], and only output the first one with the number of occurrence in a new column.
WHI5 YOR083W CDC28 YBR160W physical interactions 19823668 3
WHI5 YOR083W RPD3 YNL330C physical interactions 19823668 1
WHI5 YOR083W SWI4 YER111C physical interactions 15210110 2
So far I tried a combination of 'csv' and 'Counter' or 'pandas' and 'Counter' without success...
Upvotes: 1
Views: 995
Reputation: 77941
using pandas:
>>> import pandas as pd
>>> from io import BytesIO
>>> df = pd.read_table(BytesIO("""\
... col1 col2 col3 col4 col5 col6
... WHI5 YOR083W CDC28 YBR160W "physical interactions" 19823668
... WHI5 YOR083W CDC28 YBR160W "physical interactions" 21658602
... WHI5 YOR083W CDC28 YBR160W "physical interactions" 24186061
... WHI5 YOR083W RPD3 YNL330C "physical interactions" 19823668
... WHI5 YOR083W SWI4 YER111C "physical interactions" 15210110
... WHI5 YOR083W SWI4 YER111C "physical interactions" 15210111"""),
... delim_whitespace=True)
the pandas data-frame will look like:
>>> df
col1 col2 col3 col4 col5 col6
0 WHI5 YOR083W CDC28 YBR160W physical interactions 19823668
1 WHI5 YOR083W CDC28 YBR160W physical interactions 21658602
2 WHI5 YOR083W CDC28 YBR160W physical interactions 24186061
3 WHI5 YOR083W RPD3 YNL330C physical interactions 19823668
4 WHI5 YOR083W SWI4 YER111C physical interactions 15210110
5 WHI5 YOR083W SWI4 YER111C physical interactions 15210111
[6 rows x 6 columns]
to get the count, group by col3
and take the length of each group:
>>> df['cnt'] = df.groupby('col3')['col3'].transform(len)
>>> df
col1 col2 col3 col4 col5 col6 cnt
0 WHI5 YOR083W CDC28 YBR160W physical interactions 19823668 3
1 WHI5 YOR083W CDC28 YBR160W physical interactions 21658602 3
2 WHI5 YOR083W CDC28 YBR160W physical interactions 24186061 3
3 WHI5 YOR083W RPD3 YNL330C physical interactions 19823668 1
4 WHI5 YOR083W SWI4 YER111C physical interactions 15210110 2
5 WHI5 YOR083W SWI4 YER111C physical interactions 15210111 2
[6 rows x 7 columns]
to pick the first of each group:
>>> df.groupby('col3').apply(lambda obj: obj.head(n=1))
col1 col2 col3 col4 col5 col6 cnt
col3
CDC28 0 WHI5 YOR083W CDC28 YBR160W physical interactions 19823668 3
RPD3 3 WHI5 YOR083W RPD3 YNL330C physical interactions 19823668 1
SWI4 4 WHI5 YOR083W SWI4 YER111C physical interactions 15210110 2
[3 rows x 7 columns]
Upvotes: 3