Reputation: 1475
I have following dataframe as an output of my python script. I would like to add another column with count per pmid and add the counter to the first row, keeping the other rows.
The dataframe looks like this:
df
PMID gene_symbol gene_label gene_mentions
0 33377242 MTHFR Matched Gene 2
1 33414971 CSF3R Matched Gene 13
2 33414971 BCR Other Gene 2
3 33414971 ABL1 Matched Gene 1
4 33414971 ESR1 Matched Gene 1
5 33414971 NDUFB3 Other Gene 1
6 33414971 CSF3 Other Gene 1
7 33414971 TP53 Matched Gene 2
8 33414971 SRC Matched Gene 1
9 33414971 JAK1 Matched Gene 1
Expected out is:
PMID gene_symbol gene_label gene_mentions count
0 33377242 MTHFR Matched Gene 2 1
1 33414971 CSF3R Matched Gene 13 9
2 33414971 BCR Other Gene 2 9
3 33414971 ABL1 Matched Gene 1 9
4 33414971 ESR1 Matched Gene 1 9
5 33414971 NDUFB3 Other Gene 1 9
6 33414971 CSF3 Other Gene 1 9
7 33414971 TP53 Matched Gene 2 9
8 33414971 SRC Matched Gene 1 9
9 33414971 JAK1 Matched Gene 1 9
10 33414972 MAK2 Matched Gene 1 1
How can I achieve this output?
Thanks
Upvotes: 1
Views: 364
Reputation: 150805
You can add count for each row with groupby().transform
:
df['count'] = df.groupby('PMID')['PMID'].transform('size')
Output:
PMID gene_symbol gene_label gene_mentions count
0 33377242 MTHFR Matched Gene 2 1
1 33414971 CSF3R Matched Gene 13 9
2 33414971 BCR Other Gene 2 9
3 33414971 ABL1 Matched Gene 1 9
4 33414971 ESR1 Matched Gene 1 9
5 33414971 NDUFB3 Other Gene 1 9
6 33414971 CSF3 Other Gene 1 9
7 33414971 TP53 Matched Gene 2 9
8 33414971 SRC Matched Gene 1 9
9 33414971 JAK1 Matched Gene 1 9
Now if you really want only count at the first row for each PMID
, you can use mask
:
df['count'] = df['count'].mask(df['PMID'].duplicated())
Then you would have:
PMID gene_symbol gene_label gene_mentions count
0 33377242 MTHFR Matched Gene 2 1.0
1 33414971 CSF3R Matched Gene 13 9.0
2 33414971 BCR Other Gene 2 NaN
3 33414971 ABL1 Matched Gene 1 NaN
4 33414971 ESR1 Matched Gene 1 NaN
5 33414971 NDUFB3 Other Gene 1 NaN
6 33414971 CSF3 Other Gene 1 NaN
7 33414971 TP53 Matched Gene 2 NaN
8 33414971 SRC Matched Gene 1 NaN
9 33414971 JAK1 Matched Gene 1 NaN
Upvotes: 1
Reputation: 891
You can do it with something like this:
data = [
{"PMID": "33377242", "gene_label":"MTHFR", "gene_mentions": "Matched Gene"},
{"PMID": "33377242", "gene_label":"MTHFR", "gene_mentions": "Matched Gene"},
{"PMID": "33377242", "gene_label":"MTHFR", "gene_mentions": "Matched Gene"},
{"PMID": "33414971", "gene_label":"MTHFR", "gene_mentions": "Matched Gene"},
{"PMID": "33377242", "gene_label":"MTHFR", "gene_mentions": "Matched Gene"},
{"PMID": "33414972", "gene_label":"MTHFR", "gene_mentions": "Matched Gene"},
{"PMID": "33414971", "gene_label":"MTHFR", "gene_mentions": "Matched Gene"}
]
df = pd.DataFrame(data)
df = df.sort_values("PMID")
df["count"] = df.groupby("PMID")["PMID"].transform("count")
# get position of first element of each group
conditions = df["PMID"] != df["PMID"].shift()
# remove count of non first row of each group
df["count"][~conditions] = None
or:
df = df.sort_values("PMID")
# get position of first element of each group
conditions = df["PMID"] != df["PMID"].shift()
# write count only in first row
df["count"] = None
df["count"][conditions] = df.groupby("PMID")["PMID"].transform("count")[conditions]
Upvotes: 0
Reputation: 96287
You just want to group by the ID and get the size of the group:
df.set_index("PMID", inplace=True)
df['count'] = df.groupby("PMID").size()
And if you want, you can:
df.reset_index(inplace=True)
Upvotes: 0