rshar
rshar

Reputation: 1475

Add counter as an additional column in Python pandas dataframe

I have following dataframe as an output of my python script. I would like to add another column with count per pmid and add the counter to the first row, keeping the other rows.

The dataframe looks like this:

df

       PMID gene_symbol    gene_label gene_mentions
0  33377242       MTHFR  Matched Gene             2
1  33414971       CSF3R  Matched Gene            13
2  33414971         BCR    Other Gene             2
3  33414971        ABL1  Matched Gene             1
4  33414971        ESR1  Matched Gene             1
5  33414971      NDUFB3    Other Gene             1
6  33414971        CSF3    Other Gene             1
7  33414971        TP53  Matched Gene             2
8  33414971         SRC  Matched Gene             1
9  33414971        JAK1  Matched Gene             1

Expected out is:

       PMID gene_symbol    gene_label gene_mentions   count
0  33377242       MTHFR  Matched Gene             2   1
1  33414971       CSF3R  Matched Gene            13   9
2  33414971         BCR    Other Gene             2   9
3  33414971        ABL1  Matched Gene             1   9
4  33414971        ESR1  Matched Gene             1   9
5  33414971      NDUFB3    Other Gene             1   9
6  33414971        CSF3    Other Gene             1   9
7  33414971        TP53  Matched Gene             2   9
8  33414971         SRC  Matched Gene             1   9
9  33414971        JAK1  Matched Gene             1   9
10 33414972        MAK2  Matched Gene             1   1

How can I achieve this output?

Thanks

Upvotes: 1

Views: 364

Answers (3)

Quang Hoang
Quang Hoang

Reputation: 150805

You can add count for each row with groupby().transform:

df['count'] = df.groupby('PMID')['PMID'].transform('size')

Output:

       PMID gene_symbol    gene_label  gene_mentions  count
0  33377242       MTHFR  Matched Gene              2      1
1  33414971       CSF3R  Matched Gene             13      9
2  33414971         BCR    Other Gene              2      9
3  33414971        ABL1  Matched Gene              1      9
4  33414971        ESR1  Matched Gene              1      9
5  33414971      NDUFB3    Other Gene              1      9
6  33414971        CSF3    Other Gene              1      9
7  33414971        TP53  Matched Gene              2      9
8  33414971         SRC  Matched Gene              1      9
9  33414971        JAK1  Matched Gene              1      9

Now if you really want only count at the first row for each PMID, you can use mask:

df['count'] = df['count'].mask(df['PMID'].duplicated())

Then you would have:

       PMID gene_symbol    gene_label  gene_mentions  count
0  33377242       MTHFR  Matched Gene              2    1.0
1  33414971       CSF3R  Matched Gene             13    9.0
2  33414971         BCR    Other Gene              2    NaN
3  33414971        ABL1  Matched Gene              1    NaN
4  33414971        ESR1  Matched Gene              1    NaN
5  33414971      NDUFB3    Other Gene              1    NaN
6  33414971        CSF3    Other Gene              1    NaN
7  33414971        TP53  Matched Gene              2    NaN
8  33414971         SRC  Matched Gene              1    NaN
9  33414971        JAK1  Matched Gene              1    NaN

Upvotes: 1

Ángel Igualada
Ángel Igualada

Reputation: 891

You can do it with something like this:

data = [
    {"PMID": "33377242", "gene_label":"MTHFR", "gene_mentions": "Matched Gene"},
    {"PMID": "33377242", "gene_label":"MTHFR", "gene_mentions": "Matched Gene"},
    {"PMID": "33377242", "gene_label":"MTHFR", "gene_mentions": "Matched Gene"},
    {"PMID": "33414971", "gene_label":"MTHFR", "gene_mentions": "Matched Gene"},
    {"PMID": "33377242", "gene_label":"MTHFR", "gene_mentions": "Matched Gene"},
    {"PMID": "33414972", "gene_label":"MTHFR", "gene_mentions": "Matched Gene"},
    {"PMID": "33414971", "gene_label":"MTHFR", "gene_mentions": "Matched Gene"}
]

df = pd.DataFrame(data)
df = df.sort_values("PMID")
df["count"] = df.groupby("PMID")["PMID"].transform("count")
# get position of first element of each group
conditions = df["PMID"] != df["PMID"].shift() 
# remove count of non first row of each group
df["count"][~conditions] = None 

or:

df = df.sort_values("PMID")
# get position of first element of each group
conditions = df["PMID"] != df["PMID"].shift() 
# write count only in first row
df["count"] = None
df["count"][conditions] = df.groupby("PMID")["PMID"].transform("count")[conditions] 

enter image description here

Upvotes: 0

juanpa.arrivillaga
juanpa.arrivillaga

Reputation: 96287

You just want to group by the ID and get the size of the group:

df.set_index("PMID", inplace=True)
df['count'] = df.groupby("PMID").size()

And if you want, you can:

df.reset_index(inplace=True)

Upvotes: 0

Related Questions