Counting Unique Names in a Pandas Data Frame

Question

I have reduced my large data frame to this small example:

IDX POS     REF ALT
13  633     C   A
15  643     C   T
42  2015    G   A
43  2016    G   A
151 9538    T   C
154 9542    TC  TCC,T
169 10041   T   A
170 10041   T   TAA,TA

The data come from a genomic region: each row gives a nucleotide position, the reference genome nucleotide at that position, and the alternative nucleotides observed in different people at the same position. Some positions (9542 and 10041) have two different alternative nucleotides.

I want to go through the ALT column, count the number of unique alternatives in each row, and store the counts in a new column. I haven't found how to do this with pandas.

The new data frame will then look like this:

IDX POS     REF ALT   COUNT
13  633     C   A        1
15  643     C   T        1
42  2015    G   A        1
43  2016    G   A        1
151 9538    T   C        1
154 9542    TC  TCC,T    2
169 10041   T   A        1
170 10041   T   TAA,TA   2

How can I do this with pandas (or plain Python)?

Thank you.

Rodrigo

piRSquared · Accepted Answer

I'd count the commas and add 1, since n commas separate n + 1 alternatives:

df['COUNT'] = df.ALT.str.count(',') + 1
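For anyone who wants to try this end to end, here is a minimal sketch that rebuilds the example frame from the question and applies the one-liner. The column name UNIQUE_COUNT and the set-based variant are my own additions, not part of the original answer. They are only needed if an ALT entry could repeat the same alternative (e.g. "A,A"), which the sample data does not show.

```python
import pandas as pd

# Rebuild the example data frame from the question.
df = pd.DataFrame({
    "IDX": [13, 15, 42, 43, 151, 154, 169, 170],
    "POS": [633, 643, 2015, 2016, 9538, 9542, 10041, 10041],
    "REF": ["C", "C", "G", "G", "T", "TC", "T", "T"],
    "ALT": ["A", "T", "A", "A", "C", "TCC,T", "A", "TAA,TA"],
})

# Comma counting: n commas separate n + 1 entries.
# This assumes no entry is repeated within a single ALT value.
df["COUNT"] = df["ALT"].str.count(",") + 1

# Hypothetical variant (not in the original answer): split on commas
# and count distinct entries, in case duplicates can occur.
df["UNIQUE_COUNT"] = df["ALT"].str.split(",").apply(
    lambda alleles: len(set(alleles))
)

print(df)
```

On the sample data both columns agree. The comma count is the cheaper of the two, so the set-based version is probably only worth it when duplicates are a real possibility.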
