DJF
DJF

Reputation: 107

Pandas: for loop through columns

My data looks like:

SNP Name ss715583617 ss715592335 ss715591044 ss715598181
4    PI081762           T           A           A           T
5   PI101404A           T           A           A           T
6   PI101404B           T           A           A           T
7    PI135624           T           A           A           T
8    PI326581           T           A           A           T
9   PI326582A           T           A           A           T
10  PI326582B           T           A           A           T
11   PI339732           T           A           A           T
12  PI339735A           T           A           A           T
13  PI339735B           T           A           A           T
14  PI342618A           T           A           A           T

In reality I have a dataset of 50,000 columns of 479 rows. My objective is to go through each column with characters and convert the data to integers depending on which is the most abundant character.

Right now I have the data input, and I have more or less written the function I would like to use to analyze each column separately. However, I can't quite understand how to use a forloop or use the apply function through all of the columns in the dataset. I would prefer not to hardcode the columns because I will have 40,000~50,000 columns to analyze.

My code so far is:

import pandas as pd

df = pd.read_csv("/home/dfreese/Desktop/testSNPtext", delimiter='\t')

df.head() # check that the file format fits

# ncol df
df2 = df.iloc[4:-1] # Select the rows you want to analyze in a subset df
print(df2)

My function:

def countAlleles(N): 
    # N is just suppose to be the column, ideally once I've optimized the function
    # I need to analyze every column 

    # Will hold the counts of each letter in the column
    letterCount = [] 
    # This is a parallel array to know the order
    letterOrder = {'T','A','G','C','H','U'}  


    # Boolean to use which one is the maximum 
    TFlag = None 
    AFlag = None 
    GFlag = None
    CFlag = None
    HFlag = None
    UFlag = None 

# Loop through the column to determine which one is the maximum 
for i in range(len(N)): # How do I get index information of the column?
    if(N[i] == 'T'): # If the element in the column is T
        letterCount[0] = letterCount[0] + 1
    elif(N[i] == 'A'):
        letterCount[1] = letterCount [1] + 1
    elif (N[i] == 'G'):
        letterCount[2] = letterCount [2] + 1
    elif (N[i] == 'C'):
        lettercount[3] = letterCount[3] + 1
    elif(N[i] == 'H'):
        letterCount[4] = letterCount[4] + 1
    else:
        letterCount[5] = letterCount[5] + 1

max = letterCount[0] # This will hold the value of maximum
mIndex = 0 # This holds the index position with the max value

# Determine which one is max
for i in range(len(letterCount)):
    if (letterCount[i] > max):
        max = letterCount[i]
        mIndex = i

So I designed the function to input the column, in hopes to be able to iterate through all the columns of the dataframe. My main question is:

1) How would I pass each in each column as a parameter to the for loop through the elements of each column?

My major source of confusion is how indexes are being used in pandas. I'm familiar with 2-dimensional array in C++ and Java and that is most of where my knowledge stems from.

I'm attempting to use the apply function:

df2 = df2.apply(countAlleles('ss715583617'), axis=2)

but it doesn't seem that my application is correct.

Upvotes: 2

Views: 14132

Answers (2)

dermen
dermen

Reputation: 5362

To iterate over columns in e.g. a for loop, use list(df). Anyhow, you can easily do what you are attempting using collections.Counter

assume a dataframe df

df
#         Name ss715583617 ss715592335 ss715591044 ss715598181
#0    PI081762           T           A           A           T
#1   PI101404A           T           A           A           T
#2   PI101404B           T           A           A           T
#3    PI135624           T           A           A           T
#4    PI326581           T           A           F           D
#5   PI326582A           G           A           F           T
#6   PI326582B           G           A           A           T
#7    PI339732           D           H           A           T
#8   PI339735A           D           A           A           T
#9   PI339735B           A           A           A           T
#10  PI342618A           D           A           A           T

What I gather from the comments sections and your original post, you want to replace each character in each column according to it's frequency of occurrence. This is one approach:

Make the Counters

from collections import Counter

cols       = [ col for col in list(df) if col not in ['Name'] ] # all the column you want to operate on
col_counters = { col: Counter( df[col] ) for col in cols } 
#{'ss715583617': Counter({'T': 5, 'D': 3, 'G': 2, 'A': 1}),
# 'ss715591044': Counter({'A': 9, 'F': 2}),
# 'ss715592335': Counter({'A': 10, 'H': 1}),
# 'ss715598181': Counter({'T': 10, 'D': 1})}

Sort the items in each Counter

sort_func = lambda items: sorted(items, key=lambda x:x[1], reverse=True ) # sort a nested list according to second element in each sublist
sort_result = { col: sort_func(counter.items()) for col,counter in col_counters.iteritems() }
#{'ss715583617': [('T', 5), ('D', 3), ('G', 2), ('A', 1)],
# 'ss715591044': [('A', 9), ('F', 2)],
# 'ss715592335': [('A', 10), ('H', 1)],
# 'ss715598181': [('T', 10), ('D', 1)]}

Replace letters in dataframe according to sort result

Here we will use enumerate to get the position of each sort result

mapper = { col: {letter:i+1 for i,letter in enumerate(sort_result[col]) } for col in sort_result } 
#{'ss715583617': {'A': 4, 'D': 2, 'G': 3, 'T': 1},
# 'ss715591044': {'A': 1, 'F': 2},
# 'ss715592335': {'A': 1, 'H': 2},
# 'ss715598181': {'D': 2, 'T': 1}}

df.replace( to_replace=mapper, inplace=True)
#         Name  ss715583617  ss715592335  ss715591044  ss715598181
#0    PI081762            1            1            1            1
#1   PI101404A            1            1            1            1
#2   PI101404B            1            1            1            1
#3    PI135624            1            1            1            1
#4    PI326581            1            1            2            2
#5   PI326582A            3            1            2            1
#6   PI326582B            3            1            1            1
#7    PI339732            2            2            1            1
#8   PI339735A            2            1            1            1
#9   PI339735B            4            1            1            1
#10  PI342618A            2            1            1            1

This should be enough to get you on your way. I am not sure how you want to handle duplicate elements, for instance if a column has the same number if T and G.

Upvotes: 2

chris-sc
chris-sc

Reputation: 1718

Updated answer: Now the dataframe is analyzed and replaced with the int values according to the occurences of a allele per column. The problem with what happens if one allele has the same number of occurences than the other is still the same - the assignment will be not unique.

import pandas as pd                                                              
import numpy as np                                                               

df = pd.DataFrame.from_dict({"ss1": ["T", "T", "T", "G"],                        
                             "ss2": ["G", "G", "T", "A"],                        
                             "ss3": ["C", "H", "C", "H"]})                       

letterOrder = np.array(['T', 'A', 'G', 'C', 'H', 'U'])                           

for col in df:                                                                   
    alleles = list()                                                             
    for num, allele in enumerate(letterOrder):                                   
        alleles.append(df[col].str.count(allele).sum())                          

    # dictionary with full sorted keys                                           
    repl = letterOrder[np.argsort(alleles)][::-1]                                
    # directly replace chars by value                                            
    for num, char in enumerate(repl):                                            
        df[col].replace(char, num+1, inplace=True)                               

print(df)

This will change the initial dataframe

  ss1 ss2 ss3
0   T   G   C
1   T   G   H
2   T   T   C
3   G   A   H

to the new dataframe with ints sorted according to the number of occurences:

   ss1  ss2  ss3
0    1    1    2
1    1    1    1
2    1    3    2
3    2    2    1

For reference the old answer which gives the maximum column indices:

import pandas as pd                                                              
import numpy as np                                                               
from collections import OrderedDict                                              

df = pd.DataFrame.from_dict({"ss1": ["T", "T", "T", "G"],                                     
                             "ss2": ["G", "G", "T", "A"],                           
                             "ss3": ["C", "H", "C", "H"]})                          

letterOrder = ['T', 'A', 'G', 'C', 'H', 'U']                                        

full_results = OrderedDict()                                                        
for col in df:                                                                      
    alleles = list()                                                                
    for num, allele in enumerate(letterOrder):                                      
        alleles.append(df[col].str.count(allele).sum())                             

    full_results[col] = [letterOrder[np.argmax(alleles)], np.max(alleles)]       

print(full_results)

This will give:

OrderedDict([('ss1', ['T', 3]), ('ss2', ['G', 2]), ('ss3', ['C', 2])])

The key in the dict is the name of your column, and the value is a list with [allele, number_of_occurences].

I used OrderedDict to keep the order of your columns and the name, but if you don't need the order, you can use a dict, or if you don't need the column name (and the implicit ID is enough), use a list.

But be careful: If in one column two (or more) characters have the same number of counts, this will only return one of them. You would need to add an additional test for this.

Upvotes: 2

Related Questions