BioProgram
BioProgram

Reputation: 704

concatenate pandas dataframe in a loop of files

I am trying to write a script that loops over files via a certain pattern/variable, then it concatenates the 8th column of the files while keeping the first 4 columns which are common to all files. The script works if I use the following command:

reader = csv.reader(open("1isoforms.fpkm_tracking.txt", 'rU'), delimiter='\t') #to read the header names so i can use them as index. all headers for the three files are the same
header_row = reader.next() # Gets the header
df1 =  pd.read_csv("1isoforms.fpkm_tracking.txt", index_col=header_row[0:4], sep="\t") #file #1 with index as first 5 columns
df2 = pd.read_csv("2isoforms.fpkm_tracking.txt", index_col=header_row[0:4], sep="\t") #file #2 with index as first 5 columns
df3 = pd.read_csv("3isoforms.fpkm_tracking.txt", index_col=header_row[0:4], sep="\t") #file #3 with index as first 5 columns

result = pd.concat([df1.ix[:,4], df2.ix[:,4]], keys=["Header1", "Header2", "Header3"], axis=1) #concatenates the 8th column of the files and changes the header
result.to_csv("OutputTest.xls", sep="\t")

While this works, it is NOT practical for me to enter file names one by one as I sometimes have 100's of files, so cant type in a df...function for each. Instead, I was trying to use a for loop to do this but i couldnt figure it out. here is what I have so far:

k=0
for geneFile in glob.glob("*_tracking*"):
    while k < 3:
         reader = csv.reader(open(geneFile, 'rU'), delimiter='\t')
         header_row = reader.next()
         key = str(k)
         key = pd.read_csv(geneFile, index_col=header_row[0:1], sep="\t")
   result = pd.concat([key[:,5]], axis=1)
   result.to_csv("test2.xls", sep="\t")

However, this is not working .

The issues I am facing are as follows:

  1. How can I be able to iterate over input files and generate different variable names for each which I can then have it used in the pd.concat function one after the other?

  2. How can I use a for loop to generate a string file name that is a combination of df and an integer

  3. How can I fix the above script get my desired item.

  4. A minor issue is regarding the way I am using the col_index function: is there a way to use the column # rather than column names? I know it works for index_col=0 or any single #. But I couldn't use integers for > 1 column of indexing.

Note that all files have the exact same structure, and the index columns are the same.

Your feedback is highly appreciated.

Upvotes: 5

Views: 7401

Answers (1)

Parfait
Parfait

Reputation: 107567

Consider using merge with right_index and left_index arguments:

import pandas as pd

numberoffiles = 100

# FIRST IMPORT (CREATE RESULT DATA FRAME)
result = pd.read_csv("1isoforms.fpkm_tracking.txt", sep="\t",
                      index_col=[0,1,2,3], usecols=[0,1,2,3,7])

# ALL OTHER IMPORTS (MERGE TO RESULT DATA FRAME, 8TH COLUMN SUFFIXED ITERATIVELY)
for i in range(2,numberoffiles+1):    
    df = pd.read_csv("{}isoforms.fpkm_tracking.txt".format(i), sep="\t",
                     index_col=[0,1,2,3], usecols=[0,1,2,3,7])

    result = pd.merge(result, df, right_index=True, left_index=True, suffixes=[i-1, i])

result.to_excel("Output.xlsx")
result.to_csv("Output.csv")

Upvotes: 2

Related Questions