Amaranta_Remedios
Amaranta_Remedios

Reputation: 773

Loop pandas directory

I have many csv files in a directory with two column each

miRNA  read_counts  
miR1      10
miR1      5
miR2      2
miR2      3
miR3     100

I would like to sum read_counts if the miRNA id is the same.

Result:

miRNA  read_counts  
miR1      15
miR2      5
miR3     100

To do that I wrote a little script. However I don't know how to loop it through all my csv files so I don't have to copy paste file names and output each time. Any help will be very appreciated. Thanks for the help!

import pandas as pd

df = pd.read_csv("modified_LC1a_miRNA_expressed.csv")
df_new = df.groupby('miRNA')['read_count'].sum()
print(df_new)
df_new.to_csv('sum_LC1a_miRNA_expressed.csv')

Upvotes: 0

Views: 44

Answers (2)

Haliaetus
Haliaetus

Reputation: 490

Not trying to steal the answer. I would have put this in a comment under @Asif Ali's answer if I had enough rep.

Assuming all input .csv files follow the format: "modified_{rest_of_the_file_name}.csv"

And you want the outputs to be: "sum_{same_rest_of_the_file_name}.csv"

import os
import glob

path = "./your/path"
files = glob.glob(os.path.join(path, "*.csv"))

for file in files:
    df = pd.read_csv(file)
    df_new = df.groupby('miRNA')['read_count'].sum()
    print(df_new)
    df_new.to_csv(file.split('modified')[:-1] + \
                  'sum' + \
                  '_'.join(file.split('modified')[-1:]))

Upvotes: 0

Asif Ali
Asif Ali

Reputation: 1432

Try looking into glob module.

from glob import glob
import os

path = "./your/path"
files = glob(os.path.join(path, "*.csv"))

dataframes = []
for file in files:
    df = pd.read_csv(file)
    # rest you would want to append these to dataframes
    dataframes.append(df)

Then, use pd.concat to join the dataframes and perform the groupby operation.

EDIT 1: Based on the request mentioned in the comment:

results = {}
for file in files:
    df = pd.read_csv(file)
    # perform operation
    df_new = df.groupby('miRNA')['read_count'].sum()
    results[file] = df_new

Upvotes: 1

Related Questions