Jose Ramon
Jose Ramon

Reputation: 5444

Process multiple csv files on pandas

I have got three different .csv files which contain the grades for students in three different assignment. I would like to read them with pandas and calculate the average for each student. The template for each file is:

Student id, Mark, extra fields, ...
4358975489, 9,  ... ...
2345234523, 10,  ... ...
7634565323, 7,  ... ...
7653563366, 7,  ... ...
...         ...,  ... ...

For the second assignment:

Student id, Mark, extra fields, ...
4358975489, 6,  ... ...
2345234523, 8,  ... ...
7634565323, 4,  ... ...
7653563366, 5,  ... ...
...         ...,  ... ...

Desired output for the two doc for instance:

Student id, average, extra fields, ...
4358975489, 7.5,  ... ...
2345234523, 9,  ... ...
7634565323, 5.5,  ... ...
7653563366, 6,  ... ...
...         ...,  ... ...

the same for the last doc. I want to read these docs separately and for each student id to average the Mark.

Now, my code for reading one file is the following:

i_df1 = pandas.read_csv('first.csv')
i_df2 = pandas.read_csv('second.csv')
i_df3 = pandas.read_csv('third.csv')

print (o_df.keys())
for i, row in i_df1.iterrows():
    pdb.set_trace()

How can I process all three files simultaneously and extract the average grade?

Upvotes: 0

Views: 187

Answers (4)

Mayank Porwal
Mayank Porwal

Reputation: 34046

If you want to process all dfs together, you can do this:

df = df1.append([df2, df3]).groupby('Student id', as_index=False).mean()

OR:

If you want to do it simultaneously, you can use list comprehension with df.append and mean:

Below is the list of your dfs:

In [1220]: df_list = [i_df1, i_df2, i_df3]

You can simultaneously find average of each student in a file and store the output in another list:

In [1223]: df = [i.groupby('Student_id', as_index=False).mean() for i in df_list]

Upvotes: 1

Kim Rop
Kim Rop

Reputation: 142

using the data you gave assuming they get same marks three times

import pandas as pd
import numpy as np


data = [
    [4358975489, 9],
     [2345234523, 10],
    [7634565323, 7]]



data = np.array(data)
data = pd.DataFrame(data, columns=["student", "mark"])
data1 = pd.DataFrame(data, columns=["student", "mark"])
data2 = pd.DataFrame(data, columns=["student", "mark"])

std_maks = pd.concat([data, data1, data2]).groupby('student')
print(std_maks['mark'].mean())

Upvotes: 1

KenHBS
KenHBS

Reputation: 7164

You could also use the filenames and directly concatenate the dataframes together:

fnames = ["first.csv", "second.csv", "third.csv"]
df = pd.concat(pd.read_csv(fname) for fname in fnames)

df.groupby("Student id")["Mark"].mean()

Probably practically irrelevant, but perhaps nice to know anyway: This approach doesn't load the data into your memory twice, but only once.

Upvotes: 1

Mehdi Golzadeh
Mehdi Golzadeh

Reputation: 2583

Use pd.concat to concat 3 df:

 i_df1 = pandas.read_csv('first.csv')
 i_df2 = pandas.read_csv('second.csv')
 i_df3 = pandas.read_csv('third.csv')

 df = pd.concat([i_df1, i_df2, i_df3])
 df.groupby('Student id').agg({'Mark':'mean'})

Upvotes: 1

Related Questions