Reputation: 87
I have been struggling with this issue for hours now and I can't seem to figure it out. I would really appreciate any input that would help.
Background
I am trying to automate data manipulation for my research lab in school with Python. Each experiment produces a .csv file containing 41 rows of data, excluding the header, as seen below.
Sometimes multiple runs of the same experiment exist, which produces .csv files with the same header, and taking an average of them is needed for accuracy. Something like this, with the same number of rows and headers:
So far I have been able to filter the basenames down to only the .csv files with the same parameters and add them to a data frame. However, my issue is that I don't know how to continue from there to get an average.
My Current Code and output
Code:
import pandas as pd
import os

dir = "/Users/luke/Desktop/testfolder"
files = os.listdir(dir)

# group filenames by experiment parameters: strip ".csv" plus the trailing
# run number so repeated runs of the same experiment share one key
files_of_interests = {}
for filename in files:
    if filename[-4:] == '.csv':
        key = filename[:-5]
        files_of_interests.setdefault(key, [])
        files_of_interests[key].append(filename)

print(files_of_interests)

for key in files_of_interests:
    stack_df = pd.DataFrame()
    print(stack_df)
    for filename in files_of_interests[key]:
        # append() stacks each run below the previous one
        stack_df = stack_df.append(pd.read_csv(os.path.join(dir, filename)))
    print(stack_df)
Output:
Empty DataFrame
Columns: []
Index: []
Unnamed: 0 Wavelength S2c Wavelength.1 S2
0 0 1100 0.000342 1100 0.000304
1 1 1110 0.000452 1110 0.000410
2 2 1120 0.000468 1120 0.000430
3 3 1130 0.000330 1130 0.000306
4 4 1140 0.000345 1140 0.000323
.. ... ... ... ... ...
36 36 1460 0.002120 1460 0.001773
37 37 1470 0.002065 1470 0.001693
38 38 1480 0.002514 1480 0.002019
39 39 1490 0.002505 1490 0.001967
40 40 1500 0.002461 1500 0.001891
[164 rows x 5 columns]
Question Here!
So my question is: how do I get it to append towards the right individually for each S2c and S2?
Explanation:
With multiple .csv files having the same header names, when I append them they just keep stacking towards the bottom of the previous .csv file, which is what led to the [164 rows x 5 columns] in the previous section. My original idea was to create a new data frame and append only S2c and S2 from each of those .csv files, so that instead of stacking on top of one another they keep being appended as new columns towards the right. Afterward, I can do some form of pandas column manipulation to have them added together and divided by the number of runs (which is just the number of files, i.e. len(files_of_interests[key]) under the second FOR loop).
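Something like this is what I had in mind for one group of runs; this is just a sketch, the per-run column names (S2c_0, S2_0, ...) are made up for illustration, and it assumes every run shares the same Wavelength grid of 1100 to 1500 in steps of 10:
import os
import pandas as pd

key = next(iter(files_of_interests))               # one group of runs
wide = pd.DataFrame({'Wavelength': range(1100, 1510, 10)})
for i, filename in enumerate(files_of_interests[key]):
    run = pd.read_csv(os.path.join(dir, filename))
    wide[f'S2c_{i}'] = run['S2c'].values           # one new column per run
    wide[f'S2_{i}'] = run['S2'].values

# add the per-run columns together and divide by the number of runs
n_runs = len(files_of_interests[key])
wide['S2c_avg'] = wide.filter(like='S2c_').sum(axis=1) / n_runs
wide['S2_avg'] = wide.filter(like='S2_').sum(axis=1) / n_runs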
What I have tried
I have tried creating an empty data frame, adding a column taken from np.arange(1100,1500,10) using pd.DataFrame.from_records(), and appending S2c and S2 to it as described in the previous section. The same issue occurred, and on top of that it produced a bunch of NaN values, which I am not well equipped to deal with even after searching further.
I have read up on multiple other questions posted here; many suggested using pd.concat, but since the answers are tailored to different situations I couldn't replicate them, nor was I able to understand the documentation, so I stopped pursuing this path.
Thank you in advance for your help!
Additional Info
I am using macOS and ATOM for the code.
The csv files can be found here!
github: https://github.com/teoyi/PROJECT-Automate-Research-Process
Trying out @zabop's method
Code:
dflist = []
for key in files_of_interests:
    for filename in files_of_interests[key]:
        dflist.append(pd.read_csv(os.path.join(dir, filename)))

concat = pd.concat(dflist, axis=1)
concat.to_csv(dir + '/concat.csv')
Output: (screenshot of the resulting concat.csv, not reproduced here)
Trying @SergeBallesta's method
Code:
df = pd.concat([pd.read_csv(os.path.join(dir, filename))
                for key in files_of_interests
                for filename in files_of_interests[key]])
df = df.groupby(['Unnamed: 0', 'Wavelength', 'Wavelength.1']).mean().reset_index()
df.to_csv(dir + '/try.csv')
print(df)
Output: (screenshot of the resulting try.csv, not reproduced here)
Upvotes: 3
Views: 374
Reputation: 149185
IIUC you have a number of csv files that all share the same columns, including a saved index column named 'Unnamed: 0', and you would like to get the average values of the S2 and S2c columns for the same Wavelength value.
This can be done simply with groupby and mean, but we first have to filter out all the unnecessary columns. That can be done through the index_col and usecols parameters of read_csv:
...
print(files_of_interests)

# first concat the datasets:
dfs = [pd.read_csv(os.path.join(dir, filename), index_col=1,
                   usecols=lambda x: not x.startswith('Unnamed: '))
       for key in files_of_interests for filename in files_of_interests[key]]
df = pd.concat(dfs).reset_index()

# then take the averages
df = df.groupby(['Wavelength', 'Wavelength.1']).mean().reset_index()

# reorder the columns and add 1 to the index to have it run from 1 to 41
df = df.reindex(columns=['Wavelength', 'S2c', 'Wavelength.1', 'S2'])
df.index += 1
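To see why the groupby/mean step does the averaging, here is a tiny self-contained example with made-up numbers: once the runs are stacked vertically, rows sharing a Wavelength value come from different runs, and mean() collapses them:
import pandas as pd

# two runs on the same Wavelength grid, made-up values
run1 = pd.DataFrame({'Wavelength': [1100, 1110], 'S2c': [0.2, 0.4], 'S2': [0.1, 0.3]})
run2 = pd.DataFrame({'Wavelength': [1100, 1110], 'S2c': [0.4, 0.6], 'S2': [0.3, 0.5]})

stacked = pd.concat([run1, run2])            # 4 rows, runs stacked vertically
print(stacked.groupby('Wavelength').mean())
# S2c becomes 0.3 and 0.5, S2 becomes 0.2 and 0.4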
If there are still unwanted columns in the resulting df, this magic command will help to identify the original files having a weird structure:
import pprint
# read each file again without filtering so that any extra columns show up
pprint.pprint([pd.read_csv(os.path.join(dir, filename)).columns
               for key in files_of_interests
               for filename in files_of_interests[key]])
With the files from the github testfolder, it gives:
[Index(['Unnamed: 0', 'Wavelength', 'S2c', 'Wavelength.1', 'S2'], dtype='object'),
Index(['Unnamed: 0', 'Wavelength', 'S2c', 'Wavelength.1', 'S2'], dtype='object'),
Index(['Unnamed: 0', 'Wavelength', 'S2c', 'Wavelength.1', 'S2'], dtype='object'),
Index(['Unnamed: 0', 'Wavelength', 'S2c', 'Wavelength.1', 'S2'], dtype='object'),
Index(['Unnamed: 0', 'Unnamed: 0.1', 'Wavelength', 'S2c', 'Wavelength.1',
'S2'],
dtype='object'),
Index(['Unnamed: 0', 'Wavelength', 'S2c', 'Wavelength.1', 'S2'], dtype='object')]
This makes it clear that the fifth file has an additional column.
Upvotes: 1
Reputation: 87
Turns out both @zabop and @SergeBallesta have provided me with valuable insights on how to work on this issue with pandas.
What I wanted to have:
The respective S2c and S2 columns of each file within the key:value pairs merged into one .csv file for further manipulation.
Redundant columns removed, so that only a single Wavelength column remains, ranging from 1100 to 1500 with an increment of 10.
This requires the use of pd.concat, which was introduced by @zabop and @SergeBallesta, as shown below:
for key in files_of_interests:
    list = []
    for filename in files_of_interests[key]:
        list.append(pd.read_csv(os.path.join(dir, filename)))
    df = pd.concat(list, axis=1)
    df = df.drop(['Unnamed: 0', 'Wavelength.1'], axis=1)
    print(df)
    df.to_csv(os.path.join(dir, f"{filename[:-5]}_master.csv"))
I had to use files_of_interests[key] for it to be able to read the filenames and have pd.read_csv read from the correct path. Other than that, I added axis = 1 to pd.concat, which makes it concatenate horizontally, along with for loops to access the filenames correctly. (I have double-checked the values and they do match up with the respective .csv files.)
The output to .csv looks like this: (screenshot of the resulting _master.csv, not reproduced here)
The only issue now is that groupby, as suggested by @SergeBallesta, did not work: it returns ValueError: Grouper for 'Wavelength' not 1-dimensional. I will be creating a new question for this if I make no progress by the end of the day.
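A minimal made-up frame reproduces the error, which seems to come from the duplicate column names that pd.concat(..., axis=1) creates; stacking vertically (the default axis=0) keeps one column per name and lets groupby work:
import pandas as pd

# made-up frames standing in for two runs of one experiment
run1 = pd.DataFrame({'Wavelength': [1100, 1110], 'S2': [0.1, 0.2]})
run2 = pd.DataFrame({'Wavelength': [1100, 1110], 'S2': [0.3, 0.4]})

side = pd.concat([run1, run2], axis=1)
print(side.columns.tolist())    # ['Wavelength', 'S2', 'Wavelength', 'S2']
# side.groupby('Wavelength')    # raises the same ValueError because the
#                               # label 'Wavelength' now matches two columns

stacked = pd.concat([run1, run2])                # axis=0, one column per name
print(stacked.groupby('Wavelength').mean())      # per-Wavelength averages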
Once again, a big thank you to @zabop and @SergeBallesta for giving this a try even though my explanation wasn't too clear; their answers have definitely provided me with much-needed insight into how pandas works.
Upvotes: 0
Reputation: 7932
If you have a list of dataframes, for example:
import pandas as pd
data = {'col_1': [3, 2, 1, 0], 'col_2': [3, 1, 2, 0]}
dflist = [pd.DataFrame.from_dict(data) for _ in range(5)]
You can do:
pd.concat(dflist,axis=1)
Which will look like:
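   col_1  col_2  col_1  col_2  col_1  col_2  col_1  col_2  col_1  col_2
0      3      3      3      3      3      3      3      3      3      3
1      2      1      2      1      2      1      2      1      2      1
2      1      2      1      2      1      2      1      2      1      2
3      0      0      0      0      0      0      0      0      0      0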
If you want to append each column name with a number indicating which df they came from, before concat, do:
for index, df in enumerate(dflist):
    df.columns = [col + '_' + str(index) for col in df.columns]
Then pd.concat(dflist, axis=1), resulting in:
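   col_1_0  col_2_0  col_1_1  col_2_1  ...  col_1_4  col_2_4
0        3        3        3        3  ...        3        3
1        2        1        2        1  ...        2        1
2        1        2        1        2  ...        1        2
3        0        0        0        0  ...        0        0

[4 rows x 10 columns]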
While I can't reproduce your file system and confirm that this works, to create the dflist above from your files, something like this should work:
dflist = []
for key in files_of_interests:
    for filename in files_of_interests[key]:
        dflist.append(pd.read_csv(os.path.join(dir, filename)))
Upvotes: 1