Reputation: 55
I'm currently in a project and I need to work with a lot of CSV files, which are filled with data something like this:
CSV1.csv
A B C D ...
1 1980 1 0.9 0.8
2 2003 0.9 0.8 0.2
3 1665 0.7 0.2 0.4
4 1982 0.6 1 0.2
...
CSV2.csv
A E F G ...
1 1665 1 0.4 1
2 1980 0.4 0.8 0.6
3 2003 0.1 0.3 0.9
4 1982 0.3 1 0.6
...
All of the CSV files have the same values in the A
column, but are disorganized. I am importing all the files like this:
path = r"/Users/.../folder/"
all_files = glob.glob(path + "/*.CSV")
all_csv = (pd.read_csv(f, sep=',') for f in all_files)
df_merged = pd.concat(all_csv, axis=1, ignore_index=False)
It gets merged, but the dataframe is disorganized.
This way is not correct to sort with df_merged.sort()
because there is not a column with the same order at A
. I know that I can manually import each one of the csv files and apply a .sort()
, but those are 394 csv files...
I feel like something like this might be applicable in a large import of CSV files, but I don't know how to call a code line before the dataframe combination gets made (all_csv
is a generator object).
P.S. at the end I execute to eliminate repeated A
columns:
df_merged = df_merged.loc[:, ~df_merged.columns.duplicated()]
Upvotes: 0
Views: 493
Reputation: 1388
Alignment can be obtained by setting A as the index.
Using a list of dataframes is not appealing as this can take a lot of memory.
import pandas as pd
path = r"/Users/.../folder/"
all_files = glob.glob(path + "/*.CSV")
df = pd.read_csv(all_files[0], sep=',').set_index('A')
for f in all_files[1:]:
dfs = pd.read_csv(f, sep=',').set_index('A')
df = pd.concat([df, dfs], axis=1)
functools.reduce
import pandas as pd
from functools import reduce
path = r"/Users/.../folder/"
all_files = glob.glob(path + "/*.CSV")
def_gen = (pd.read_csv(io.StringIO(fn), sep='\s+').set_index('A') for fn in all_files)
df = reduce(lambda df, d: pd.concat([df, d], axis=1), def_gen)
df
:
B C D E F G
A
1665 0.7 0.2 0.4 1.0 0.4 1.0
1980 1.0 0.9 0.8 0.4 0.8 0.6
1982 0.6 1.0 0.2 0.3 1.0 0.6
2003 0.9 0.8 0.2 0.1 0.3 0.9
Personally, I would take the easy path ("solution 1") and add some logging to identify where there will be an import error. Because in real world data is rarely clean and well formatted.
Upvotes: 0
Reputation:
Instead of using concat, you should merge each dataframe together:
df = all_csv[0]
for csv in all_csv[1:]:
df = df.merge(csv)
Output:
>>> df
A B C D E F G
0 1980 1.0 0.9 0.8 0.4 0.8 0.6
1 2003 0.9 0.8 0.2 0.1 0.3 0.9
2 1665 0.7 0.2 0.4 1.0 0.4 1.0
3 1982 0.6 1.0 0.2 0.3 1.0 0.6
Note: you need to make all_csv
a list instead of a generator:
all_csv = [pd.read_csv(f, sep=',') for f in all_files]
Upvotes: 1