How can I sort and concat a csv file in a dataframe

Question

I'm currently in a project and I need to work with a lot of CSV files, which are filled with data something like this:

    CSV1.csv

      A      B     C     D    ...
    1 1980  1     0.9   0.8
    2 2003  0.9   0.8   0.2
    3 1665  0.7   0.2   0.4
    4 1982  0.6   1     0.2
   ...

     CSV2.csv
      A      E     F     G   ...
    1 1665  1     0.4   1
    2 1980  0.4   0.8   0.6
    3 2003  0.1   0.3   0.9
    4 1982  0.3   1     0.6
   ...

All of the CSV files have the same values in the A column, but are disorganized. I am importing all the files like this:

path = r"/Users/.../folder/"
all_files = glob.glob(path + "/*.CSV")
all_csv = (pd.read_csv(f, sep=',') for f in all_files)
df_merged   = pd.concat(all_csv, axis=1, ignore_index=False)

It gets merged, but the dataframe is disorganized.

This way is not correct to sort with df_merged.sort() because there is not a column with the same order at A. I know that I can manually import each one of the csv files and apply a .sort(), but those are 394 csv files...

I feel like something like this might be applicable in a large import of CSV files, but I don't know how to call a code line before the dataframe combination gets made (all_csv is a generator object).

P.S. at the end I execute to eliminate repeated A columns:

df_merged = df_merged.loc[:, ~df_merged.columns.duplicated()]

hpchavaz · Accepted Answer

Alignment can be obtained by setting A as the index.

Using a list of dataframes is not appealing as this can take a lot of memory.

solution 1: Build the other dataframes in the loop

import pandas as pd

path = r"/Users/.../folder/"
all_files = glob.glob(path + "/*.CSV")

df = pd.read_csv(all_files[0], sep=',').set_index('A')
for f in all_files[1:]:
    dfs = pd.read_csv(f, sep=',').set_index('A')
    df = pd.concat([df, dfs], axis=1)

solution 2: Keep the generator and use functools.reduce

import pandas as pd
from functools import reduce

path = r"/Users/.../folder/"
all_files = glob.glob(path + "/*.CSV")

def_gen = (pd.read_csv(io.StringIO(fn), sep='\s+').set_index('A') for fn in all_files)
df = reduce(lambda df, d: pd.concat([df, d], axis=1), def_gen)

df:

        B    C    D    E    F    G
A                                 
1665  0.7  0.2  0.4  1.0  0.4  1.0
1980  1.0  0.9  0.8  0.4  0.8  0.6
1982  0.6  1.0  0.2  0.3  1.0  0.6
2003  0.9  0.8  0.2  0.1  0.3  0.9

Personally, I would take the easy path ("solution 1") and add some logging to identify where there will be an import error. Because in real world data is rarely clean and well formatted.

How can I sort and concat a csv file in a dataframe

Answers (2)

Related Questions