Juank
Juank

Reputation: 55

How can I sort and concat a csv file in a dataframe

I'm currently in a project and I need to work with a lot of CSV files, which are filled with data something like this:

    CSV1.csv

      A      B     C     D    ...
    1 1980  1     0.9   0.8
    2 2003  0.9   0.8   0.2
    3 1665  0.7   0.2   0.4
    4 1982  0.6   1     0.2
   ...

     CSV2.csv
      A      E     F     G   ...
    1 1665  1     0.4   1
    2 1980  0.4   0.8   0.6
    3 2003  0.1   0.3   0.9
    4 1982  0.3   1     0.6
   ...
  

All of the CSV files have the same values in the A column, but are disorganized. I am importing all the files like this:

path = r"/Users/.../folder/"
all_files = glob.glob(path + "/*.CSV")
all_csv = (pd.read_csv(f, sep=',') for f in all_files)
df_merged   = pd.concat(all_csv, axis=1, ignore_index=False) 

It gets merged, but the dataframe is disorganized.

This way is not correct to sort with df_merged.sort() because there is not a column with the same order at A. I know that I can manually import each one of the csv files and apply a .sort(), but those are 394 csv files...

I feel like something like this might be applicable in a large import of CSV files, but I don't know how to call a code line before the dataframe combination gets made (all_csv is a generator object).


P.S. at the end I execute to eliminate repeated A columns:

df_merged = df_merged.loc[:, ~df_merged.columns.duplicated()]

Upvotes: 0

Views: 493

Answers (2)

hpchavaz
hpchavaz

Reputation: 1388

  1. Alignment can be obtained by setting A as the index.

  2. Using a list of dataframes is not appealing as this can take a lot of memory.

    • solution 1: Build the other dataframes in the loop
    import pandas as pd
    
    path = r"/Users/.../folder/"
    all_files = glob.glob(path + "/*.CSV")
    
    df = pd.read_csv(all_files[0], sep=',').set_index('A')
    for f in all_files[1:]:
        dfs = pd.read_csv(f, sep=',').set_index('A')
        df = pd.concat([df, dfs], axis=1)
    
    • solution 2: Keep the generator and use functools.reduce
    import pandas as pd
    from functools import reduce
    
    path = r"/Users/.../folder/"
    all_files = glob.glob(path + "/*.CSV")
    
    def_gen = (pd.read_csv(io.StringIO(fn), sep='\s+').set_index('A') for fn in all_files)
    df = reduce(lambda df, d: pd.concat([df, d], axis=1), def_gen)
    

    df:

            B    C    D    E    F    G
    A                                 
    1665  0.7  0.2  0.4  1.0  0.4  1.0
    1980  1.0  0.9  0.8  0.4  0.8  0.6
    1982  0.6  1.0  0.2  0.3  1.0  0.6
    2003  0.9  0.8  0.2  0.1  0.3  0.9
    

Personally, I would take the easy path ("solution 1") and add some logging to identify where there will be an import error. Because in real world data is rarely clean and well formatted.

Upvotes: 0

user17242583
user17242583

Reputation:

Instead of using concat, you should merge each dataframe together:

df = all_csv[0]
for csv in all_csv[1:]:
    df = df.merge(csv)

Output:

>>> df
      A    B    C    D    E    F    G
0  1980  1.0  0.9  0.8  0.4  0.8  0.6
1  2003  0.9  0.8  0.2  0.1  0.3  0.9
2  1665  0.7  0.2  0.4  1.0  0.4  1.0
3  1982  0.6  1.0  0.2  0.3  1.0  0.6

Note: you need to make all_csv a list instead of a generator:

all_csv = [pd.read_csv(f, sep=',') for f in all_files]

Upvotes: 1

Related Questions