zhl
zhl

Reputation: 3

Merge 10000 csv in different folders while keeping sorted

I have 10,000 folders (test1, test2,..., test10000), each folder contains the same five .csv files (real.csv, model.csv, run.csv, swe.csv, error.csv), but the values in each are different.

I need to combine each csv with the same filename into 1 csv, i.e. produce a concatenated .csv (e.g. real.csv) with all the data from the 10,000 .csv (real.csv) files. AND I need the concatenated .csv to be sorted, i.e. I need row 1 to be from test1, row 2 to be from test2,..., row 10,000 from test10000.

I have used the code from here as blueprint and edited it to sort. Merge multiple csv files with same name in 10 different subdirectory

import pandas as pd
import glob

concat_dir = '/home/zachary/workspace/lineartransfer/error/files_concat/'

files = pd.DataFrame([file for file in glob.glob("/home/zachary/workspace/lineartransfer/error/*/*")], columns=["fullpath"])

# Split the full path into directory and filename
files_split = files['fullpath'].str.rsplit("/", 1, expand=True).rename(columns={0: 'path', 1: 'filename'})

# Join these into one DataFrame
files = files.join(files_split)

# Iterate over unique filenames; read CSVs, concat DFs, save file
for f in files['filename'].unique():
    paths = files[files['filename'] == f]['fullpath'] # Get list of fullpaths from unique filenames
    dfs = [pd.read_csv(path, header=None) for path in sorted(paths)] # Get list of dataframes from CSV file paths
    concat_df = pd.concat(dfs) # Concat dataframes into one
    concat_df.to_csv(concat_dir + f) # Save dataframe

The above code works, but I get sorting of: 1 10 100 1000 10000 1001 1002 ... 102 1020 1021 ...

I need sorting: 1 2 3 ... 10000

Thanks in advance.

Upvotes: 0

Views: 86

Answers (1)

Colin Pearse
Colin Pearse

Reputation: 36

Here is an numeric-alpha sort which takes into account numbers and their values, even when they are embedded in a string.

from functools import cmp_to_key

def nasort(x, y):
    fx = re.sub(r'(\d+)', r'{:099d}', x)
    fy = re.sub(r'(\d+)', r'{:099d}', y)
    ax = map(int, re.sub(r'([^\d]+)', r' ', x).strip().split(' '))
    ay = map(int, re.sub(r'([^\d]+)', r' ', y).strip().split(' '))
    _x = fx.format(*ax)
    _y = fy.format(*ay)
    if   str(_x) > str(_y): return 1
    elif str(_x) < str(_y): return -1
    else: return 0

print (sorted(['file5', 'file2', 'file4', 'file1', 'file10']))
print (sorted(['file5', 'file2', 'file4', 'file1', 'file10'], key=cmp_to_key(nasort)))

The first line (output array) is the standard sort.
The second line (output array) is the new sort where file10 comes after file5. ['file1', 'file10', 'file2', 'file4', 'file5'] ['file1', 'file2', 'file4', 'file5', 'file10']

Upvotes: 1

Related Questions