JPcodes
JPcodes

Reputation: 121

How to split the csv based on multiple columns

I am trying to split a csv into multiple files based on two column values. For example,

Source file:

Header1  Header2             Header3
Alpha    energy              0.1
Alpha    energy              0.34
Beta     energy_imbalance    0.66
Beta     energy              0.7
Beta     energy              0.1
Gamma    energy_imbalance    0.3

Expected output:

Outfile1:

Header1  Header2             Header3
Alpha    energy              0.1
Alpha    energy              0.34

Outfile2:

Header1  Header2             Header3
Beta     energy_imbalance    0.66

Outfile3:

Header1  Header2             Header3
Beta     energy              0.7
Beta     energy              0.1

Outfile4:

Header1  Header2             Header3
Gamma    energy_imbalance    0.3

The following is what I started with:

filein = open('test.csv')
csvin = csv.DictReader(filein)

outputs = {}
for row in csvin:
    primaryValue = row['Header1']
    secondaryValue = row['Header2']
    if primaryValue not in outputs:
        fileout = open('{}_{}.csv'.format(primaryValue,secondaryValue),'w')
        dw = csv.DictWriter(fileout, fieldnames=csvin.fieldnames)
        dw.writeheader()
        outputs[primaryValue] = fileout, dw
    outputs[primaryValue][1].writerow(row)

for fileout, _ in outputs.values():
    fileout.close()

I was able to split the file based on column = Header1, however I am not sure how to proceed further.

Upvotes: 3

Views: 1085

Answers (3)

c_48
c_48

Reputation: 223

Using pandas df.groupby() is another option to split a csv based on multiple column values.

Working example:

import pandas as pd
df = pd.read_csv('test.csv')
def df_to_grouped_csv(df):
    df_group = df.groupby(['Header1', 'Header2'])
    for name, group in df_group:
        outfile = '_'.join(name) + '.csv'
        group.to_csv(outfile, index=False)

Output:

Alpha_energy.csv
  Header1 Header2  Header3
0   Alpha  energy     0.10
1   Alpha  energy     0.34
Beta_energy.csv
  Header1 Header2  Header3
3    Beta  energy      0.7
4    Beta  energy      0.1
Beta_energy_imbalance.csv
  Header1           Header2  Header3
2    Beta  energy_imbalance     0.66
Gamma_energy_imbalance.csv
  Header1           Header2  Header3
5   Gamma  energy_imbalance      0.3

In terms of performance this should show an improvement as compared to the csv.DictWriter approach (particularly for large files). But it does require importing pandas.

Performance:

Larger file (500,000 rows)
In [1]: %timeit df_to_grouped_csv()
865 ms ± 36.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [2]: %timeit csv_DictWriter_approach()
2.71 s ± 40.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Upvotes: 1

martineau
martineau

Reputation: 123531

Here's how to implement in a manner along the lines of what @Barmar's suggested (i.e. using the two column values as a dictionary key). As shown, the key is used to look-up which csv.DictWriter instance that gets used to write the row — creating new ones as necessary. It also closes all the associated files that were opened at the end by using a separate list that keesp track of those.

import csv

infile_name = 'multicol_test.csv'

with open(infile_name, newline='') as infile:
    csv_writers = {}
    files = []
    reader = csv.DictReader(infile)

    for row in reader:
        if (key := f"{row['Header1']}_{row['Header2']}") not in csv_writers:
            # Create the csv file and a corresponding DictWriter.
            outfile_name = f'{key}.csv'
            fileout = open(outfile_name, 'w', newline='')
            files.append(fileout)  # To have it closed later.
            writer = csv.DictWriter(fileout, fieldnames=reader.fieldnames)
            writer.writeheader()
            csv_writers[key] = writer

        # Write the line to corresponding csv writer.
        csv_writers[key].writerow(row)

    # Close all CSV output files.
    for f in files:
        f.close()

Applied to the sample input file, this would produce the following csv output files:

Alpha_energy.csv
Beta_energy.csv
Beta_energy_imbalance.csv
Gamma_energy_imbalance.csv

with the data in them you expect.

Upvotes: 1

Charif DZ
Charif DZ

Reputation: 14751

Here try this:

csvin = csv.DictReader(filein)
csv_files = {}
files = []

for row in csvin:
    key = (row['Header1'], row['Header2'])
    if key not in csv_files:
        # create the csv file
        fileout = open('{}_{}.csv'.format(*key), 'w')
        dw = csv.DictWriter(fileout, fieldnames=csvin.fieldnames)
        dw.writeheader()
        csv_files[key] = dw
        files.append(fileout)  # to close them later

    # write the line into to corresponding csv writer
    csv_files[key].writerow(row)

# close all files
for f in files: f.close()

Upvotes: 1

Related Questions