npool
npool

Reputation: 121

split the data into separate file after encountering a column name

eno,ename,
101,'sam',
102,'bill',
eno,ename,
103,'jack',
eno,ename,
104,'pam',

I have a huge .csv file in which column names reappear after certain number of rows. is there a way in python to split such data into multiple files as soon as it encounter the "repeated column names"?

I would like the above data to be in 3 separate .csv files since the same column names appear 3 times.

Upvotes: 0

Views: 345

Answers (3)

fishcakes
fishcakes

Reputation: 548

One way of doing it is to save the headers to a variable, and then when reading the file check if the current row matches the header. If it does, increment a counter that can be used to determine which file to write to.

import csv

HEADERS = next(csv.reader(open('data.csv')))
print(HEADERS)
with open('data.csv') as f:
    reader = csv.reader(f)
    
    file_name_counter = 0
    for row in reader:
        if row == HEADERS:
            file_name_counter += 1
        
        with open(f'data{file_name_counter}.csv', ('w' if row == HEADERS else "a"), newline="") as f:
            writer = csv.writer(f)
            writer.writerow(row)

NOTE: I believe the newline="" argument is necessary on Windows, as otherwise csv.writer() will add an extra new line between each entry.

Upvotes: 1

Josh Friedlander
Josh Friedlander

Reputation: 11657

I know you asked for Python, but there are some questions that just cry out for the power of AWK :)

awk '/eno,ename/{x="F"++i ".csv";}{print > x;}' input.csv

Upvotes: 2

Challenging! Here's my solution. There is likely a more straightforward way to do this though.

with open("./file.csv", "r") as readfile:
    file_number = 0
    current_line_no = 0
    tmpline = None

    for line in readfile:

        # count which file you're on. Also use write mode "W" if the first line. Else append.
        with open(f"./writefile{file_number}.csv", ("w" if current_line_no == 0 else "a")) as writefile:

            # check if the "headers" are appearing and if the current file has more than 1 line.
            #  Not sure if the header check is the best for your use case. Maybe regex is best here.
            if current_line_no != 0 and ("eno" in line and "ename" in line):
                file_number += 1        # increment to next file
                current_line_no = 0     # reset file number
                tmpline = line          # remember the "current line". This needs to be added to next file.
                continue                # continue to next line in readfile
            
            # if there is a templine from previous, add it to this as header.
            if tmpline is not None:
                writefile.write(tmpline)
                tmpline = None 

            # write the line and increment to new line
            writefile.write(line)
            current_line_no += 1

I've tried to comment as best as possible. The code basically opens the files one by one as it loops through the lines of the readfile. When it reads the contents it checks if the current line is a "header". Here I simply checked if "eno" and "ename" are in the line, but there is probably a better approach for your use case. If the current line is a header, then you need to close the current file and open a new one. Hopefully this helps!

Upvotes: 2

Related Questions