Reputation: 121
eno,ename,
101,'sam',
102,'bill',
eno,ename,
103,'jack',
eno,ename,
104,'pam',
I have a huge .csv file in which column names reappear after certain number of rows. is there a way in python to split such data into multiple files as soon as it encounter the "repeated column names"?
I would like the above data to be in 3 separate .csv files since the same column names appear 3 times.
Upvotes: 0
Views: 345
Reputation: 548
One way of doing it is to save the headers to a variable, and then when reading the file check if the current row matches the header. If it does, increment a counter that can be used to determine which file to write to.
import csv
HEADERS = next(csv.reader(open('data.csv')))
print(HEADERS)
with open('data.csv') as f:
reader = csv.reader(f)
file_name_counter = 0
for row in reader:
if row == HEADERS:
file_name_counter += 1
with open(f'data{file_name_counter}.csv', ('w' if row == HEADERS else "a"), newline="") as f:
writer = csv.writer(f)
writer.writerow(row)
NOTE: I believe the newline=""
argument is necessary on Windows, as otherwise csv.writer()
will add an extra new line between each entry.
Upvotes: 1
Reputation: 11657
I know you asked for Python, but there are some questions that just cry out for the power of AWK :)
awk '/eno,ename/{x="F"++i ".csv";}{print > x;}' input.csv
Upvotes: 2
Reputation: 1365
Challenging! Here's my solution. There is likely a more straightforward way to do this though.
with open("./file.csv", "r") as readfile:
file_number = 0
current_line_no = 0
tmpline = None
for line in readfile:
# count which file you're on. Also use write mode "W" if the first line. Else append.
with open(f"./writefile{file_number}.csv", ("w" if current_line_no == 0 else "a")) as writefile:
# check if the "headers" are appearing and if the current file has more than 1 line.
# Not sure if the header check is the best for your use case. Maybe regex is best here.
if current_line_no != 0 and ("eno" in line and "ename" in line):
file_number += 1 # increment to next file
current_line_no = 0 # reset file number
tmpline = line # remember the "current line". This needs to be added to next file.
continue # continue to next line in readfile
# if there is a templine from previous, add it to this as header.
if tmpline is not None:
writefile.write(tmpline)
tmpline = None
# write the line and increment to new line
writefile.write(line)
current_line_no += 1
I've tried to comment as best as possible. The code basically opens the files one by one as it loops through the lines of the readfile
. When it reads the contents it checks if the current line is a "header". Here I simply checked if "eno" and "ename" are in the line, but there is probably a better approach for your use case. If the current line is a header, then you need to close the current file and open a new one. Hopefully this helps!
Upvotes: 2