Reputation: 693
My question is not how to open a .csv file, detect which rows I want to omit, and write a new .csv file with my desired lines. I'm already doing that successfully:
def sanitize(filepath): #Removes header information, leaving only column names and data. Outputs "sanitized" file.
with open(filepath) as unsan, open(dirname + "/" + newname + '.csv', 'w', newline='') as san:
writer = csv.writer(san)
line_count = 0
headingrow = 0
datarow = 0
safety = 1
for row in csv.reader(unsan, delimiter=','):
#Detect data start
if "DATA START" in str(row):
safety = 0
headingrow = line_count + 1
datarow = line_count + 4
#Detect data end
if "DATA END" in str(row):
safety = 1
#Write data
if safety == 0:
if line_count == headingrow or line_count >= datarow:
writer.writerow(row)
line_count += 1
I have .csv data files that are megabytes, sometimes gigabytes (up to 4Gb) in size. Out of 180,000 lines in each file, I only need to omit about 50 lines.
Example pseudo-data (rows I want to keep are indented):
[Header Start]
...48 lines of header data...
[Header End]
Blank Line
[Data Start]
Row with Column Names
Column Units
Column Variable Type
...180,000 lines of data...
I understand that I can't edit a .csv file as I iterate over it (Learned here: How to Delete Rows CSV in python). Is there a quicker way to remove the header information from the file, like maybe writing the remaining 180,000 lines as a block instead of iterating through and writing each line?
Maybe one solution would be to append all the data rows to a list of lists and then use writer.writerows(list of lists)
instead of writing them one at a time (Batch editing of csv files with Python, https://docs.python.org/3/library/csv.html)? However, wouldn't that mean I'm loading essentially the whole file (up to 4Gb) into my RAM?
UPDATE:
I've got a pandas import working, but when I time it, it takes about twice as long as the code above. Specifically, the to_csv portion takes about 10s for a 26Mb file.
import csv, pandas as pd
filepath = r'input'
with open(filepath) as unsan:
line_count = 0
headingrow = 0
datarow = 0
safety = 1
row_count = sum(1 for row in csv.reader(unsan, delimiter=','))
for row in csv.reader(unsan, delimiter=','):
#Detect data start
if "DATA START" in str(row):
safety = 0
headingrow = line_count + 1
datarow = line_count + 4
#Write data
if safety == 0:
if line_count == headingrow:
colnames = row
line_count +=1
break
line_count += 1
badrows = [*range(0, 55, 1),row_count - 1]
df = pd.read_csv(filepath, names=[*colnames], skiprows=[*badrows], na_filter=False)
df.to_csv (r'output', index = None, header=True)
Here's the research I've done:
Deleting rows with Python in a CSV file
https://intellipaat.com/community/18827/how-to-delete-only-one-row-in-csv-with-python
https://www.reddit.com/r/learnpython/comments/7tzbjm/python_csv_cleandelete_row_function_doesnt_work/
https://nitratine.net/blog/post/remove-columns-in-a-csv-file-with-python/
Delete blank rows from CSV?
Upvotes: 2
Views: 4827
Reputation: 1012
To avoid editing the file, and read the file with headers straight into Python and then into Pandas, you can wrap the file in your own file-like object.
Given an input file called myfile.csv
with this content:
HEADER
HEADER
HEADER
HEADER
HEADER
HEADER
now, some, data
1,2,3
4,5,6
7,8,9
You can read that file in directly using a wrapper class:
import io
class HeaderSkipCsv(io.TextIOBase):
def __init__(self, filename):
""" create an iterator from the filename """
self.data = self.yield_csv(filename)
def readable(self):
""" here for compatibility """
return True
def yield_csv(self, filename):
""" open filename and read past the first empty line
Then yield characters one by one. This reads just one
line at a time in memory
"""
with open(filename) as f:
for line in f:
if line.strip() == "":
break
for line in f:
for char in line:
yield char
def read(self, n=None):
""" called by Pandas with some 'n', this returns
the next 'n' characters since the last read as a string
"""
data = ""
for i in range(n):
try:
data += next(self.data)
except StopIteration:
break
return data
WANT_PANDAS=True #set to False to just write file
if WANT_PANDAS:
import pandas as pd
df = pd.read_csv(HeaderSkipCsv('myfile.csv'))
print(df.head(5))
else:
with open('myoutfile.csv', 'w') as fo:
with HeaderSkipCsv('myfile.csv') as fi:
c = fi.read(1024)
while c:
fo.write(c)
c = fi.read(1024)
which outputs:
now some data
0 1 2 3
1 4 5 6
2 7 8 9
Because Pandas allows any file-like object, we can provide our own! Pandas calls read
on the HeaderSkipCsv
object as it would on any file object. Pandas just cares about reading valid csv data from a file object when it calls read
on it. Rather than providing Pandas with a clean file, we provide it with a file-like object that filters out the data Pandas does not like (i.e. the headers).
The yield_csv
generator iterates over the file without reading it in, so only as much data as Pandas requests is loaded into memory. The first for
loop in yield_csv
advances f
to beyond the first empty line. f
represents a file pointer and is not reset at the end of a for
loop while the file remains open. Since the second for
loop receives f
under the same with
block, it starts consuming at the start of the csv data, where the first for
loop left it.
Another way of writing the first for
loop would be
next((line for line in f if line.isspace()), None)
which is more explicit about advancing the file pointer, but arguably harder to read.
Because we skip the lines up to and including the empty line, Pandas just gets the valid csv data. For the headers, no more than one line is ever loaded.
Upvotes: 1
Reputation: 1012
If it is not important that the file is read in Python, or with a CSV reader/writer, you can use other tools. On *nix you can use sed:
sed -n '/DATA START/,/DATA END/p' myfile.csv > headerless.csv
This will be very fast for millions of lines.
perl
is more multi-platform:
perl -F -lane "print if /DATA START/ .. /DATA END/;" myfile.csv
Upvotes: 1