TempleGuard527

Reputation: 693

Quickly Remove Header from Large .csv Files

My question is not how to open a .csv file, detect which rows I want to omit, and write a new .csv file with my desired lines. I'm already doing that successfully:

import csv

def sanitize(filepath):
    #Removes header information, leaving only column names and data.
    #Outputs a "sanitized" file. (dirname and newname are defined elsewhere.)
    with open(filepath) as unsan, open(dirname + "/" + newname + '.csv', 'w', newline='') as san:
        writer = csv.writer(san)
        line_count = 0
        headingrow = 0
        datarow = 0
        safety = 1
        for row in csv.reader(unsan, delimiter=','):

            #Detect data start
            if "DATA START" in str(row):
                safety = 0
                headingrow = line_count + 1
                datarow = line_count + 4

            #Detect data end
            if "DATA END" in str(row):
                safety = 1

            #Write data: keep the column-name row, skip the units and
            #variable-type rows, keep everything from datarow onward
            if safety == 0:
                if line_count == headingrow or line_count >= datarow:
                    writer.writerow(row)

            line_count += 1

I have .csv data files that are megabytes, sometimes gigabytes, in size (up to 4 GB). Out of the ~180,000 lines in each file, I only need to omit about 50.

Example pseudo-data (rows I want to keep are indented):

[Header Start]  
...48 lines of header data...  
[Header End]  
Blank Line  
[Data Start]  
     Row with Column Names  
Column Units  
Column Variable Type  
     ...180,000 lines of data...

I understand that I can't edit a .csv file in place as I iterate over it (learned here: How to Delete Rows CSV in python). Is there a quicker way to remove the header information, like writing the remaining 180,000 lines out as a block instead of iterating through and writing each line?
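To make that concrete, here's roughly what I have in mind (an untested sketch; the function name and file paths are placeholders). It skips the csv module entirely: scan until the data section, keep the column-name row, drop the units and variable-type rows, and stream the rest as raw text:

def strip_header_raw(inpath, outpath):
    #Sketch only: stream raw lines instead of parsing and
    #re-serializing every row with the csv module
    with open(inpath) as src, open(outpath, 'w', newline='') as dst:
        for line in src:                 #scan until the data section
            if "DATA START" in line:
                break
        dst.write(next(src))             #column-name row (kept)
        next(src)                        #units row (skipped)
        next(src)                        #variable-type row (skipped)
        for line in src:                 #bulk of the file: plain copy
            if "DATA END" in line:
                break
            dst.write(line)

If the files never ended with a DATA END marker, the last loop could even be replaced by shutil.copyfileobj(src, dst), which copies the remainder in large chunks with no per-line work at all.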

Maybe one solution would be to append all the data rows to a list of lists and then call writer.writerows(list_of_lists) instead of writing them one at a time (Batch editing of csv files with Python, https://docs.python.org/3/library/csv.html)? However, wouldn't that mean loading essentially the whole file (up to 4 GB) into RAM?
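For what it's worth, writer.writerows accepts any iterable, so passing it a generator would sidestep that: rows are pulled and written one at a time and nothing accumulates in memory. A sketch of that idea (same row logic as my function above; file names are hypothetical):

import csv

def kept_rows(reader):
    #Hypothetical helper: yield only the rows to keep,
    #one at a time, so nothing is held in memory
    safety, headingrow, datarow = 1, 0, 0
    for line_count, row in enumerate(reader):
        if "DATA START" in str(row):
            safety = 0
            headingrow = line_count + 1
            datarow = line_count + 4
        if "DATA END" in str(row):
            safety = 1
        if safety == 0 and (line_count == headingrow or line_count >= datarow):
            yield row

with open('input.csv') as unsan, open('output.csv', 'w', newline='') as san:
    #writerows consumes the generator lazily: constant memory,
    #though it still parses and re-serializes every row
    csv.writer(san).writerows(kept_rows(csv.reader(unsan, delimiter=',')))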

UPDATE:
I've got a pandas-based import working, but when I time it, it takes about twice as long as the code above. In particular, the to_csv step takes about 10 seconds for a 26 MB file.

import csv
import pandas as pd

filepath = r'input'

with open(filepath) as unsan:
    line_count = 0
    headingrow = 0
    safety = 1

    #First pass: count rows (this consumes the file, so rewind afterwards)
    row_count = sum(1 for row in csv.reader(unsan, delimiter=','))
    unsan.seek(0)

    #Second pass: grab the column names from the row after "DATA START"
    for row in csv.reader(unsan, delimiter=','):

        #Detect data start
        if "DATA START" in str(row):
            safety = 0
            headingrow = line_count + 1

        #Grab the column names, then stop reading
        if safety == 0:
            if line_count == headingrow:
                colnames = row
                break

        line_count += 1

badrows = [*range(55), row_count - 1]
df = pd.read_csv(filepath, names=colnames, skiprows=badrows, na_filter=False)

df.to_csv(r'output', index=False, header=True)
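One memory note on the pandas route: for the multi-gigabyte files I'd presumably want chunksize, so only a bounded slice of the file is in RAM at once. A sketch (not timed; reuses colnames and badrows from above):

#Stream the file through pandas in bounded-memory chunks
reader = pd.read_csv(filepath, names=colnames, skiprows=badrows,
                     na_filter=False, chunksize=100_000)
for i, chunk in enumerate(reader):
    chunk.to_csv(r'output', mode='w' if i == 0 else 'a',
                 header=(i == 0), index=False)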

Here's the research I've done:
Deleting rows with Python in a CSV file
https://intellipaat.com/community/18827/how-to-delete-only-one-row-in-csv-with-python
https://www.reddit.com/r/learnpython/comments/7tzbjm/python_csv_cleandelete_row_function_doesnt_work/
https://nitratine.net/blog/post/remove-columns-in-a-csv-file-with-python/
Delete blank rows from CSV?

Upvotes: 2

Views: 4827

Answers (2)

gens

Reputation: 1012

To avoid rewriting the file on disk, you can read it, headers and all, straight into Python and then into Pandas by wrapping the file in your own file-like object.

Given an input file called myfile.csv with this content:

HEADER
HEADER
HEADER
HEADER
HEADER
HEADER

now, some, data
1,2,3
4,5,6
7,8,9

You can read that file in directly using a wrapper class:

import io

class HeaderSkipCsv(io.TextIOBase):
    def __init__(self, filename):
        """ create an iterator from the filename """
        self.data = self.yield_csv(filename)

    def readable(self):
        """ here for compatibility """
        return True

    def yield_csv(self, filename):
        """ open filename and read past the first empty line
        Then yield characters one by one. This reads just one
        line at a time in memory
        """
        with open(filename) as f:
            for line in f:
                if line.strip() == "":
                    break
            for line in f:
                for char in line:
                    yield char

    def read(self, n=None):
        """ called by Pandas with some 'n', this returns
        the next 'n' characters since the last read as a string
        """
        if n is None or n < 0:
            # no size given: drain whatever remains
            return "".join(self.data)
        data = ""
        for i in range(n):
            try:
                data += next(self.data)
            except StopIteration:
                break
        return data

WANT_PANDAS = True  # set to False to just write the file

if WANT_PANDAS:
    import pandas as pd
    df = pd.read_csv(HeaderSkipCsv('myfile.csv'))
    print(df.head(5))
else:
    with open('myoutfile.csv', 'w') as fo:
        with HeaderSkipCsv('myfile.csv') as fi:
            c = fi.read(1024)
            while c:
                fo.write(c)
                c = fi.read(1024)

which outputs:

   now   some   data
0    1      2      3
1    4      5      6
2    7      8      9

Because Pandas accepts any file-like object, we can provide our own! Pandas calls read on the HeaderSkipCsv object exactly as it would on a real file object; all it cares about is getting valid csv data back from each call. Rather than handing Pandas a cleaned file, we hand it a file-like object that filters out the data Pandas does not like (i.e. the headers).
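For instance, even an in-memory buffer works, because all read_csv ultimately needs is an object with a working read method:

import io
import pandas as pd

# any object with read() will do, e.g. a StringIO buffer
print(pd.read_csv(io.StringIO("now, some, data\n1,2,3")))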

The yield_csv generator iterates over the file without reading it in, so only as much data as Pandas requests is loaded into memory. The first for loop in yield_csv advances f to beyond the first empty line. f represents a file pointer and is not reset at the end of a for loop while the file remains open. Since the second for loop receives f under the same with block, it starts consuming at the start of the csv data, where the first for loop left it.
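You can see that shared-position behaviour in isolation with a tiny demo against the same myfile.csv:

with open('myfile.csv') as f:
    for line in f:
        if line.strip() == "":
            break            # f now sits just past the blank line
    print(next(f), end="")   # prints "now, some, data"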

Another way of writing the first for loop would be

next((line for line in f if line.isspace()), None)

which is more explicit about advancing the file pointer, but arguably harder to read.

Because we skip the lines up to and including the empty line, Pandas just gets the valid csv data. For the headers, no more than one line is ever loaded.

Upvotes: 1

gens

Reputation: 1012

If it is not important that the file is read in Python, or with a CSV reader/writer, you can use other tools. On *nix you can use sed:

sed -n '/DATA START/,/DATA END/p' myfile.csv > headerless.csv

This will be very fast for millions of lines.

Perl is more portable across platforms (the -F/-a autosplit switches are unnecessary here, since the fields are never used):

perl -lne "print if /DATA START/ .. /DATA END/;" myfile.csv

Upvotes: 1
