Sebastian Goslin

Reputation: 497

Reading in 54 GB dataset with pandas

I'm currently trying to get a sample of a very large dataset, ~54 GB. Since I know that reading in anything larger than about 1 GB becomes very inefficient, I only want to read in the first 100k rows. This is what I have so far:

    import pandas as pd

    df_chunk = pd.read_csv(r'pol.csv', chunksize=1000000, engine='python')

    chunk_list = []  # append each chunk df here

    # Each chunk is in df format
    for chunk in df_chunk:
        # Once the data filtering is done, append the chunk to the list
        chunk_list.append(chunk)

    # concat the list into a dataframe
    df_concat = pd.concat(chunk_list)

However running this gives me this error:

 File "path", line 3121, in _get_lines new_rows.append(next(self.data)) _csv.Error: ',' expected after '"'

Changing the engine to C throws a parsing error, setting low_memory=False doesn't work with the python engine, and setting error_bad_lines=False skips way too many rows from the dataset.
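
Roughly what I have tried (a sketch; the exact arguments may have differed slightly):

    # each variant fails in the way described above
    df_chunk = pd.read_csv(r'pol.csv', chunksize=1000000, engine='c')        # C engine: parsing error
    df_chunk = pd.read_csv(r'pol.csv', chunksize=1000000, engine='python',
                           low_memory=False)                                 # rejected by the python engine
    df_chunk = pd.read_csv(r'pol.csv', chunksize=1000000, engine='python',
                           error_bad_lines=False)                            # skips far too many rows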

I just need a small chunk of the dataset to work with, but it's extremely hard to even get just that.

Upvotes: 0

Views: 160

Answers (1)

ilmiacs

Reputation: 2576

There seem to be some formatting issues in your big CSV. I suggest you first make a smaller file with just a fraction of the data and inspect it manually for the formatting issues: they need to be fixed before the file can be parsed successfully. To extract a portion, do

with open('pol.csv') as f:
    with open('pol_part.csv','w') as g:
        for i in range(1000): # replace 1000 with 100000 when ready
            g.write(f.readline())
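
Once the sample parses cleanly and you know what has to be fixed, note that read_csv can also load just the first 100k rows of the full file directly via its nrows argument; any other parsing options will depend on what your inspection turns up:

import pandas as pd

# parse only the first 100,000 rows; the rest of the 54 GB file is not read
df = pd.read_csv('pol.csv', nrows=100000)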

Upvotes: 2
