Sebastian Goslin

Reputation: 497

Reading in 54 GB dataset with pandas

I'm currently trying to get a sample of a very large dataset, ~54 GB. Since I know that reading in anything larger than about 1 GB becomes very inefficient, I only want to read in the first 100k rows. This is what I have so far:

    import pandas as pd

    df_chunk = pd.read_csv(r'pol.csv', chunksize=1000000, engine='python')

    chunk_list = []  # append each chunk df here

    # Each chunk is in df format
    for chunk in df_chunk:
        # Once the data filtering is done, append the chunk to the list
        chunk_list.append(chunk)

    # concat the list into a dataframe
    df_concat = pd.concat(chunk_list)

However running this gives me this error:

 File "path", line 3121, in _get_lines new_rows.append(next(self.data)) _csv.Error: ',' expected after '"'

Changing the engine to C throws a parsing error, setting low_memory=False doesn't work with the python engine, and setting error_bad_lines=False skips way too many rows from the dataset.
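
Roughly what I have tried (a sketch; the exact arguments may have differed slightly):

    # each variant fails in the way described above
    df_chunk = pd.read_csv(r'pol.csv', chunksize=1000000, engine='c')        # C engine: parsing error
    df_chunk = pd.read_csv(r'pol.csv', chunksize=1000000, engine='python',
                           low_memory=False)                                 # rejected by the python engine
    df_chunk = pd.read_csv(r'pol.csv', chunksize=1000000, engine='python',
                           error_bad_lines=False)                            # skips far too many rows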

I just need a small chunk of the dataset to work with, but it's extremely hard to even get just that.

Upvotes: 0

Views: 160

Answers (1)

ilmiacs

Reputation: 2576

There seem to be some formatting issues in your big CSV. I suggest you first make a smaller file with just a fraction of the data and inspect it manually for the formatting issues: they need to be fixed before the file can be parsed successfully. To extract a portion, do

with open('pol.csv') as f:
    with open('pol_part.csv','w') as g:
        for i in range(1000): # replace 1000 with 100000 when ready
            g.write(f.readline())
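
Once the sample parses cleanly and you know what has to be fixed, note that read_csv can also load just the first 100k rows of the full file directly via its nrows argument; any other parsing options will depend on what your inspection turns up:

import pandas as pd

# parse only the first 100,000 rows; the rest of the 54 GB file is not read
df = pd.read_csv('pol.csv', nrows=100000)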

Upvotes: 2
