bo_

Reputation: 81

ParserError: Error tokenizing data. C error: EOF inside string starting at row 110994

I have a set of large CSV files, but reading them fails with "Error tokenizing data".

import glob  
import pandas as pd

path = "/Users/LAI/Downloads/learn/engagement_data"  
all_files = glob.glob(path + "/*.csv")  
print(all_files)

all_csv = [ ]  
for filename in all_files:  
    df = pd.read_csv(filename, index_col=None, header=0, sep=',')  
    all_csv.append(df)

engagement_df = pd.concat(all_csv, axis=0, ignore_index=True)

[Image: output of print(all_files)]

Upvotes: 1

Views: 5444

Answers (1)

randomdatascientist

Reputation: 611

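There's probably an error in one of the CSV files you are reading. "EOF inside string" usually means a quoted field was opened but never closed, so the parser reaches the end of the file while still inside it.
Try using a print statement to figure out which file it is: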

import glob  
import pandas as pd

path = "/Users/LAI/Downloads/learn/engagement_data"  
all_files = glob.glob(path + "/*.csv")  
print(all_files)

all_csv = [ ]  
for filename in all_files:
    try:  
        df = pd.read_csv(filename, index_col=None, header=0, sep=',')  
        all_csv.append(df)
    except Exception as e:
        print(f"Problem file: {filename} caused Exception: {e}")
        raise

engagement_df = pd.concat(all_csv, axis=0, ignore_index=True)

Alternatively you can try changing the parser "engine" to the Python engine (as documented in this blog):

import glob  
import pandas as pd

path = "/Users/LAI/Downloads/learn/engagement_data"  
all_files = glob.glob(path + "/*.csv")  
print(all_files)

all_csv = [ ]  
for filename in all_files:  
    df = pd.read_csv(filename, index_col=None, header=0, sep=',', engine='python')  
    all_csv.append(df)

engagement_df = pd.concat(all_csv, axis=0, ignore_index=True)

But it would be better practice to find the problematic CSV file and fix it. You could also combine the two solutions with something like:

import glob  
import pandas as pd

path = "/Users/LAI/Downloads/learn/engagement_data"  
all_files = glob.glob(path + "/*.csv")  
print(all_files)

all_csv = [ ]  
for filename in all_files:
    try:  
        df = pd.read_csv(filename, index_col=None, header=0, sep=',')  
        all_csv.append(df)
    except pd.errors.ParserError as e:
        # Report the file first, then retry with the slower Python engine
        print(f"Problem file: {filename} caused Exception: {e}")
        df = pd.read_csv(filename, index_col=None, header=0, sep=',', engine='python')
        all_csv.append(df)

engagement_df = pd.concat(all_csv, axis=0, ignore_index=True)

Or simply skip that file if it's OK to be missing data:

import glob  
import pandas as pd

path = "/Users/LAI/Downloads/learn/engagement_data"  
all_files = glob.glob(path + "/*.csv")  
print(all_files)

all_csv = [ ]  
for filename in all_files:
    try:  
        df = pd.read_csv(filename, index_col=None, header=0, sep=',')  
        all_csv.append(df)
    except pd.errors.ParserError as e:
        # Report the file and move on without adding it to all_csv
        print(f"Problem file: {filename} caused Exception: {e}")

engagement_df = pd.concat(all_csv, axis=0, ignore_index=True)
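
If you do want to find and fix the problematic file, here is a rough sketch for locating where the unclosed quote starts. It assumes the files are UTF-8 and use the double quote as the quote character. The helper name find_unclosed_quote_start is made up for this example and is not part of pandas. It is only a heuristic: it counts quote characters per physical line, so the line number it reports may not match the "row" in the pandas message exactly.

```python
def find_unclosed_quote_start(filename, quotechar='"', encoding="utf-8"):
    """Return the 1-based physical line where a still-open quoted field
    began, or None if every quoted field is closed by the end of the file.

    Hypothetical helper (not a pandas API). Heuristic: a line with an odd
    number of quote characters flips the "inside a quoted field" state.
    Escaped quotes ("") come in pairs, so they do not flip it.
    """
    inside = False
    opened_at = None
    # newline="" keeps line endings untouched, as the csv docs recommend
    with open(filename, newline="", encoding=encoding) as f:
        for lineno, line in enumerate(f, start=1):
            if line.count(quotechar) % 2 == 1:
                inside = not inside
                opened_at = lineno if inside else None
    return opened_at
```

You could call it on every entry in all_files and open any file that does not return None at the reported line to repair the stray quote.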

Upvotes: 1
