Reputation: 7441
I'm trying to use pandas to manipulate a .csv file but I get this error:
pandas.parser.CParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 12
I have tried to read the pandas docs, but found nothing.
My code is simple:
path = 'GOOG Key Ratios.csv'
#print(open(path).read())
data = pd.read_csv(path)
How can I resolve this? Should I use the csv module or another language?
Upvotes: 711
Views: 2222293
Reputation: 8996
I've had this problem a few times myself. Almost every time, the reason is that the file I was attempting to open was not a properly saved CSV to begin with. And by "properly", I mean each row had the same number of separators or columns.
Typically it happened because I had opened the CSV in Excel then improperly saved it. Even though the file extension was still .csv, the pure CSV format had been altered.
Any file saved with pandas to_csv will be properly formatted and shouldn't have that issue. But if you open it with another program, it may change the structure.
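A quick way to check this is to count the fields in each line yourself; a minimal sketch (it splits naively on commas, ignoring quoting, and the file name is just a placeholder):
from collections import Counter

# count how many comma-separated fields each line has
with open('data.csv') as f:
    field_counts = Counter(len(line.split(',')) for line in f)

# a properly formed CSV should show a single field count here
print(field_counts)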
Upvotes: 13
Reputation: 51
I had this problem when dealing with a .csv file that I had generated myself (let's call it data.csv), consisting of a bunch of strings I had scraped from another dataset. Using:
df = pd.read_csv("data.csv", header = 0, sep = ",")
I realized the problem was with the data itself. Some of the scraped strings contained commas, so pandas would find more fields than expected based on the header and yield this error. I explicitly removed these commas when generating data.csv.
Other strings contained newline (\n) characters, which again threw pandas off. For example, according to the header, each line should have 10 fields, but when the 5th field contains a newline, this yields two lines with 5 and 6 fields, respectively, instead of one line with the correct number of fields. This obviously confuses pandas as well, causing the error, so I also explicitly removed any instances of \n when generating data.csv. Hope this helps.
Upvotes: 0
Reputation: 2503
Let's imagine that we have some crappy CSV file and we do not want to drop the error lines, because there are plenty of them. In that case, we can drop the extra columns that cause the error and load the CSV after that:
# in that case, some rows in crappy.csv have more than 3 columns
# and we want only 3 of them
cat crappy.csv | cut -d',' --fields=1,2,3 > good.csv
Upvotes: 0
Reputation: 18648
You could also try:
data = pd.read_csv('file1.csv', on_bad_lines='skip')
Do note that this will cause the offending lines to be skipped. If you don't expect many bad lines and want to at least know how many there are (and which ones), use on_bad_lines='warn'. For more advanced handling of bad lines, you can pass a callable.
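For example, a minimal sketch of the callable form (requires pandas 1.4+ and the Python engine; the handler below just records the bad line and drops it, and the file name is a placeholder):
import pandas as pd

bad_rows = []

def handle_bad_line(bad_line):
    # called once per offending line (a list of fields); returning None drops the line
    bad_rows.append(bad_line)
    return None

# the callable form of on_bad_lines requires engine='python' (pandas >= 1.4)
data = pd.read_csv('file1.csv', engine='python', on_bad_lines=handle_bad_line)
print(f"skipped {len(bad_rows)} bad lines")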
Edit
For Pandas < 1.3.0 try
data = pd.read_csv("file1.csv", error_bad_lines=False)
as per pandas API reference.
Upvotes: 1063
Reputation: 679
I also faced the same error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 16 fields in line 1680328, saw 17
In my case, after a certain number of rows "N" the number of columns permanently changes from 16 to 17, which is what causes this error. The workaround in this situation is to explicitly specify nrows, setting it equal to N-2 when reading the csv into a pandas DataFrame, as follows:
df = pd.read_csv(<your_file_name>, nrows=1680326)
In the error, my N=1680328, therefore I set nrows to N-2, which is 1680326. This solved my issue in the scenario where the number of columns changes permanently after a certain number of rows.
Upvotes: 0
Reputation: 23
First use dos2unix to clean the file. After that, you can optionally add the column names.
Upvotes: 0
Reputation: 1699
I was just missing a comma after the last column title. When I added one, it fixed the issue. Hope it helps someone.
original:
col1, col2, col3
fixed version:
col1, col2, col3,
Upvotes: 0
Reputation: 3674
Check if you are loading the csv with the correct separator.
df = pd.read_csv(csvname, header=0, sep=",")
Upvotes: 3
Reputation: 54
In my case the problem was the pandas version; pandas 1.3.5 works like a charm.
Upvotes: 0
Reputation: 121
You can use:
pd.read_csv("mycsv.csv", delimiter=";")
(Pandas 1.4.4)
The problem can be the delimiter of your file: open it as a text file and look for the delimiter. You may then end up with columns that are empty and unnamed because of rows that contain far too many delimiters.
Therefore, you can handle them with pandas by checking their values; for me, this is better than skipping lines. A minimal sketch of that cleanup follows.
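A rough sketch of that kind of cleanup, assuming the extra empty columns show up as 'Unnamed: N' (the file name comes from the snippet above; the rest is illustrative):
import pandas as pd

df = pd.read_csv("mycsv.csv", delimiter=";")

# drop the auto-generated 'Unnamed: N' columns that appear when some rows
# have more delimiters than the header
df = df.loc[:, ~df.columns.astype(str).str.startswith("Unnamed")]

# alternatively, drop columns that are entirely empty
df = df.dropna(axis=1, how="all")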
Upvotes: 3
Reputation: 1704
I came across multiple solutions for this issue, and lots of folks have given good explanations in the other answers. But for beginners, I think the two methods below will be enough:
import pandas as pd
#Method 1
data = pd.read_csv('file1.csv', error_bad_lines=False)
#Note that this will cause the offending lines to be skipped.
#Method 2 using sep
data = pd.read_csv('file1.csv', sep='\t')
Upvotes: 5
Reputation: 459
The issue is with the delimiter. Find what kind of delimiter is used in your data and specify it like below:
data = pd.read_csv('some_data.csv', sep='\t')
Upvotes: 4
Reputation: 5931
It might be an issue with the delimiters in your data, or with the header row, or both.
To solve it, try specifying the sep and/or header arguments when calling read_csv. For instance:
df = pandas.read_csv(filepath, sep='delimiter', header=None)
In the code above, sep
defines your delimiter and header=None
tells pandas that your source data has no row for headers / column titles. Thus saith the docs: "If file contains no header row, then you should explicitly pass header=None". In this instance, pandas automatically creates whole-number indices for each field {0,1,2,...}.
According to the docs, the delimiter thing should not be an issue. The docs say that "if sep is None [not specified], will try to automatically determine this." I however have not had good luck with this, including instances with obvious delimiters.
Another solution may be to let the csv module automatically detect the delimiter:
import csv
import pandas as pd

with open(filepath) as csv_file:
    # use the first 2 lines of the file to detect the separator
    temp_lines = csv_file.readline() + '\n' + csv_file.readline()
    dialect = csv.Sniffer().sniff(temp_lines, delimiters=';,')

    # remember to go back to the start of the file for the next time it's read
    csv_file.seek(0)
    df = pd.read_csv(csv_file, sep=dialect.delimiter)
Upvotes: 215
Reputation: 129
This looks ugly, but you will have your dataframe:
import re
import pandas as pd

path = 'GOOG Key Ratios.csv'
try:
    data = pd.read_csv(path)
except Exception as e:
    # pull the number of expected fields out of the error message and skip that many rows minus one
    val = re.findall(r'tokenizing.{1,100}\s*Expected\s*(\d{1,2})\s*', str(e), re.I)
    data = pd.read_csv(path, skiprows=int(val[0]) - 1)
Upvotes: 1
Reputation: 225
Sometimes a cell contains a comma ",", and because of that pandas can't read it. Try ";" as the delimiter:
df = pd.read_csv(r'yourpath', delimiter=";")
Upvotes: 3
Reputation: 308
I believe the solutions engine='python' and error_bad_lines=False are good if the extra columns are dummy columns and you want to delete them. In my case, the second row really had more columns, and I wanted those columns to be integrated so that the number of columns equals MAX(columns).
Please refer to the solution below, which I have not seen anywhere else:
try:
    df_data = pd.read_csv(PATH, header=bl_header, sep=str_sep)
except pd.errors.ParserError as err:
    # parse the actual column count out of the error message ("... saw 17")
    str_find = 'saw '
    int_position = int(str(err).find(str_find)) + len(str_find)
    str_nbCol = str(err)[int_position:]
    l_col = range(int(str_nbCol))
    df_data = pd.read_csv(PATH, header=bl_header, sep=str_sep, names=l_col)
Upvotes: 6
Reputation: 504
In my case the separator was not the default "," but Tab.
pd.read_csv('file_name.csv', sep='\\t', lineterminator='\\r', engine='python', header='infer')
Note: "\t" did not work as suggested by some sources. "\\t" was required.
Upvotes: 5
Reputation: 400
Error tokenizing data. C error: Expected 2 fields in line 3, saw 12
The error gives a clue to solving the problem: "Expected 2 fields in line 3, saw 12" means that line 3 has 12 fields, while the earlier lines led pandas to expect only 2.
When you have data like the one shown below, skipping rows would discard most of the data:
data = """1,2,3
1,2,3,4
1,2,3,4,5
1,2
1,2,3,4"""
If you don't want to skip any rows, do the following:
# First let's find the maximum number of columns over all the rows
with open("file_name.csv", 'r') as temp_f:
    # get the number of columns in each line
    col_count = [len(l.split(",")) for l in temp_f.readlines()]

### Generate column names (names will be 0, 1, 2, ..., maximum columns - 1)
column_names = [i for i in range(max(col_count))]
import pandas as pd
# alternatively, hard-code the number you see in the error,
# e.g. names=range(8) for "Expected 4 fields in line 2, saw 8"
data = pd.read_csv("file_name.csv", header=None, names=column_names)
Use range instead of manually setting names as it will be cumbersome when you have many columns.
Additionally, you can fill the NaN values with 0 if you need every row to have the same length, e.g. for clustering (k-means):
new_data = data.fillna(0)
Upvotes: 10
Reputation: 13593
In my case, it was because the format of the first and last two lines of the csv file was different from the middle content of the file.
So what I do is open the csv file as a string, parse the content of the string, then use read_csv
to get a dataframe.
import io
import pandas as pd

with open(f'{file_path}/{file_name}', 'r') as file:
    content = file.read()

# change the newline characters from '\r\n' to '\n'
lines = content.replace('\r', '').split('\n')

# Remove the first and last 2 lines of the file;
# StringIO can be considered as a file stored in memory
df = pd.read_csv(io.StringIO("\n".join(lines[2:-2])), header=None)
Upvotes: 1
Reputation: 804
As far as I can tell, and after taking a look at your file, the problem is that the csv file you're trying to load has multiple tables. There are empty lines, or lines that contain table titles. Try to have a look at this Stackoverflow answer. It shows how to achieve that programmatically.
Another dynamic approach would be to use the csv module, read every row one at a time and run sanity checks / regular expressions to infer whether the row is a title, header, values, or blank line. This approach has one more advantage: you can split/append/collect your data in Python objects as desired. A rough sketch of that idea follows.
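A minimal sketch of that kind of block splitting, assuming the sub-tables are separated by blank lines and each block starts with its own header row (the heuristics are illustrative, not taken from your file):
import csv
import pandas as pd

tables, current = [], []
with open('GOOG Key Ratios.csv', newline='') as f:
    for row in csv.reader(f):
        if not any(cell.strip() for cell in row):
            # blank line: close off the current block if it holds any rows
            if current:
                tables.append(pd.DataFrame(current[1:], columns=current[0]))
                current = []
        else:
            current.append(row)
if current:
    tables.append(pd.DataFrame(current[1:], columns=current[0]))

# 'tables' now holds one DataFrame per block found in the file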
The easiest of all would be to use pandas function pd.read_clipboard()
after manually selecting and copying the table to the clipboard, in case you can open the csv in excel or something.
Irrelevant:
Additionally, this is irrelevant to your problem, but because no one has mentioned it: I had this same issue when loading some datasets such as seeds_dataset.txt from UCI. In my case, the error was occurring because some separators had more whitespace than a true tab \t. See line 3 in the following, for instance:
14.38 14.21 0.8951 5.386 3.312 2.462 4.956 1
14.69 14.49 0.8799 5.563 3.259 3.586 5.219 1
14.11 14.1 0.8911 5.42 3.302 2.7 5 1
Therefore, use \t+ in the separator pattern instead of \t.
data = pd.read_csv(path, sep='\t+', header=None)
Upvotes: 7
Reputation: 192
I have encountered this error with a stray quotation mark. I use mapping software which will put quotation marks around text items when exporting comma-delimited files. Text which uses quote marks (e.g. ' = feet and " = inches) can be problematic when they induce delimiter collisions. Consider this example, which notes that a 5-inch well log print is poor:
UWI_key,Latitude,Longitude,Remark
US42051316890000,30.4386484,-96.4330734,"poor 5""
Using 5"
as shorthand for 5 inch
ends up throwing a wrench in the works. Excel will simply strip off the extra quote mark, but Pandas breaks down without the error_bad_lines=False
argument mentioned above.
Upvotes: 1
Reputation: 4753
For those who are having a similar issue with Python 3 on a Linux OS:
pandas.errors.ParserError: Error tokenizing data. C error: Calling
read(nbytes) on source failed. Try engine='python'.
Try:
df = pd.read_csv('file.csv', encoding='utf8', engine='python')
Upvotes: 6
Reputation: 127
Simple resolution: open the csv file in Excel and save it under a different name, still in csv format. Try importing it again in Spyder; your problem will be resolved!
Upvotes: 4
Reputation: 1491
The following worked for me (I posted this answer, because I specifically had this problem in a Google Colaboratory Notebook):
df = pd.read_csv("/path/foo.csv", delimiter=';', skiprows=0, low_memory=False)
Upvotes: 18
Reputation:
The dataset that I used had a lot of quote marks (") used outside of the actual formatting. I was able to fix the error by including this parameter for read_csv():
:
quoting=3 # 3 correlates to csv.QUOTE_NONE for pandas
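In context, a minimal sketch (the file name is a placeholder; using the named constant instead of the literal 3):
import csv
import pandas as pd

# QUOTE_NONE tells the parser to treat quote characters as ordinary data
df = pd.read_csv('data.csv', quoting=csv.QUOTE_NONE)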
Upvotes: 9
Reputation: 103
Most of the useful answers have already been given; however, I suggest saving pandas DataFrames as a parquet file. Parquet files don't have this problem, and they are memory efficient at the same time.
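A minimal sketch of the round trip (this assumes a parquet engine such as pyarrow is installed; the file name and data are placeholders):
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
df.to_parquet('data.parquet')         # write instead of to_csv
df = pd.read_parquet('data.parquet')  # read back without any tokenizing issues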
Upvotes: 1
Reputation: 192
I have encountered this error with a stray quotation mark. I use mapping software which will put quotation marks around text items when exporting comma-delimited files. Text which uses quote marks (e.g. ' = feet and " = inches) can be problematic. Consider this example which notes that a 5-inch well log print is poor:
UWI_key,Latitude,Longitude,Remark
US42051316890000,30.4386484,-96.4330734,"poor 5""
Using 5"
as shorthand for 5 inch
ends up throwing a wrench in the works. Excel will simply strip off the extra quote mark, but Pandas breaks down without the error_bad_lines=False
argument mentioned above.
Once you know the nature of your error, it may be easiest to do a Find-Replace from a text editor (e.g., Sublime Text 3 or Notepad++) prior to import.
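If you would rather script that find-replace, a minimal sketch (this assumes the only doubled quotes in the file are these stray inch marks; the file names are placeholders):
# collapse the stray inch mark so the field ends with a single closing quote
with open('wells.csv') as src, open('wells_clean.csv', 'w') as dst:
    dst.write(src.read().replace('""', '"'))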
Upvotes: 0
Reputation: 3857
The issue for me was that a new column was appended to my CSV intraday. The accepted answer's solution would not work, as every future row would be discarded if I used error_bad_lines=False.
The solution in this case was to use the usecols parameter in pd.read_csv()
. This way I can specify only the columns that I need to read into the CSV and my Python code will remain resilient to future CSV changes so long as a header column exists (and the column names do not change).
usecols : list-like or callable, optional Return a subset of the columns. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s). For example, a valid list-like usecols parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Element order is ignored, so usecols=[0, 1] is the same as [1, 0]. To instantiate a DataFrame from data with element order preserved use pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']] for columns in ['foo', 'bar'] order or pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']] for ['bar', 'foo'] order.
my_columns = ['foo', 'bar', 'bob']
df = pd.read_csv(file_path, usecols=my_columns)
Another benefit of this is that I can load way less data into memory if I am only using 3-4 columns of a CSV that has 18-20 columns.
Upvotes: 2
Reputation: 903
I had this problem, where I was trying to read in a CSV without passing in column names.
df = pd.read_csv(filename, header=None)
I specified the column names in a list beforehand and then passed them into names, and it solved the problem immediately. If you don't have set column names, you could just create as many placeholder names as the maximum number of columns that might be in your data.
col_names = ["col1", "col2", "col3", ...]
df = pd.read_csv(filename, names=col_names)
Upvotes: 47