Álvaro V.

Reputation: 39

Google Colab doesn't parse CSV properly, while Jupyter Notebook does

I would like to read a CSV file with pandas in Google Colab. I used the standard call:

train = pd.read_csv("here-I-enter-my-Google-Drive-URL-to-the-file", sep=",", lineterminator='\n', header=0)

but it returns:

ParserError: Error tokenizing data. C error: Expected 90 fields in line 3, saw 228

This is the first time I have tried Google Colab, and I have searched online about this problem. Most people say the cause is probably a stray comma somewhere in the data (besides those separating columns). However, I searched with the 'Find' function in a couple of text editors and spreadsheet readers and found nothing. In fact, every spreadsheet reader reads the file properly, and so does Jupyter Notebook.

Let me copy here the first 10 lines from the file, which can be fully downloaded here:

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,60,RL,65,8450,Pave,NA,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,NA,Attchd,2003,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,NA,NA,NA,0,2,2008,WD,Normal,208500
2,20,RL,80,9600,Pave,NA,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,None,0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,NA,NA,NA,0,5,2007,WD,Normal,181500
3,60,RL,68,11250,Pave,NA,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,NA,NA,NA,0,9,2008,WD,Normal,223500
4,70,RL,60,9550,Pave,NA,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,None,0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,NA,NA,NA,0,2,2006,WD,Abnorml,140000
5,60,RL,84,14260,Pave,NA,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,NA,NA,NA,0,12,2008,WD,Normal,250000
6,50,RL,85,14115,Pave,NA,IR1,Lvl,AllPub,Inside,Gtl,Mitchel,Norm,Norm,1Fam,1.5Fin,5,5,1993,1995,Gable,CompShg,VinylSd,VinylSd,None,0,TA,TA,Wood,Gd,TA,No,GLQ,732,Unf,0,64,796,GasA,Ex,Y,SBrkr,796,566,0,1362,1,0,1,1,1,1,TA,5,Typ,0,NA,Attchd,1993,Unf,2,480,TA,TA,Y,40,30,0,320,0,0,NA,MnPrv,Shed,700,10,2009,WD,Normal,143000
7,20,RL,75,10084,Pave,NA,Reg,Lvl,AllPub,Inside,Gtl,Somerst,Norm,Norm,1Fam,1Story,8,5,2004,2005,Gable,CompShg,VinylSd,VinylSd,Stone,186,Gd,TA,PConc,Ex,TA,Av,GLQ,1369,Unf,0,317,1686,GasA,Ex,Y,SBrkr,1694,0,0,1694,1,0,2,0,3,1,Gd,7,Typ,1,Gd,Attchd,2004,RFn,2,636,TA,TA,Y,255,57,0,0,0,0,NA,NA,NA,0,8,2007,WD,Normal,307000
8,60,RL,NA,10382,Pave,NA,IR1,Lvl,AllPub,Corner,Gtl,NWAmes,PosN,Norm,1Fam,2Story,7,6,1973,1973,Gable,CompShg,HdBoard,HdBoard,Stone,240,TA,TA,CBlock,Gd,TA,Mn,ALQ,859,BLQ,32,216,1107,GasA,Ex,Y,SBrkr,1107,983,0,2090,1,0,2,1,3,1,TA,7,Typ,2,TA,Attchd,1973,RFn,2,484,TA,TA,Y,235,204,228,0,0,0,NA,NA,Shed,350,11,2009,WD,Normal,200000
9,50,RM,51,6120,Pave,NA,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Artery,Norm,1Fam,1.5Fin,7,5,1931,1950,Gable,CompShg,BrkFace,Wd Shng,None,0,TA,TA,BrkTil,TA,TA,No,Unf,0,Unf,0,952,952,GasA,Gd,Y,FuseF,1022,752,0,1774,0,0,2,0,2,2,TA,8,Min1,2,TA,Detchd,1931,Unf,2,468,Fa,TA,Y,90,0,205,0,0,0,NA,NA,NA,0,4,2008,WD,Abnorml,129900
10,190,RL,50,7420,Pave,NA,Reg,Lvl,AllPub,Corner,Gtl,BrkSide,Artery,Artery,2fmCon,1.5Unf,5,6,1939,1950,Gable,CompShg,MetalSd,MetalSd,None,0,TA,TA,BrkTil,TA,TA,No,GLQ,851,Unf,0,140,991,GasA,Ex,Y,SBrkr,1077,0,0,1077,1,0,1,0,2,2,TA,5,Typ,2,TA,Attchd,1939,RFn,1,205,Gd,TA,Y,0,4,0,0,0,0,NA,NA,NA,0,1,2008,WD,Normal,118000

Any thoughts on how to solve this issue?

Thanks everyone!

Upvotes: 0

Views: 918

Answers (3)

guardian

Reputation: 311

After researching a bit more, I came up with an answer that might help you and anyone else working with datasets from different online sources.

Here is my example that reproduces the failure:

import pandas as pd
df=pd.read_csv('https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data?select=train.csv')

This generates a similar error: "Error tokenizing data. C error: Expected 1 fields in line 9, saw 2"

The question is why the error occurs and how to solve it. What I did was pass on_bad_lines="warn":

    df=pd.read_csv('https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data?select=train.csv',on_bad_lines="warn")

This generates a message:

'Skipping line 9: expected 1 fields, saw 2\nSkipping line 10: expected 1 fields, saw 3\nSkipping line 12: expected 1 fields, saw 4\nSkipping line 27: expected 1 fields, saw 8\nSkipping line 31...."
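To see where such per-line counts come from without pandas, here is a minimal sketch using only the standard library. It takes the field count of the first line as the expected width and reports every later line that has more fields. The function name `find_wide_lines` and the sample text are made up for illustration, and the `csv` module is not guaranteed to tokenize every edge case the same way the pandas C parser does.

```python
import csv
import io


def find_wide_lines(text, delimiter=","):
    """Return (expected, wide), where `expected` is the field count of the
    first line and `wide` lists (line_number, field_count) for every later
    line that has more fields than the first one."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    first = next(reader, None)
    if first is None:
        # Empty input: nothing to compare against.
        return 0, []
    expected = len(first)
    wide = []
    for row in reader:
        if len(row) > expected:
            # reader.line_num is the 1-based number of the line just read.
            wide.append((reader.line_num, len(row)))
    return expected, wide


# Hypothetical sample: an HTML page that was mistaken for a CSV file.
sample = (
    "<!DOCTYPE html>\n"
    "<title>House Prices</title>\n"
    "<p>one, two, three</p>\n"
)

expected, wide = find_wide_lines(sample)
print(expected, wide)
```

When the first line has a single field and later lines contain commas, the "header" is probably not a CSV header at all.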

Even so, the DataFrame is still created, and inspecting it makes it easier to interpret the data and find the error:

df
<!DOCTYPE html>
0   <html lang="en">
1   <head>
2   <title>House Prices - Advanced Regression Te...
3   <meta charset="utf-8" />
4   <meta name="turbolinks-cache-control" conten...
... ...
86  <div id="site-body" class="hide">
87  </div>
88  </main>
89  </body>
90  </html>

In this example we can see that the URL returned an HTML page rather than the CSV file, and parsing that HTML as CSV is what raised the error. With the locally saved file, no error occurred.

After authenticating with the API credentials of the source (Kaggle, Google, GitHub, etc.), the file loaded correctly and the results were as expected:
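A quick way to catch this case before calling read_csv is to look at the first characters of whatever the URL returns. The sketch below is only a heuristic; the helper name `looks_like_html` and both sample payloads are made up for illustration.

```python
def looks_like_html(text):
    """Heuristic check: does the payload start like an HTML page?

    Leading whitespace and a byte-order mark are ignored, and the
    comparison is case-insensitive.
    """
    head = text.lstrip("\ufeff \t\r\n").lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


# Hypothetical payloads: what a web page and a real CSV file start with.
html_payload = "<!DOCTYPE html>\n<html lang=\"en\">\n<head>"
csv_payload = "Id,MSSubClass,MSZoning\n1,60,RL\n"

print(looks_like_html(html_payload))  # True
print(looks_like_html(csv_payload))   # False
```

In practice, the text to check could be the first few hundred bytes of the response, for instance read with urllib.request.urlopen(url).read(512) and decoded as UTF-8. If the check returns True, the URL points at a web page (often a login or preview page) and not at the raw file.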

df.head()
Id  MSSubClass  MSZoning    LotFrontage LotArea Street  Alley   LotShape    LandContour Utilities   ... PoolArea    PoolQC  Fence   MiscFeature MiscVal MoSold  YrSold  SaleType    SaleCondition   SalePrice
0   1   60  RL  65.0    8450    Pave    NaN Reg Lvl AllPub  ... 0   NaN NaN NaN 0   2   2008    WD  Normal  208500
1   2   20  RL  80.0    9600    Pave    NaN Reg Lvl AllPub  ... 0   NaN NaN NaN 0   5   2007    WD  Normal  181500
2   3   60  RL  68.0    11250   Pave    NaN IR1 Lvl AllPub  ... 0   NaN NaN NaN 0   9   2008    WD  Normal  223500
3   4   70  RL  60.0    9550    Pave    NaN IR1 Lvl AllPub  ... 0   NaN NaN NaN 0   2   2006    WD  Abnorml 140000
4   5   60  RL  84.0    14260   Pave    NaN IR1 Lvl AllPub  ... 0   NaN NaN NaN 0   12  2008    WD  Normal  250000

Hope this helps in the future, regardless of the online source or environment.

Upvotes: 1

Shishu Kumar Choudhary

Reputation: 329

I downloaded the dataset and replicated the scenario on my machine, and it works fine. Here is my implementation; I hope it helps.

Python version: 3.7.14

Pandas version: 1.3.5

Environment: Google Colab

IMPLEMENTATION:

import pandas as pd

from google.colab import files
uploaded = files.upload()

df = pd.read_csv('train.csv') 

df.head(5)

Upvotes: 1

Nocry

Reputation: 45

I just downloaded the data and tried importing it in Colab with your code, and it works fine there. Maybe try re-uploading it to your Google Drive and then reading it into a DataFrame.

One thing I encountered: if you run the code while the CSV isn't fully uploaded (the file shows up in the data section before the upload is complete), things break.

Hope this was at least some help.
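If the file is read from a Google Drive sharing link, note that such a link normally opens a preview page and not the raw file. The sketch below rewrites a sharing link into the commonly used direct-download form. It makes several assumptions: the file is shared with "anyone with the link", Drive serves the `uc?export=download` address for it, and the file is small enough that Drive does not show a confirmation page first. The helper name `drive_direct_url` and the file ID `FILE_ID_123` are hypothetical.

```python
import re


def drive_direct_url(share_url):
    """Turn a Google Drive sharing link into a direct-download URL.

    Assumption: the link contains the file ID either as /file/d/<ID>/
    or as an id=<ID> query parameter.
    """
    match = (
        re.search(r"/file/d/([A-Za-z0-9_-]+)", share_url)
        or re.search(r"[?&]id=([A-Za-z0-9_-]+)", share_url)
    )
    if match is None:
        raise ValueError("no Google Drive file ID found in: " + share_url)
    return "https://drive.google.com/uc?export=download&id=" + match.group(1)


# Hypothetical sharing link; FILE_ID_123 is a placeholder, not a real file.
share = "https://drive.google.com/file/d/FILE_ID_123/view?usp=sharing"
print(drive_direct_url(share))
```

The resulting URL could then be passed to pd.read_csv. If the same tokenizing error comes back, the response is probably still an HTML page.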

Upvotes: 1
