Reputation: 323
I'm getting an error while displaying a CSV file through PySpark. The PySpark code and the CSV file that I used are shown below.
from pyspark.sql import *
spark.conf.set("fs.azure.account.key.xxocxxxxxxx","xxxxx")
time_on_site_tablepath= "wasbs://[email protected]/time_on_site.csv"
time_on_site = spark.read.format("csv").options(header='true', inferSchema='true').load(time_on_site_tablepath)
display(time_on_site.head(50))
The error is shown below:
ValueError: Some of types cannot be determined by the first 100 rows, please try again with sampling
The schema inferred for the CSV file is shown below:
time_on_site:pyspark.sql.dataframe.DataFrame
next_eventdate:timestamp
barcode:integer
eventdate:timestamp
sno:integer
eventaction:string
next_action:string
next_deviceid:integer
next_device:string
type_flag:string
site:string
location:string
flag_perimeter:integer
deviceid:integer
device:string
tran_text:string
flag:integer
timespent_sec:integer
gg:integer
The CSV file data is shown below:
next_eventdate,barcode,eventdate,sno,eventaction,next_action,next_deviceid,next_device,type_flag,site,location,flag_perimeter,deviceid,device,tran_text,flag,timespent_sec,gg
2018-03-16 05:23:34.000,1998296,2018-03-14 18:50:29.000,1,IN,OUT,2,AGATE-R02-AP-Vehicle_Exit,,NULL,NULL,1,1,AGATE-R01-AP-Vehicle_Entry,Access Granted,0,124385,0
2018-03-17 07:22:16.000,1998296,2018-03-16 18:41:09.000,3,IN,OUT,2,AGATE-R02-AP-Vehicle_Exit,,NULL,NULL,1,1,AGATE-R01-AP-Vehicle_Entry,Access Granted,0,45667,0
2018-03-19 07:23:55.000,1998296,2018-03-17 18:36:17.000,6,IN,OUT,2,AGATE-R02-AP-Vehicle_Exit,,NULL,NULL,1,1,AGATE-R01-AP-Vehicle_Entry,Access Granted,1,132458,1
2018-03-21 07:25:04.000,1998296,2018-03-19 18:23:26.000,8,IN,OUT,2,AGATE-R02-AP-Vehicle_Exit,,NULL,NULL,1,1,AGATE-R01-AP-Vehicle_Entry,Access Granted,0,133298,0
2018-03-24 07:33:38.000,1998296,2018-03-23 18:39:04.000,10,IN,OUT,2,AGATE-R02-AP-Vehicle_Exit,,NULL,NULL,1,1,AGATE-R01-AP-Vehicle_Entry,Access Granted,0,46474,0
What could be done to load the CSV file successfully?
Upvotes: 2
Views: 4095
Reputation: 1
For some reason, probably a bug, even if you provide a schema in the spark.read.schema(my_schema).csv('path') call, you get the same error on a display(df.head()) call. display(df) works, though, but it gave me a WTF moment.
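One possible explanation, offered as an assumption rather than a confirmed diagnosis: head() returns a plain Python list of Row objects, which carry values but no column types, so the schema given to the reader is no longer attached and the types have to be worked out again from the values. The sketch below is a simplified pure-Python model of that idea, not PySpark's actual implementation; infer_column_types and the sample rows are hypothetical.

```python
def infer_column_types(rows, sample_size=100):
    """Simplified model of type inference over a list of dict rows.

    This is NOT PySpark's code. It only mimics the idea that a column's
    type is taken from the non-None values found in the first rows.
    """
    sample = rows[:sample_size]
    columns = list(sample[0].keys()) if sample else []
    types = {}
    for col in columns:
        seen = [type(r[col]) for r in sample if r[col] is not None]
        if not seen:
            # Every sampled value is None, so there is nothing to infer from.
            raise ValueError(f"cannot determine type of column {col!r}")
        types[col] = seen[0].__name__
    return types


# A column holding only None values cannot be typed from the values alone.
all_null = [{"sno": 1, "type_flag": None}, {"sno": 3, "type_flag": None}]
try:
    infer_column_types(all_null)
    inference_failed = False
except ValueError:
    inference_failed = True

# As soon as one sampled value is present, the same column can be typed.
one_value = [{"sno": 1, "type_flag": None}, {"sno": 3, "type_flag": "A"}]
one_value_types = infer_column_types(one_value)
```

In this model the first call raises ValueError and the second succeeds, which would match display(df) working (the DataFrame keeps its schema) while display(df.head()) fails.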
Upvotes: 0
Reputation: 397
There is no issue with your syntax; it works fine.
The issue is in the data of your CSV file: the column named type_flag
contains only None (null) values, so its data type cannot be inferred.
So, here are two options.
You can display the data without using head(), like this:
display(time_on_site)
If you want to use head(),
then you need to replace the null values; here I replaced them with the empty string ('').
time_on_site = time_on_site.fillna('')
display(time_on_site.head(50))
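Another possibility, not part of the original answer, is to use limit() instead of head(). limit(n) returns a DataFrame, while head(n) returns a plain list of Row objects, so the schema stays attached and the null values do not have to be replaced. A minimal sketch, assuming a Databricks notebook where display is available; preview is a hypothetical helper name.

```python
def preview(df, n=50):
    """Return the first n rows of df as a DataFrame.

    df.head(n) would return a Python list of Row objects instead;
    df.limit(n) keeps the DataFrame and its schema.
    """
    return df.limit(n)


# In a Databricks notebook (assumed), usage would look like:
# display(preview(time_on_site, 50))
```

This keeps the original data untouched, which may matter if empty strings and nulls need to stay distinguishable later on.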
Upvotes: 3