Reputation: 1746
I have a CSV file which looks something like this:
A  B  C
1  2
2  4
3  2  5
1  2  3
4  5  6
When I read this data into Spark, it treats column C as "string" because of the "blanks" in the first few rows.
Could anybody please tell me how to load this file into a SQL DataFrame so that column C remains integer (or float)?
I'm using "sc.textFile
" to read the data into spark, and then converting it into SQL dataframe.
I read this and this link, but they didn't help me much.
Here is my code; I'm getting the error on the last line:
from pyspark.sql.types import StructType, StructField, StringType, FloatType

myFile = sc.textFile("myData.csv")
header = myFile.first()
fields = [StructField(field_name, StringType(), True) for field_name in header.split(',')]
fields[0].dataType = FloatType()
fields[1].dataType = FloatType()
fields[2].dataType = FloatType()
schema = StructType(fields)
myFileCh = myFile.map(lambda k: k.split(",")).map(lambda p: (float(p[0]), float(p[1]), float(p[2])))
Thanks!
Upvotes: 1
Views: 1008
Reputation: 690
So the issue is with this unsafe casting: float() raises a ValueError as soon as it hits a blank field (or the header row). You could implement a short function that performs a "safe" cast and returns a default value in case the cast to float fails.
def safe_cast(val, to_type, default=None):
    try:
        return to_type(val)
    except ValueError:
        return default
safe_cast('tst', float) # will return None
safe_cast('tst', float, 0.0) # will return 0.0
myFileCh = myFile.map(lambda k: k.split(",")).map(lambda p: (safe_cast(p[0], float), safe_cast(p[1], float), safe_cast(p[2], float)))
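To get from there to an actual SQL DataFrame with float columns, something along these lines should work. This is a minimal sketch, assuming a SQLContext named sqlContext, the schema built in your question, and that every row has three comma-separated fields (i.e. blanks appear as empty strings, like "1,2,"). The header row is filtered out first so it doesn't get cast to None:

header = myFile.first()
data = myFile.filter(lambda line: line != header)  # drop the header row before casting
rows = data.map(lambda k: k.split(",")).map(lambda p: (safe_cast(p[0], float), safe_cast(p[1], float), safe_cast(p[2], float)))
df = sqlContext.createDataFrame(rows, schema)  # schema from the question, with FloatType columns
df.printSchema()  # column C should now show as float; blanks become null

Since the fields were created with nullable=True, the None values produced by safe_cast simply become nulls in the DataFrame instead of breaking the load.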
Upvotes: 1