Reputation: 91
How to read CSV files directly into Spark DataFrames without using the Databricks CSV API?
I know there is the Databricks CSV API, but I can't use that API.
I know I could use a case class and map the columns by position (cols(0), cols(1), ...), but the problem is that I have more than 22 columns, so I can't use a case class, because case classes are limited to 22 fields.
I know there is StructType to define a schema, but I feel it would take very lengthy code to define 40 columns in a StructType.
I am looking for a way to read the file into a DataFrame using a read method, but Spark has no direct support for CSV files, so we need to parse them ourselves? And how do we do that with more than 40 columns?
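For context, a minimal sketch of the StructType-plus-manual-parse approach the question refers to, assuming a Spark 1.x SQLContext, an existing SparkContext sc, a hypothetical path /path/to/data.csv, a header row, and plain comma-separated values; the schema is built programmatically from the header so the 40 columns never have to be typed out by hand:

```scala
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val sqlContext = new SQLContext(sc)
val lines = sc.textFile("/path/to/data.csv")

// Build the schema programmatically from the header line; every column is
// read as a string in this sketch.
val header = lines.first()
val schema = StructType(
  header.split(",").map(name => StructField(name.trim, StringType, nullable = true)))

// Drop the header row and split the remaining lines into Rows.
val rows = lines
  .filter(_ != header)
  .map(line => Row.fromSeq(line.split(",", -1).toSeq))

val df = sqlContext.createDataFrame(rows, schema)
df.printSchema()
```

Note that this naive split won't handle quoted fields or embedded commas and leaves every column as a string; that is essentially the parsing work the Databricks CSV package would normally do for you.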
Upvotes: 1
Views: 640
Reputation: 29165
It seems that from Scala 2.11.x onwards the arity limit issue is fixed; please have a look at https://issues.scala-lang.org/browse/SI-7296
To overcome this in Scala < 2.11, see my answer, which extends Product and overrides the methods productArity, productElement, and canEqual(that: Any).
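For illustration, a rough sketch of the shape of that Product-based workaround (not the linked answer itself; the class and field names are made up, and only three fields are shown where a real record would have 22+), assuming a Spark 1.x SQLContext and an existing SparkContext sc:

```scala
import org.apache.spark.sql.SQLContext

// A plain class standing in for a record with more than 22 fields;
// only three hypothetical fields are shown to keep the sketch short.
class WideRecord(val id: String, val name: String, val amount: String)
  extends Product with Serializable {

  override def productArity: Int = 3

  override def productElement(n: Int): Any = n match {
    case 0 => id
    case 1 => name
    case 2 => amount
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }

  override def canEqual(that: Any): Boolean = that.isInstanceOf[WideRecord]
}

// Assumes `sc` is an existing SparkContext and the CSV path is hypothetical.
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = sc.textFile("/path/to/data.csv")
  .map(_.split(","))
  .map(cols => new WideRecord(cols(0), cols(1), cols(2)))
  .toDF()
```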
Upvotes: 0
Reputation: 3169
I've also looked into this and ended up writing a Python script to generate the Scala code for the parse(line) function and the schema definition. Yes, this may become a lengthy blob of code.
Another path you can take if your data is not too big: use Python pandas! Start up PySpark, read your data into a pandas DataFrame, and then create a Spark DataFrame from that. Save it (e.g. as a Parquet file) and load that Parquet file in Scala Spark.
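If you take that route, the Scala side is just a plain Parquet read; a minimal sketch, assuming the pandas/PySpark step has already written a file at the hypothetical path /path/to/data.parquet, that sc is an existing SparkContext, and that the Spark 1.4+ DataFrameReader API is available:

```scala
import org.apache.spark.sql.SQLContext

// Assumes the Parquet file was written beforehand from the pandas/PySpark side.
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.parquet("/path/to/data.parquet")
df.printSchema()
```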
Upvotes: 0