Devender Prakash

Reputation: 91

How to read CSV files directly into spark DataFrames without using databricks csv api ?

I know there is the Databricks CSV API, but I can't use that API.
I know I could use a case class and map the columns by position (cols(0), cols(1), ...), but the problem is that I have more than 22 columns, so a case class won't work because of the 22-field limit. I know there is StructType to define the schema, but it feels like very lengthy code to define 40 columns in a StructType. I am looking for a way to read the file into a DataFrame using the read method, but Spark has no direct support for CSV files, so I need to parse it myself? How do I do that with more than 40 columns?
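
For what it's worth, a minimal sketch of the manual route without the spark-csv package, assuming Spark 1.x with a SQLContext, a header row in the file, comma-separated values with no quoted fields, and every column treated as a string; the path is illustrative:

    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.types.{StructField, StringType, StructType}

    // assumes an existing SparkContext `sc`; path is illustrative
    val sqlContext = new SQLContext(sc)
    val raw = sc.textFile("/path/to/file.csv")

    // take the header and build the StructType in a loop,
    // so 40+ columns do not have to be typed out by hand
    val header = raw.first()
    val schema = StructType(
      header.split(",").map(name => StructField(name.trim, StringType, nullable = true)))

    // drop the header line and split the remaining lines into Rows
    // (naive split: breaks on quoted fields containing commas)
    val rows = raw.filter(_ != header).map(line => Row.fromSeq(line.split(",", -1).map(_.trim)))

    val df = sqlContext.createDataFrame(rows, schema)
    df.printSchema()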

Upvotes: 1

Views: 640

Answers (2)

Ram Ghadiyaram

Reputation: 29165

It seems that from Scala 2.11.x onwards the arity limit issue was fixed; please have a look at https://issues.scala-lang.org/browse/SI-7296

To overcome this on Scala versions before 2.11, see my answer, which extends Product and overrides the methods productArity, productElement, and canEqual(that: Any).
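
For illustration, a rough sketch of that pattern rather than the linked answer itself: a three-column record stands in for the 22-plus case, the class and field names are made up, and it assumes Spark 1.x reflection over Product types via SQLContext.createDataFrame:

    import org.apache.spark.sql.SQLContext

    // Plain class (not a case class) extending Product, so the
    // 22-field case-class limit of Scala < 2.11 does not apply.
    // Three columns stand in for the real 40+; extend the same way.
    class Record(val id: String, val name: String, val amount: String)
      extends Product with Serializable {

      override def productArity: Int = 3

      override def productElement(n: Int): Any = n match {
        case 0 => id
        case 1 => name
        case 2 => amount
        case _ => throw new IndexOutOfBoundsException(n.toString)
      }

      override def canEqual(that: Any): Boolean = that.isInstanceOf[Record]
    }

    // assumes an existing SparkContext `sc`; path and parsing are illustrative
    val sqlContext = new SQLContext(sc)
    val rdd = sc.textFile("/path/to/file.csv").map { line =>
      val c = line.split(",", -1)
      new Record(c(0), c(1), c(2))
    }
    val df = sqlContext.createDataFrame(rdd)  // schema inferred from the Product type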

Upvotes: 0

wmoco_6725

Reputation: 3169

I've also looked into this and ended up writing a Python script to generate the Scala code for the parse(line) function and the schema definition. Yes, this may become a lengthy blob of code.
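
For illustration, a sketch of what such generated Scala code might look like; only three placeholder columns are shown, and the names and types are invented:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    // Generated schema definition: one StructField per column,
    // written out explicitly (names and types are placeholders).
    val schema = StructType(Seq(
      StructField("id",     IntegerType, nullable = true),
      StructField("name",   StringType,  nullable = true),
      StructField("amount", DoubleType,  nullable = true)
      // ... one line per remaining column ...
    ))

    // Generated parse(line) function: split the line and convert
    // each field to the type declared in the schema.
    def parse(line: String): Row = {
      val c = line.split(",", -1)
      Row(c(0).toInt, c(1), c(2).toDouble)
    }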

Another path you can take, if your data is not too big: use Python pandas! Start up PySpark, read your data into a pandas DataFrame, then create a Spark DataFrame from that and save it (e.g. as a Parquet file). Then load that Parquet file in Scala Spark.
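
If you take that route, the Scala side of the hand-off is just reading the Parquet file back in; the path is illustrative, and sqlContext assumes Spark 1.x:

    // After PySpark/pandas has written the data out as Parquet,
    // pick it up again on the Scala side (path is illustrative).
    val df = sqlContext.read.parquet("/path/to/data.parquet")
    df.printSchema()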

Upvotes: 0
