PasLeChoix

Reputation: 311

How to load a large CSV with many fields into Spark

Happy New Year!!!

I know similar questions have been asked and answered before; however, mine is different:

I have a large CSV with 100+ fields and 100MB+ in size. I want to load it into Spark (1.6) for analysis. The CSV's header looks like the attached sample (only one line of the data is shown).

Thank you very much.

UPDATE 1 (2016.12.31, 1:26pm EST):

I used the following approach and was able to load the data (a sample with a limited number of columns). However, I need to automatically assign the header (from the CSV) as the field names in the DataFrame, but the DataFrame looks like this:

[screenshot of the resulting DataFrame with auto-generated column names] Can anyone tell me how to do this? Note: I want to avoid any manual approach.

>>> import csv
>>> # read the file as plain text, then parse each partition with the csv module
>>> rdd = sc.textFile('file:///root/Downloads/data/flight201601short.csv')
>>> rdd = rdd.mapPartitions(lambda x: csv.reader(x))
>>> rdd.take(5)
>>> # toDF() without explicit names auto-generates column names (_1, _2, ...)
>>> df = rdd.toDF()
>>> df.show(5)
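One way to assign the header automatically in Spark 1.6 (a minimal sketch, assuming the first line of the file is the header row): take the first row as the column names, filter it out of the data, and pass the names to toDF():

>>> import csv
>>> rdd = sc.textFile('file:///root/Downloads/data/flight201601short.csv')
>>> rdd = rdd.mapPartitions(lambda x: csv.reader(x))
>>> header = rdd.first()                          # first row holds the column names
>>> data = rdd.filter(lambda row: row != header)  # drop the header row from the data
>>> df = data.toDF(header)                        # use the header values as column names
>>> df.show(5)

The filter drops every row identical to the header, which is fine as long as no data row repeats the header values exactly.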

Upvotes: 3

Views: 9710

Answers (1)

O. Gindele

Reputation: 376

As noted in the comments, you can use spark.read.csv for Spark 2.0.0+ (https://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html):

df = spark.read.csv('your_file.csv', header=True, inferSchema=True)

Setting header to True uses the first line of the file as the column names of the DataFrame. Setting inferSchema to True infers the column types from the data (at the cost of an extra pass over the file, which slows down reading).
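For the Spark 1.6 used in the question, the same options are available through the external spark-csv package (a sketch, assuming the package is on the classpath, e.g. started with --packages com.databricks:spark-csv_2.10:1.5.0):

df = sqlContext.read \
    .format('com.databricks.spark.csv') \
    .options(header='true', inferSchema='true') \
    .load('your_file.csv')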

See also here: Load CSV file with Spark

Upvotes: 5
