Reputation: 14037
I am reading in a CSV using pandas' chunking functionality. It works, except that I am not able to retain the headers. Is there a way/option to do this? Here is sample code:
import pyspark
import pandas as pd

sc = pyspark.SparkContext(appName="myAppName")
spark_rdd = sc.emptyRDD()

# filename: csv file
chunks = pd.read_csv(filename, chunksize=10000)
for chunk in chunks:
    spark_rdd += sc.parallelize(chunk.values.tolist())
    # print(chunk.head())
    # print(spark_rdd.toDF().show())
    # break

spark_df = spark_rdd.toDF()
spark_df.show()
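For context on why this happens: chunk.values is a bare NumPy array, so the column labels never make it into the RDD; only the cell values do. The names are still available on each chunk, though. A minimal sketch of the difference (get_chunk() here is just for illustration):

chunk = pd.read_csv(filename, chunksize=10000).get_chunk()
print(chunk.columns.tolist())    # column names survive on the DataFrame
print(chunk.values.tolist()[0])  # plain row values, no header information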
Upvotes: 0
Views: 1234
Reputation: 1336
Try this:
import pyspark
import pandas as pd

sc = pyspark.SparkContext(appName="myAppName")
spark_rdd = sc.emptyRDD()

# Read ten rows first, just to capture the column names
x = pd.read_csv(filename, nrows=10)
mycolumns = list(x)

# filename: csv file
chunks = pd.read_csv(filename, chunksize=10000)
for chunk in chunks:
    spark_rdd += sc.parallelize(chunk.values.tolist())

# Rebuild the DataFrame with the saved column names
spark_df = spark_rdd.map(lambda x: tuple(x)).toDF(mycolumns)
spark_df.show()
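If the file is large, a header-only read avoids pulling ten data rows just for the names; pandas returns an empty frame with the columns intact when nrows=0 (a small, untested variant of the same idea):

mycolumns = pd.read_csv(filename, nrows=0).columns.tolist()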
Upvotes: 1
Reputation: 14037
I ended up using Databricks' spark-csv package:
sc = pyspark.SparkContext()
sql = pyspark.SQLContext(sc)

df = sql.read.load(filename,
                   format='com.databricks.spark.csv',
                   header='true',
                   inferSchema='true')
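For anyone on Spark 2.0 or later, the CSV reader is built in, so the external package is no longer needed. A minimal sketch, assuming Spark >= 2.0 and the same filename variable:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("myAppName").getOrCreate()
df = spark.read.csv(filename, header=True, inferSchema=True)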
Upvotes: 0