Mihirkumar Joshi

Reputation: 55

Spark : Modify CSV file and write to other folder

Folks,

We have a requirement to apply a minor transformation to a CSV file and write the result to another HDFS folder using Spark.

e.g. /input/csv1.txt (a file of at least 4 GB)

ID,Name,Address
100,john,some street

The output should be written to output/csv1.txt. Two new columns will be added after analyzing the address (the record order must be the same as in the input file):

ID,Name,Address,Country,ZipCode
100,john,some street,India,560001

It looks like there is no easy way to do this with Spark.

Upvotes: 0

Views: 1504

Answers (1)

Glennie Helles Sindholt

Reputation: 13154

Ehm, I don't know what you mean by no easy way - the spark-csv package makes it very easy IMHO. Depending on which version of Spark you are running, you need to do one of the following:

Spark 2.x

val df = spark.read.csv("/path/to/files/")
df
 .withColumn("country", ...)
 .withColumn("zip_code", ...)
 .write
 .csv("/my/output/path/")

Spark 1.x

val df = sqlContext.read.format("com.databricks.spark.csv").load("/path/to/my/files/")
df
 .withColumn("country", ...)
 .withColumn("zip_code", ...)
 .write
 .format("com.databricks.spark.csv")
 .save("/my/output/path/")

Note that I just put withColumn here - you are probably joining with some other dataframe containing the country and zip code, but my example is just to illustrate how you read and write with the spark-csv package (which has been built into Spark 2.x).
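To make the placeholders concrete, here is a minimal Spark 2.x sketch. The `resolveAddress` function is hypothetical - in practice you would call a geocoding service or join against a reference table - and the paths and object name are just illustrative. Since the input file has a header row (ID,Name,Address), `option("header", "true")` is used on both read and write:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object AddAddressColumns {
  // Hypothetical lookup: maps an address string to (country, zipCode).
  // Replace with a real geocoding call or a join with a reference DataFrame.
  def resolveAddress(address: String): (String, String) =
    ("India", "560001")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("csv-transform").getOrCreate()

    // Wrap the lookup as UDFs so it can be applied column-wise.
    val countryUdf = udf((a: String) => resolveAddress(a)._1)
    val zipUdf     = udf((a: String) => resolveAddress(a)._2)

    spark.read
      .option("header", "true")   // first line is ID,Name,Address
      .csv("/input/")
      .withColumn("Country", countryUdf(col("Address")))
      .withColumn("ZipCode", zipUdf(col("Address")))
      .write
      .option("header", "true")
      .csv("/output/")

    spark.stop()
  }
}
```

Note that the output will be a folder of part files rather than a single csv1.txt; coalescing to one partition would produce a single file at the cost of parallelism.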

Upvotes: 1
