Reputation: 55
Folks,
We have a requirement to do a minor transformation on a CSV file and write the result to another HDFS folder using Spark.
e.g. /input/csv1.txt (at least a 4 GB file)
ID,Name,Address
100,john,some street
The output should go to another file (output/csv1.txt). Basically, two new columns will be added after analyzing the address (the order of records should be the same as in the input file):
ID,Name,Address,Country,ZipCode
100,john,some street,India,560001
It looks like there is no easy way to do this with Spark.
Upvotes: 0
Views: 1504
Reputation: 13154
Ehm, I don't know what you mean by no easy way - the spark-csv package makes it very easy, IMHO. Depending on which version of Spark you are running, you need to do one of the following:
Spark 2.x
val df = spark.read
  .option("header", "true")   // the input file has a header row
  .csv("/path/to/files/")

df
  .withColumn("country", ...)
  .withColumn("zip_code", ...)
  .write
  .option("header", "true")
  .csv("/my/output/path/")
Spark 1.x
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")   // the input file has a header row
  .load("/path/to/my/files/")

df
  .withColumn("country", ...)
  .withColumn("zip_code", ...)
  .write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("/my/output/path/")
Note that I just put withColumn here - you are probably joining with some other dataframe containing the country and zip code, but my example is just to illustrate how you read and write with the spark-csv package (which has been built into Spark 2.x). A minimal sketch of such a join follows below.
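For completeness, here is what such a join could look like in Spark 2.x. The addressLookup dataframe, its path, and its columns are hypothetical placeholders for however you actually derive the country and zip code:

// Hypothetical lookup table with columns Address, Country, ZipCode
val addressLookup = spark.read
  .option("header", "true")
  .csv("/path/to/address/lookup/")

val df = spark.read
  .option("header", "true")
  .csv("/path/to/files/")

// A left outer join keeps every input record,
// even when there is no matching address in the lookup table
df.join(addressLookup, Seq("Address"), "left_outer")
  .write
  .option("header", "true")
  .csv("/my/output/path/")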
Upvotes: 1