Reputation: 2253
I am studying SparkR. I have a CSV file:
a <- read.df(sqlContext, "./mine/a2014.csv", "csv")
I want to use write.df to store this data frame. However, when I run:
write.df(a, "mine/a.csv")
I get a folder called a.csv that contains no CSV file at all.
Upvotes: 1
Views: 1484
Reputation: 3939
Spark partitions your data into blocks so it can distribute those partitions over the nodes in your cluster. When writing the data, it retains this partitioning: it creates a directory and writes each partition to a separate file. This way it can take better advantage of distributed file systems (writing each block in parallel to HDFS/S3), and it doesn't have to collect all the data onto a single machine, which might not be able to handle that much data.
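For example, listing the output directory from a plain R session might show something like this (the part-file names here are made up; the exact names and the number of files depend on your Spark version and partition count):
list.files("mine/a.csv")
# Hypothetical output -- the actual generated names will differ:
# [1] "_SUCCESS"  "part-00000-..."  "part-00001-..."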
The two files with the long names are the two partitions of your data and hold the actual CSV data. You can see this by copying them, renaming the copies with a .csv extension and double-clicking them, or with something like head longfilename.
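If you don't have head available (e.g. on Windows), the same peek works from plain R; this assumes the write above already ran, and it just grabs whichever part files Spark generated:
# Find the partition files and read the first few lines of one of them
parts <- list.files("mine/a.csv", pattern = "^part-", full.names = TRUE)
readLines(parts[1], n = 5)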
You can test whether the write was successful by trying to read it back in: give Spark the path to the directory and it will recognize it as a partitioned file through the metadata and _SUCCESS files you mentioned.
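For example, reusing the sqlContext and "csv" source from your question (untested on my end, same caveat as at the bottom of this answer):
a2 <- read.df(sqlContext, "mine/a.csv", "csv")  # point it at the directory, not at a part file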
If you do need all the data in one file, you can use repartition to reduce the number of partitions to 1 and then write it:
b <- repartition(a, 1)
write.df(b, "mine/b.csv")
This will result in just one long-named file, which is a CSV file containing all the data.
(I don't use SparkR, so this is untested; in Scala/PySpark you would prefer coalesce rather than repartition, but I couldn't find an equivalent SparkR function.)
Upvotes: 2