Reputation: 164
I would like to produce stratified TFrecord files from a large DataFrame based on a certain condition, for which I use write.partitionBy()
. I'm also using the tensorflow-connector in SPARK, but this apparently does not work together with a write.partitionBy()
operation. Therefore, I have not found another way than to try to work in two steps:
partitionBy()
and write the resulting partitions to parquet files.It is the second step that I'm unable to do efficiently. My idea was to read in the individual parquet files on the executors and immediately write them into TFrecord files. But this needs access to the SQLContext which can only be done in the Driver (discussed here) so not in parallel. I would like to do something like this:
# List all parquet files to be converted
import glob, os
files = glob.glob('/path/*.parquet'))
sc = SparkSession.builder.getOrCreate()
sc.parallelize(files, 2).foreach(lambda parquetFile: convert_parquet_to_tfrecord(parquetFile))
Could I construct the function convert_parquet_to_tfrecord
that would be able to do this on the executors?
I've also tried just using the wildcard when reading all the parquet files:
SQLContext(sc).read.parquet('/path/*.parquet')
This indeed reads all parquet files, but unfortunately not into individual partitions. It appears that the original structure gets lost, so it doesn't help me if I want the exact contents of the individual parquet files converted into TFrecord files.
Any other suggestions?
Upvotes: 5
Views: 5102
Reputation: 46
Try spark-tfrecord.
Spark-TFRecord is a tool similar to spark-tensorflow-connector but it does partitionBy. The following example shows how to partition a dataset.
import org.apache.spark.sql.SaveMode
// create a dataframe
val df = Seq((8, "bat"),(8, "abc"), (1, "xyz"), (2, "aaa")).toDF("number", "word")
val tf_output_dir = "/tmp/tfrecord-test"
// dump the tfrecords to files.
df.repartition(3, col("number")).write.mode(SaveMode.Overwrite).partitionBy("number").format("tfrecord").option("recordType", "Example").save(tf_output_dir)
More information can be found at Github repo: https://github.com/linkedin/spark-tfrecord
Upvotes: 2
Reputation: 158
If I understood your question correctly, you want to write the partitions locally on the workers' disk.
If that is the case then I would recommend looking at spark-tensorflow-connector's instructions on how to do so.
This is the code that you are looking for (as stated in the documentation linked above):
myDataFrame.write.format("tfrecords").option("writeLocality", "local").save("/path")
On a side note, if you are worried about efficiency why are you using pyspark? It would be better to use scala instead.
Upvotes: 0