My use case: at the end of each window, after the parquet files for that window have been written to GCS, I want to write all of their filenames to a separate metadata file.
I have tried several approaches, but each one ends up generating multiple metadata files per window, each holding only part of the data (the parquet filenames written in a given window are spread across several metadata files).
Below is my desired output:
Metadata Filename: gs://my-bucket/path/to/my/metadata-file/metadata-20240117T12:40-20240117T12:45.txt
Metadata File Content:
gs://my-bucket/path/to/my/parquet-file/parquet-20240117T12:40-20240117T12:45-0.parquet
gs://my-bucket/path/to/my/parquet-file/parquet-20240117T12:40-20240117T12:45-1.parquet
gs://my-bucket/path/to/my/parquet-file/parquet-20240117T12:40-20240117T12:45-2.parquet
gs://my-bucket/path/to/my/parquet-file/parquet-20240117T12:40-20240117T12:45-3.parquet
gs://my-bucket/path/to/my/parquet-file/parquet-20240117T12:40-20240117T12:45-4.parquet
gs://my-bucket/path/to/my/parquet-file/parquet-20240117T12:40-20240117T12:45-5.parquet
Each approach I tried instead spreads these same six filenames across 3-4 different metadata files.
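To make the intended grouping concrete, here is a minimal plain-Python sketch of the mapping from parquet filenames to metadata files. It is not Beam code and not what the linked gist does. The regex, the `METADATA_DIR` constant, and the function names `group_by_window` and `build_metadata_files` are hypothetical, inferred only from the example paths above.

```python
import re
from collections import defaultdict

# Hypothetical pattern, inferred from the example filenames in this question:
# parquet-<window start>-<window end>-<shard>.parquet
WINDOW_RE = re.compile(
    r"parquet-(\d{8}T\d{2}:\d{2})-(\d{8}T\d{2}:\d{2})-(\d+)\.parquet$"
)

# Assumed metadata location, taken from the desired output above.
METADATA_DIR = "gs://my-bucket/path/to/my/metadata-file"


def group_by_window(parquet_paths):
    """Group parquet paths by their (start, end) window, ordered by shard."""
    grouped = defaultdict(list)
    for path in parquet_paths:
        match = WINDOW_RE.search(path)
        if match is None:
            raise ValueError(f"unexpected parquet filename: {path}")
        start, end, shard = match.groups()
        grouped[(start, end)].append((int(shard), path))
    # Sort by shard number so the metadata file lists shards in order.
    return {
        window: [path for _, path in sorted(items)]
        for window, items in grouped.items()
    }


def build_metadata_files(parquet_paths):
    """Return {metadata filename: file content}, exactly one entry per window."""
    files = {}
    for (start, end), paths in group_by_window(parquet_paths).items():
        name = f"{METADATA_DIR}/metadata-{start}-{end}.txt"
        files[name] = "\n".join(paths) + "\n"
    return files


if __name__ == "__main__":
    base = "gs://my-bucket/path/to/my/parquet-file"
    example = [
        f"{base}/parquet-20240117T12:40-20240117T12:45-{shard}.parquet"
        for shard in range(6)
    ]
    for name, content in build_metadata_files(example).items():
        print(name)
        print(content, end="")
```

The point of the sketch is that all six filenames of a window are collected under a single window key before any metadata file is produced. In the pipeline, the equivalent step would have to see every filename of the window at once; how to achieve that in Beam is the open question here.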
What am I doing wrong here?
Here is my code that does the parquet writing: https://gist.github.com/iamadhee/c1a3c9ce7c4de89f543e32e5a006d0e5