Joe

Reputation: 13111

How to read parquet files from S3 line by line, filter them, and save line by line to another S3 bucket?

I have an S3 bucket with several parquet files containing billions of records.

I want to read the whole folder, filter it line by line (e.g. if a line contains a specific value, filter it out), and save the result to another S3 location. Since the records total several gigabytes, I would like to read and save them to the other S3 bucket line by line if possible.

I only have a PySpark (Glue) environment for this, so I cannot do it on my laptop or on EC2 (for security reasons).

In Linux, I could easily achieve that with:
cat file.csv | grep -v "exclude value" > file2.csv

How can I achieve that in S3?

Upvotes: 0

Views: 847

Answers (1)

Jay Kakadiya

Reputation: 541

Try the code below; it should work for you.

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql.functions import col

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Read the parquet data from S3 into a DynamicFrame, then convert it to a
# Spark DataFrame with .toDF()
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://glue-sample-target/input-dir/medicare_parquet/"]},
    format = "parquet",
    transformation_ctx = "datasource0").toDF()

# Now filter the data based on a column value
filterdf = datasource0.filter(col("SOURCE") == "ABC")

# Convert the DataFrame back to a DynamicFrame
filterdf_dynamic_frame = DynamicFrame.fromDF(filterdf, glueContext, "filterdf_dynamic_frame")

# Write the filtered data to the target S3 location as parquet
glueContext.write_dynamic_frame.from_options(
    frame = filterdf_dynamic_frame,
    connection_type = "s3",
    connection_options = {"path": "s3://glue-sample-target/output-dir/medicare_parquet"},
    format = "parquet")

Upvotes: 1
