Joe

Reputation: 13111

How to read parquet files from S3 line by line, filter them, and save line by line to another S3 bucket?

I have an S3 bucket with several parquet files containing billions of records.

I want to read the whole folder, filter it line by line (e.g. if a line contains a specific value, filter it out), and save the result to another S3 location. Since the records total several gigabytes, I would like to read and save them to the other S3 bucket line by line if possible.

I only have a PySpark (Glue) environment for this, so I cannot do it on my laptop or on EC2 (for security reasons).

In Linux, I could easily achieve that with:
cat file.csv | grep -v "exclude value" > file2.csv

How can I achieve that in S3?

Upvotes: 0

Views: 847

Answers (1)

Jay Kakadiya

Reputation: 541

Try the code below; it should work for you.

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql.functions import col

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Read the parquet data from S3 into a DynamicFrame, then convert it to a
# Spark DataFrame with .toDF()
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://glue-sample-target/input-dir/medicare_parquet/"]},
    format = "parquet",
    transformation_ctx = "datasource0").toDF()

# Now filter the data based on a column value
filterdf = datasource0.filter(col("SOURCE") == "ABC")

# Convert the DataFrame back to a DynamicFrame
filterdf_dynamic_frame = DynamicFrame.fromDF(filterdf, glueContext, "filterdf_dynamic_frame")

# Write the filtered data to the target S3 location as parquet
glueContext.write_dynamic_frame.from_options(
    frame = filterdf_dynamic_frame,
    connection_type = "s3",
    connection_options = {"path": "s3://glue-sample-target/output-dir/medicare_parquet"},
    format = "parquet")

Upvotes: 1
