Reputation: 13111
I have an S3 bucket with several Parquet files containing billions of records in total.
I want to read the whole folder, filter it record by record (e.g. if a record contains a specific value, drop it) and save the result to another S3 location. Since all the records together amount to several gigabytes, I would like to read and write them line by line to another S3 bucket if possible.
I only have a PySpark (Glue) environment to do this, so I cannot do it on my laptop or on an EC2 instance (for security reasons).
On Linux, I could easily achieve that with:
cat file.csv | grep -v "exclude value" > file2.csv
How can I achieve that in S3?
Upvotes: 0
Views: 847
Reputation: 541
Try the code below; it should work for you.
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import col

# Read every Parquet file under the S3 prefix into a DynamicFrame, then convert it
# to a Spark DataFrame with .toDF(); glueContext comes from the standard Glue job setup
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://glue-sample-target/input-dir/medicare_parquet/"]},
    format="parquet",
    transformation_ctx="datasource0").toDF()

# Filter the data based on a column value
filterdf = datasource0.filter(col("SOURCE") == "ABC")

# Convert the DataFrame back to a DynamicFrame
filterdf_dynamic_frame = DynamicFrame.fromDF(filterdf, glueContext, "filterdf_dynamic_frame")

# Write the filtered data back to S3 as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=filterdf_dynamic_frame,
    connection_type="s3",
    connection_options={"path": "s3://glue-sample-target/output-dir/medicare_parquet"},
    format="parquet")
Upvotes: 1