femibyte

Reputation: 3507

S3AbortableInputStream warning when reading large file from S3 in PySpark on AWS EMR

I keep getting this error on AWS EMR when reading a large dataset from S3 in PySpark.

INFO FileScanRDD: Reading File path: s3a://bucket/dir1/dir2/dir3/2018-01-31/part-XXX-YYYY-c000.snappy.parquet, 
range: 0-11383, partition values: [empty row]

WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. 
This is likely an error and may result in sub-optimal behavior. 
Request only the bytes you need via a ranged GET or drain the input stream after use.

The read is fairly standard:

df = spark.read.parquet(s3_path)

Has anyone encountered this error before? Any suggestions? Thanks in advance.

Upvotes: 2

Views: 2012

Answers (1)

Nick Chammas

Reputation: 12672

It's a warning, not an error (note the WARN level). You can safely ignore it, or upgrade to Hadoop 2.9 or 3.0 to get rid of it.
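If you just want to quiet the message without upgrading, one option is to raise the log level for that one logger. This is a minimal sketch, assuming log4j 1.x (the default with Hadoop 2.x on EMR) and that the warning is logged under the AWS SDK class name shown in the message; it also reaches through Spark's internal _jvm gateway, so treat it as a workaround rather than a supported API:

# Raise only the S3AbortableInputStream logger to ERROR so its WARN messages
# are suppressed. The logger name and the _jvm access are assumptions, not documented API.
log4j = spark.sparkContext._jvm.org.apache.log4j
log4j.LogManager.getLogger(
    "com.amazonaws.services.s3.internal.S3AbortableInputStream"
).setLevel(log4j.Level.ERROR)

df = spark.read.parquet(s3_path)  # subsequent reads should no longer emit the warning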

The warnings are thrown by the AWS Java SDK because Hadoop is intentionally aborting read operations early. (It looks like you are using s3a://, so Spark is interacting with S3 via Hadoop.)
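The warning itself even hints at a mitigation ("request only the bytes you need via a ranged GET"). S3A has an experimental random-access read mode that issues bounded range requests instead of opening and then aborting long streams, which tends to suit columnar formats like Parquet. A sketch, assuming your EMR release ships Hadoop 2.8+ where the fs.s3a.experimental.input.fadvise option exists (verify against your Hadoop version before relying on it):

from pyspark.sql import SparkSession

# Ask S3A for random-access (ranged GET) reads; Spark copies spark.hadoop.*
# settings into the Hadoop configuration it hands to S3A.
spark = (
    SparkSession.builder
    .appName("parquet-from-s3")
    .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
    .getOrCreate()
)

df = spark.read.parquet("s3a://bucket/dir1/dir2/dir3/2018-01-31/")

Note this changes how S3A reads data rather than just hiding the message, so compare job times before and after.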

You can read more about this warning from this discussion between the Hadoop committer who works on S3A and the AWS Java SDK maintainers. HADOOP-14890 suggests the warning will go away if you use Hadoop 2.9 or 3.0, which use a newer version of the AWS SDK.
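If you're not sure which Hadoop version your Spark build is linked against, you can ask it directly. VersionInfo is one of Hadoop's utility classes; going through Spark's internal _jvm gateway is an assumption that works on stock PySpark:

# Print the Hadoop version Spark is running against.
print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())

If that prints 2.9 or later, you already have the newer AWS SDK this refers to.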

Upvotes: 2
