Reputation: 783
I am a PySpark newbie and have recently joined a project that uses an ETL framework/pipeline developed in PySpark. The framework ingests CSV files from S3 (by reading the data into a DataFrame), processes them, and loads them into Hive tables (the staging layer). The framework accepts an ETL config file which contains transformations in the form of Spark SQL statements (using temp views). It then reads the data from the staging tables, applies these transformations, and finally loads the data into the Hive tables in the target DWH.
The jobs performing the transformations mentioned above are submitted to an EMR cluster running Spark (2.4) on YARN.
What is the relationship between S3 and HDFS (on EMR) in this case? I have asked others on the team but didn't get a complete picture.
Now, as per my understanding, the input files, along with the underlying data files of the Hive tables, are stored on S3. When I run the following ls command for a particular table, it lists the 10 part-files that make up the table:
aws s3 ls s3://my_bucket/cust_dw/cust_dm_customer_dtls/
2022-05-02 08:24:24 15236547 part-00000-42384ef3-05ae-4b64-abd9-9fe48b9852bf-c000.snappy.parquet
2022-05-02 08:24:24 47685934 part-00001-42384ef3-05ae-4b64-abd9-9fe48b9852bf-c000.snappy.parquet
2022-05-02 08:24:24 59204612 part-00002-42384ef3-05ae-4b64-abd9-9fe48b9852bf-c000.snappy.parquet
2022-05-02 08:24:24 81294375 part-00003-42384ef3-05ae-4b64-abd9-9fe48b9852bf-c000.snappy.parquet
2022-05-02 08:24:24 84390123 part-00004-42384ef3-05ae-4b64-abd9-9fe48b9852bf-c000.snappy.parquet
2022-05-02 08:24:24 19238712 part-00005-42384ef3-05ae-4b64-abd9-9fe48b9852bf-c000.snappy.parquet
2022-05-02 08:24:24 19348723 part-00006-42384ef3-05ae-4b64-abd9-9fe48b9852bf-c000.snappy.parquet
2022-05-02 08:24:24 28934198 part-00007-42384ef3-05ae-4b64-abd9-9fe48b9852bf-c000.snappy.parquet
2022-05-02 08:24:24 18398123 part-00008-42384ef3-05ae-4b64-abd9-9fe48b9852bf-c000.snappy.parquet
2022-05-02 08:24:24 93478230 part-00009-42384ef3-05ae-4b64-abd9-9fe48b9852bf-c000.snappy.parquet
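As a quick sanity check on a listing like this, the `aws s3 ls` output can be parsed to count the part-files and total their size (a hypothetical helper, not part of the framework; only the first two lines of the listing above are reproduced here):

```python
# Parse an `aws s3 ls` style listing: columns are date, time, size, key.
listing = """
2022-05-02 08:24:24 15236547 part-00000-42384ef3-05ae-4b64-abd9-9fe48b9852bf-c000.snappy.parquet
2022-05-02 08:24:24 47685934 part-00001-42384ef3-05ae-4b64-abd9-9fe48b9852bf-c000.snappy.parquet
"""

rows = [line.split() for line in listing.strip().splitlines()]
sizes = [int(parts[2]) for parts in rows]

print(f"{len(rows)} part-files, {sum(sizes)} bytes total")
```

Each `part-NNNNN-*.parquet` object is one output file written by a Spark task; their count reflects the number of partitions of the DataFrame at write time.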
Now, as per my understanding, EMR is basically an HDFS cluster with many nodes. I also believe that data files are stored in a distributed fashion on HDFS, with a portion of the data on each node and a certain degree of replication.
So my question is: if the data is stored on S3, then it is not being stored on HDFS, right? And since the data is stored on S3 instead, why do we need HDFS at all? Is EMR, in this case, just being used as a distributed environment for processing the data Spark shuffles between nodes (during a wide transformation), with HDFS not storing the actual data, which instead lives on S3?
In other words, is it the case that in this environment the input data is stored on S3 and read by the PySpark framework, which in turn uses the HDFS nodes only to process data in a distributed fashion, redistributing it during shuffles?
Upvotes: 0
Views: 1653
Reputation: 20042
You're confusing the two. You can't simply use S3 in EMR as a drop-in replacement for the Hadoop HDFS file system.
HDFS and the EMR File System (EMRFS), which uses Amazon S3, are both compatible with Amazon EMR, but they're not interchangeable. HDFS is an implementation of the Hadoop FileSystem API, which models POSIX file system behavior. EMRFS exposes Amazon S3, which is an object store, not a file system.
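One concrete consequence of the object-store/file-system distinction is rename semantics, which a small pure-Python illustration (not EMR code; the object store is simulated with a dict) can show: on a POSIX file system a rename is a cheap metadata operation, while an object store has no rename primitive, so a client must copy the bytes under the new key and delete the old object.

```python
import os
import tempfile

# --- File system: rename moves the directory entry; data is not rewritten.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "part-00000.parquet")
with open(src, "wb") as f:
    f.write(b"x" * 1_000_000)

dst = os.path.join(tmp, "renamed.parquet")
os.rename(src, dst)  # O(1) metadata operation

# --- Object store (simulated): no rename exists, so "rename" is copy + delete.
store = {"part-00000.parquet": b"x" * 1_000_000}
store["renamed.parquet"] = store["part-00000.parquet"]  # full copy of the bytes
del store["part-00000.parquet"]
```

This is why job-commit patterns that rely on cheap atomic renames behave very differently against S3 than against HDFS.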
You might benefit from reading the Hadoop documentation on Object Stores vs. Filesystems.
To learn more about AWS EMR storage and file systems, and when to use which, read this.
Finally, if you find this useful, don't forget to read this.
Upvotes: 2