Reputation: 783
I am a PySpark newbie and have recently joined a project that uses an ETL framework/pipeline developed in PySpark. The framework ingests CSV files from S3 (by reading the data into a DataFrame), processes them, and loads them into Hive tables (the staging layer). The framework accepts an ETL config file which contains transformations in the form of Spark SQL statements (using temp views). It then reads the data from the staging tables, applies these transformations, and finally loads the data into the Hive tables in the target DWH.
The jobs performing the transformations mentioned above are submitted to an EMR cluster running Spark (2.4) on YARN.
What is the relationship between S3 and HDFS (on EMR) in this case? I have asked others on the team but didn't get a complete picture.
Now, as per my understanding, the input files, along with the underlying data files of the Hive tables, are stored on S3. When I run the following ls command for a particular table, it lists the 10 part-files that make up the table:
aws s3 ls s3://my_bucket/cust_dw/cust_dm_customer_dtls/
2022-05-02 08:24:24 15236547 part-00000-42384ef3-05ae-4b64-abd9-9fe48b9852bf-c000.snappy.parquet
2022-05-02 08:24:24 47685934 part-00001-42384ef3-05ae-4b64-abd9-9fe48b9852bf-c000.snappy.parquet
2022-05-02 08:24:24 59204612 part-00002-42384ef3-05ae-4b64-abd9-9fe48b9852bf-c000.snappy.parquet
2022-05-02 08:24:24 81294375 part-00003-42384ef3-05ae-4b64-abd9-9fe48b9852bf-c000.snappy.parquet
2022-05-02 08:24:24 84390123 part-00004-42384ef3-05ae-4b64-abd9-9fe48b9852bf-c000.snappy.parquet
2022-05-02 08:24:24 19238712 part-00005-42384ef3-05ae-4b64-abd9-9fe48b9852bf-c000.snappy.parquet
2022-05-02 08:24:24 19348723 part-00006-42384ef3-05ae-4b64-abd9-9fe48b9852bf-c000.snappy.parquet
2022-05-02 08:24:24 28934198 part-00007-42384ef3-05ae-4b64-abd9-9fe48b9852bf-c000.snappy.parquet
2022-05-02 08:24:24 18398123 part-00008-42384ef3-05ae-4b64-abd9-9fe48b9852bf-c000.snappy.parquet
2022-05-02 08:24:24 93478230 part-00009-42384ef3-05ae-4b64-abd9-9fe48b9852bf-c000.snappy.parquet
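As a quick sanity check on a listing like this, the `aws s3 ls` output can be parsed to count the part-files and total their size (a hypothetical helper, not part of the framework; only the first two lines of the listing above are reproduced here):

```python
# Parse an `aws s3 ls` style listing: columns are date, time, size, key.
listing = """
2022-05-02 08:24:24 15236547 part-00000-42384ef3-05ae-4b64-abd9-9fe48b9852bf-c000.snappy.parquet
2022-05-02 08:24:24 47685934 part-00001-42384ef3-05ae-4b64-abd9-9fe48b9852bf-c000.snappy.parquet
"""

rows = [line.split() for line in listing.strip().splitlines()]
sizes = [int(parts[2]) for parts in rows]

print(f"{len(rows)} part-files, {sum(sizes)} bytes total")
```

Each `part-NNNNN-*.parquet` object is one output file written by a Spark task; their count reflects the number of partitions of the DataFrame at write time.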
Now, as per my understanding, EMR is basically an HDFS cluster with many nodes. I also believe that data files are stored in a distributed fashion on HDFS, with a portion of the data on each node and a certain degree of replication.
So my question is: if the data is stored on S3, then it is not being stored on HDFS, right? And since the data is stored on S3 instead, why do we need HDFS at all? Is EMR, in this case, just being used as a distributed environment for processing the data Spark shuffles between nodes (during a wide transformation), with HDFS not storing the actual data, which instead lives on S3?
In other words, is it the case that in this environment the input data is stored on S3 and read by the PySpark framework, which in turn uses the HDFS nodes only to process data in a distributed fashion, redistributing it during shuffles?
Upvotes: 0
Views: 1653
Reputation: 20042
You're confusing the two. You can't simply use S3 in EMR as a drop-in replacement for the Hadoop HDFS file system.
HDFS and the EMR File System (EMRFS), which uses Amazon S3, are both compatible with Amazon EMR, but they're not interchangeable. HDFS is an implementation of the Hadoop FileSystem API, which models POSIX file system behavior. EMRFS exposes Amazon S3, which is an object store, not a file system.
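One concrete consequence of the object-store/file-system distinction is rename semantics, which a small pure-Python illustration (not EMR code; the object store is simulated with a dict) can show: on a POSIX file system a rename is a cheap metadata operation, while an object store has no rename primitive, so a client must copy the bytes under the new key and delete the old object.

```python
import os
import tempfile

# --- File system: rename moves the directory entry; data is not rewritten.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "part-00000.parquet")
with open(src, "wb") as f:
    f.write(b"x" * 1_000_000)

dst = os.path.join(tmp, "renamed.parquet")
os.rename(src, dst)  # O(1) metadata operation

# --- Object store (simulated): no rename exists, so "rename" is copy + delete.
store = {"part-00000.parquet": b"x" * 1_000_000}
store["renamed.parquet"] = store["part-00000.parquet"]  # full copy of the bytes
del store["part-00000.parquet"]
```

This is why job-commit patterns that rely on cheap atomic renames behave very differently against S3 than against HDFS.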
You might benefit from reading the Hadoop documentation on Object Stores vs. Filesystems.
To learn more about AWS EMR storage and file systems, and when to use which, read this.
Finally, if you find this useful, don't forget to read this.
Upvotes: 2