Reputation: 1517
I have a bunch of small (1KB to 1MB) text files stored in Amazon S3 that I would like to process using Amazon EMR's Hadoop.
Each record given to the mapper needs to contain the entire contents of a text file as well as some way to determine the filename, so I cannot use the default TextInputFormat.
What is the best way to accomplish this? Is there anything else I can do (like copying the files from S3 to HDFS) to increase performance?
Upvotes: 0
Views: 885
Reputation: 2167
In my opinion, the best approach would be to create an external table over the CSV files and load it into another table, stored back in an S3 bucket in Parquet format. You will not have to write any script in that case, just a few SQL queries.
CREATE EXTERNAL TABLE databasename.CSV_EXT_Module(
recordType BIGINT,
servedIMSI BIGINT,
ggsnAddress STRING,
chargingID BIGINT,
...
...
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://module/input/csv_files/'
TBLPROPERTIES ("skip.header.line.count"="1");
The table above is just an external table mapped onto the CSV files.
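You can sanity-check the mapping with a quick query first, for example:
-- reads a few rows straight from the CSV files in S3
SELECT recordType, servedIMSI, ggsnAddress
FROM databasename.CSV_EXT_Module
LIMIT 10;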
Create another table on top of it if you want the query to run faster:
CREATE TABLE databasename.RAW_Module
STORED AS PARQUET
LOCATION 's3://module/raw/parquet_files/'
AS SELECT
recordType,
servedIMSI,
ggsnAddress,
chargingID,
...
regexp_extract(INPUT__FILE__NAME,'(.*)/(.*)',2) as filename
FROM databasename.CSV_EXT_Module;
Adjust the regexp_extract call to pull out whatever part of the input file name you need.
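For example, with a hypothetical object path, group 2 of that pattern is the bare file name (group 1 would be the directory prefix):
-- hypothetical path; returns 'file1.csv' (Hive 0.13+ allows SELECT without a FROM clause)
SELECT regexp_extract('s3://module/input/csv_files/file1.csv', '(.*)/(.*)', 2);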
Upvotes: 1
Reputation: 1802
I had the same issue; please refer to the following approach.
If you don't have any large files, but do have a lot of files, it's sufficient to use the s3cmd get --recursive s3://<url> . command. After the files have been retrieved onto the EMR instance, you can create tables with Hive. For example, you can load whole files with the LOAD DATA statement, using partitions.
sample
This is a sample script:
#!/bin/bash
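# copy every object under the bucket to the local filesystem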
s3cmd get --recursive s3://your.s3.name .
# create table with partitions
hive -e "SET mapred.input.dir.recursive=true; DROP TABLE IF EXISTS import_s3_data;"
hive -e "CREATE TABLE import_s3_data( rawdata string )
PARTITIONED BY (tier1 string, tier2, string, tier3 string);"
LOAD_SQL=""
# collect files as array
FILES=(`find . -name \*.txt -print`)
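# build one LOAD DATA statement per file, deriving the partition values from its directory path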
for FILE in ${FILES[@]}
do
DIR_INFO=(`echo ${FILE##./} | tr -s '/' ' '`)
T1=${DIR_INFO[0]}
T2=${DIR_INFO[1]}
T3=${DIR_INFO[2]}
LOAD_SQL="${LOAD_SQL} LOAD DATA LOCAL INPATH '${FILE}' INTO TABLE
import_s3_data PARTITION (tier1 = '${T1}', tier2 = '${T2}', tier3 = '${T3}');"
done
hive -e "${LOAD_SQL}"
another option
I think there are some other options than s3cmd get for retrieving small S3 data. They might be more effective in a case where there are many large raw or gzipped files on S3.
Upvotes: 1