Reputation: 173
I am trying to fetch a subset of records from a CSV stored in an S3 bucket using the following code:
import boto3
import pandas as pd
from io import StringIO

s3 = boto3.client('s3')
bucket = bucket  # placeholder: the bucket name
file_name = file  # placeholder: the object key
sql_stmt = """SELECT S.* FROM s3object S LIMIT 10"""
req = s3.select_object_content(
    Bucket=bucket,
    Key=file_name,
    ExpressionType='SQL',
    Expression=sql_stmt,
    InputSerialization={'CSV': {'FileHeaderInfo': 'USE'}},
    OutputSerialization={'CSV': {}},
)
records = []
# The response payload is an event stream; collect the raw record chunks.
for event in req['Payload']:
    if 'Records' in event:
        records.append(event['Records']['Payload'])
    elif 'Stats' in event:
        stats = event['Stats']['Details']
file_str = ''.join(r.decode('utf-8') for r in records)
select_df = pd.read_csv(StringIO(file_str))
df = pd.DataFrame(select_df)
print(df)
This successfully returns the records, but the header row is missing.
I read in "S3 Select CSV Headers" that S3 Select does not return headers at all. So, is there any other way to retrieve the headers of a CSV file in S3?
Upvotes: 5
Views: 6209
Reputation: 543
In short:
FileHeaderInfo (string) -- Describes the first line of input.
Valid values are:
NONE : First line is not a header.
IGNORE : First line is a header, but you can't use the header values to indicate the column in an expression. You can use column position (such as _1, _2, …) to indicate the column (SELECT s._1 FROM OBJECT s ).
USE : First line is a header, and you can use the header value to identify a column in an expression (SELECT "name" FROM OBJECT ).
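To make the three modes concrete, here is a minimal local sketch of which rows a plain `SELECT *` would return under each setting. It does not call S3: `simulate_select_all` is a hypothetical helper that only mimics the documented treatment of the first line, and the sample data is made up.

```python
import csv
import io


def simulate_select_all(csv_text, file_header_info):
    """Mimic which rows `SELECT *` returns for each FileHeaderInfo mode.

    Local illustration only; this is not how S3 Select is implemented.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if file_header_info == "NONE":
        # First line is not a header, so it is returned like any other row.
        return rows
    if file_header_info in ("USE", "IGNORE"):
        # First line is a header and is left out of the output.
        return rows[1:]
    raise ValueError(f"unknown FileHeaderInfo: {file_header_info!r}")


sample = "name,age\nalice,30\nbob,25\n"  # made-up data
for mode in ("NONE", "IGNORE", "USE"):
    print(mode, simulate_select_all(sample, mode))
```

Only NONE keeps the first line in the output, which is why the header shows up there but column names such as s."name" are not available in the expression.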
Upvotes: 2
Reputation: 303
Red Boy's solution doesn't let you use the column names in the query; you have to use column indexes instead. That wasn't good enough for me, so my solution was to run a second query that fetches only the headers and concatenate its result with the actual query result. This is in JavaScript, but the same should apply to Python:
const params = {
    Bucket: bucket,
    Key: "file.csv",
    ExpressionType: 'SQL',
    Expression: `select * from s3object s where s."date" >= '${fromDate}'`,
    InputSerialization: {'CSV': {"FileHeaderInfo": "USE"}},
    OutputSerialization: {'CSV': {}},
};
// S3 Select doesn't return the headers, so run another query to get only the headers (see '{"FileHeaderInfo": "NONE"}')
const headerParams = {
    Bucket: bucket,
    Key: "file.csv",
    ExpressionType: 'SQL',
    Expression: "select * from s3object s limit 1", // gets only the first record of the CSV; since headers are not parsed, that record is the header line
    InputSerialization: {'CSV': {"FileHeaderInfo": "NONE"}},
    OutputSerialization: {'CSV': {}},
};
// concatenate header + data -- getObject is a method that handles the request
return await this.getObject(s3, headerParams) + await this.getObject(s3, params);
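For Python, the same two-query idea might look roughly like the sketch below. The helper names `run_select` and `select_with_header` are hypothetical, and the `s3` argument is assumed to be a client created with `boto3.client('s3')`; the code is untested against a real bucket.

```python
def run_select(s3, bucket, key, expression, file_header_info):
    """Run one S3 Select query and return its CSV output as a string."""
    response = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=expression,
        InputSerialization={"CSV": {"FileHeaderInfo": file_header_info}},
        OutputSerialization={"CSV": {}},
    )
    # Join the raw chunks before decoding so that a multi-byte character
    # split across two events is still decoded correctly.
    chunks = [
        event["Records"]["Payload"]
        for event in response["Payload"]
        if "Records" in event
    ]
    return b"".join(chunks).decode("utf-8")


def select_with_header(s3, bucket, key, expression):
    """Return the header line followed by the result of `expression`."""
    # With NONE the first line is treated as data, so LIMIT 1 returns the header.
    header = run_select(s3, bucket, key, "SELECT * FROM s3object s LIMIT 1", "NONE")
    # With USE the expression can refer to columns by name.
    data = run_select(s3, bucket, key, expression, "USE")
    return header + data
```

The returned string can be passed to pd.read_csv(StringIO(...)) as in the question. Note that the header line only lines up with the data when the main query selects all columns; a query that projects specific columns would need a matching header.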
Upvotes: 2
Reputation: 5729
Change InputSerialization={'CSV': {"FileHeaderInfo": "USE"}},
to InputSerialization={'CSV': {"FileHeaderInfo": "NONE"}},
Then it will print the full content, including the header.
Explanation:
FileHeaderInfo accepts one of "NONE", "USE", or "IGNORE". Use the NONE option rather than USE and the header will be printed as well: NONE tells S3 Select that the first line is not a header, so it is treated as an ordinary row and returned with the rest of the data.
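One consequence worth keeping in mind: because the header is treated as an ordinary row under NONE, it also counts toward any LIMIT, so `LIMIT 10` would return the header plus nine data rows. A small sketch of separating the two afterwards, using a hypothetical helper `split_header` and made-up output text:

```python
import csv
import io


def split_header(select_output):
    """Split CSV text returned with FileHeaderInfo='NONE' into (header, rows)."""
    rows = list(csv.reader(io.StringIO(select_output)))
    if not rows:
        return [], []
    return rows[0], rows[1:]


# Made-up output of "SELECT * FROM s3object s LIMIT 3" with FileHeaderInfo='NONE'
output = "name,age\nalice,30\nbob,25\n"
header, rows = split_header(output)
print(header)  # ['name', 'age']
print(rows)    # [['alice', '30'], ['bob', '25']]
```

If the text is passed to pd.read_csv instead, pandas takes the first line as the column names by default, so no manual splitting is needed there.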
Here is the reference: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.select_object_content
I hope it helps.
Upvotes: 2