Reputation: 5
I am trying to use Boto3 to 'query' a .CSV within an S3 bucket and load the data into a Pandas DataFrame object. It is 'working', but almost all of the data ends up in a single column.
Here is the Python (thanks to 20 Chrome tabs and Stack Overflow threads):
import pandas as pd
import boto3
import io

s3 = boto3.client(service_name='s3',
                  aws_access_key_id='redacted',
                  aws_secret_access_key='redacted')

# just selecting everything until I get this proof of concept finished
query = """SELECT *
FROM S3Object"""

obj = s3.select_object_content(
    Bucket='redacted',
    Key='redacted',
    ExpressionType='SQL',
    Expression=query,
    # Note: RecordDelimiter='|' makes every pipe end a record, so each
    # value becomes its own one-field record (which is why everything
    # lands in a single column).
    InputSerialization={'CSV': {'FileHeaderInfo': 'Use',
                                'RecordDelimiter': '|'}},
    OutputSerialization={'CSV': {}})

# The response payload is a stream of events; collect the Records chunks.
records = []
for event in obj['Payload']:
    if 'Records' in event:
        records.append(event['Records']['Payload'])
    elif 'Stats' in event:
        stats = event['Stats']['Details']

file_str = ''.join(r.decode('utf-8') for r in records)
df = pd.read_csv(io.StringIO(file_str))
This is what the .CSV in the S3 bucket looks like:
Field_1
"HeaderA""|""HeaderB""|""HeaderC""|""HeaderD"
"valueA1""|""valueB1""|""valueC1""|""valueD1"
"valueA2""|""valueB2""|""valueC2""|""valueD2"
"valueA3""|""valueB3""|""valueC3""|""valueD3"
.
.
.
"valueAn""|""valueBn""|""valueCn""|""valueDn"
And here is my current Dataframe output:
HeaderB
------------
HeaderC
HeaderD
valueA1
valueB1
valueC1
valueD1
valueA2
valueB2
valueC2
valueD2
...
valueDn
Desired output is 4 columns by n rows (plus headers). Any ideas on how to fix this?
Edit:
InputSerialization={'CSV': {'FileHeaderInfo': 'None',
                            'FieldDelimiter': '"',
                            'AllowQuotedRecordDelimiter': True
                            }}
That got me 95% of the way there; the pipes just showed up as extra columns in the dataframe. Solution:
for col in df.columns:
    if col[0] == '|':
        df = df.drop(col, axis=1)
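The same cleanup can also be done in one pass (an equivalent sketch, not from the original post):

# Drop every column whose name starts with a pipe in a single call.
df = df.drop(columns=[c for c in df.columns if c.startswith('|')])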
Edit 2: This solution works when pulling the entire CSV with SELECT *. Now that this works, I've moved on to the next proof of concept, which is using a more specific query. There were some discrepancies between what was returned and what I could verify by looking directly at the CSV. I think this is due to the first line of the CSV being Field_1, followed by the actual header fields and record values. My current theory is that with this first line removed from the original input, I will be able to field-delimit on the quoted pipe, record-delimit on the newline, and get the results I want. I am reaching out to the team responsible for these S3 dumps to see if the first line can be removed.
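In the meantime, a purely client-side fallback (a sketch, not from the original post; it assumes the object fits in memory and that the quoting is exactly as shown above) is to skip S3 Select and split the quoted-pipe lines in pandas:

import pandas as pd

# Sketch: pull the whole object and parse the quoted-pipe format client-side.
# Assumes the same s3 client, bucket, and key as above (still 'redacted' here).
body = s3.get_object(Bucket='redacted', Key='redacted')['Body'].read().decode('utf-8')
lines = body.splitlines()[1:]                 # drop the stray 'Field_1' line
rows = [line.strip('"').split('""|""') for line in lines]
df = pd.DataFrame(rows[1:], columns=rows[0])  # first remaining line holds the headers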
Upvotes: 0
Views: 2457
Reputation: 5648
New CSV file:
Field_1
"HeaderA""|""HeaderB""|""HeaderC""|""HeaderD"
"a_val1""|""bv3""|""1""|""10"
"a_val2""|""bv4""|""1""|""20"
"a_val3""|""bv4""|""3""|""40"
"a_val4""|""bv6""|""4""|""40"
def get_results(query):
    obj = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType='SQL',
        Expression=query,
        InputSerialization={'CSV': {'FileHeaderInfo': 'IGNORE',
                                    'FieldDelimiter': '"',
                                    'AllowQuotedRecordDelimiter': True
                                    }},
        OutputSerialization={'CSV': {}})
    # print(list(obj['Payload']))

    records = []
    for event in obj['Payload']:
        if 'Records' in event:
            records.append(event['Records']['Payload'])
        elif 'Stats' in event:
            stats = event['Stats']['Details']

    file_str = ''.join(r.decode('utf-8') for r in records)
    df = pd.read_csv(io.StringIO(file_str))
    # df = df.filter(regex='Header')
    return df
To get this to work, ignore the header (the first line of the file) and then search for it explicitly in the WHERE/AND clause; matching 'HeaderB' in the OR keeps the header row in the result, so pandas can pick up the column names. Figuring out the column positions is the time-consuming part.
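Why those positions: with FieldDelimiter='"', every literal quote splits the record, so the real values interleave with empty strings and the pipe characters. A quick local check (a sketch that emulates the split, not part of the original answer):

# Emulate the '"'-delimited split to see which positional columns hold data.
row = '"a_val1""|""bv3""|""1""|""10"'
for i, field in enumerate(row.split('"'), start=1):
    print(f's._{i} = {field!r}')
# The real values land at s._2, s._6, s._10, and s._14; every other
# position is an empty string or a literal '|'.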
query = '''SELECT s._2, s._6, s._10, s._14 FROM S3Object s WHERE s._6 = 'bv4' OR s._6 = 'HeaderB' '''
get_results(query)

query = '''SELECT s._2, s._6 FROM S3Object s WHERE s._6 = 'bv4' OR s._6 = 'HeaderB' '''
get_results(query)
Here are the outputs of the two queries:
HeaderA HeaderB HeaderC HeaderD
0 a_val2 bv4 1 20
1 a_val3 bv4 3 40
HeaderA HeaderB
0 a_val2 bv4
1 a_val3 bv4
Upvotes: 4