David Michael

Reputation: 5

Boto3 s3 Select CSV to Pandas Dataframe-- trouble delimiting

I am trying to use Boto3 to 'query' a .CSV within an s3 bucket and read the results into a Pandas DataFrame. It is 'working'-- but with almost all of the data landing in a single column.

Here is the Python (thanks to 20 Chrome tabs and Stack Overflow threads):

import pandas as pd
import boto3
import io

s3 = boto3.client(service_name='s3',
                  aws_access_key_id = 'redacted',
                  aws_secret_access_key = 'redacted')

#just selecting everything until I get this proof of concept finished
query = """SELECT *
        FROM S3Object"""

obj = s3.select_object_content(
    Bucket='redacted',
    Key='redacted',
    ExpressionType='SQL',
    Expression=query,
    InputSerialization={'CSV': {'FileHeaderInfo': 'Use', 
                                'RecordDelimiter': '|'}},
    OutputSerialization={'CSV': {}})

records = []
for event in obj['Payload']:
    if 'Records' in event:
        records.append(event['Records']['Payload'])
    elif 'Stats' in event:
        stats = event['Stats']['Details']

file_str = ''.join(r.decode('utf-8') for r in records)

df = pd.read_csv(io.StringIO(file_str))

This is what the .CSV in the s3 bucket looks like:

Field_1
"HeaderA""|""HeaderB""|""HeaderC""|""HeaderD"
"valueA1""|""valueB1""|""valueC1""|""valueD1"
"valueA2""|""valueB2""|""valueC2""|""valueD2"
"valueA3""|""valueB3""|""valueC3""|""valueD3"
.
.
.
"valueAn""|""valueBn""|""valueCn""|""valueDn"
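(Not part of the question, just a local illustration of the problem: every value in the dump is wrapped in its own quotes, so the `"` character, not the pipe, is the effective field boundary. Splitting a sample line shows what S3 Select is actually seeing.)

```python
# One header row from the dump, exactly as it appears in the file
line = '"HeaderA""|""HeaderB""|""HeaderC""|""HeaderD"'

# Splitting on the double quote: the real values land between quotes,
# and the pipes become stray one-character fields of their own
tokens = line.split('"')
print(tokens)
```

The headers come out at every fourth slot, separated by bare `|` tokens and empty strings, which is why delimiting on the pipe alone collapses everything into one column.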

And here is my current Dataframe output:

HeaderB
------------
HeaderC
HeaderD
valueA1
valueB1
valueC1
valueD1
valueA2
valueB2
valueC2
valueD2
...
valueDn

Desired output is 4 columns by n rows (plus headers). Any ideas on how to fix this?

. . . Edit:

InputSerialization={'CSV': {'FileHeaderInfo': 'None',
                            'FieldDelimiter': '"',
                            'AllowQuotedRecordDelimiter': True
                           }}

That got me 95% of the way there. The pipes were added as columns in the dataframe. Solution:

for col in df.columns:
    if col[0] == '|':
        df = df.drop(col, axis = 1)
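For what it's worth, the same cleanup can be done without the loop using pandas' vectorized string methods (a sketch, with a made-up frame standing in for the S3 Select result):

```python
import pandas as pd

# Hypothetical DataFrame shaped like the S3 Select output, where the
# stray pipes came through as their own columns
df = pd.DataFrame([[1, '|', 2], [3, '|', 4]],
                  columns=['HeaderA', '|', 'HeaderB'])

# Keep only the columns whose names do NOT start with a pipe
df = df.loc[:, ~df.columns.str.startswith('|')]
print(list(df.columns))
```

Using `startswith` (rather than `col[0] == '|'`) also catches the `|.1`, `|.2`, ... names pandas generates when `read_csv` deduplicates repeated `|` column headers.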

Edit 2: This solution works when pulling the entire CSV with SELECT *. Now that this works, I've moved on to the next proof of concept, which is using a more specific query. There were some discrepancies with what was returned vs. what I could verify by looking directly at the CSV. I think this is due to the first line of the CSV being Field_1, followed by the actual header fields and record values. My current theory is that with this first line removed from the original input, I will be able to field-delimit on the quoted pipe and record-delimit on the newline and get the results I want. I am reaching out to the team responsible for these s3 dumps to see if the first line can be removed.
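In the meantime, the banner line can also be stripped client-side before the text reaches pandas (a sketch; `file_str` here is a stand-in for the joined S3 Select payload):

```python
import io
import pandas as pd

# Stand-in for the decoded S3 Select payload: the stray "Field_1"
# banner line precedes the real header row
file_str = 'Field_1\nHeaderA,HeaderB\n1,2\n3,4\n'

# Drop everything up to and including the first newline, then parse
_, _, body = file_str.partition('\n')
df = pd.read_csv(io.StringIO(body))
print(df.shape)
```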

Upvotes: 0

Views: 2457

Answers (1)

Jonathan Leon

Reputation: 5648

New CSV file:

Field_1
"HeaderA""|""HeaderB""|""HeaderC""|""HeaderD"
"a_val1""|""bv3""|""1""|""10"
"a_val2""|""bv4""|""1""|""20"
"a_val3""|""bv4""|""3""|""40"
"a_val4""|""bv6""|""4""|""40"

def get_results(query):
    obj = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType='SQL',
        Expression=query,
        InputSerialization={'CSV': {'FileHeaderInfo': 'IGNORE',
                                    'FieldDelimiter': '"',
                                    'AllowQuotedRecordDelimiter': True
                                    }},
        OutputSerialization={'CSV': {}})

    # print(list(obj['Payload']))
    records = []
    for event in obj['Payload']:
        if 'Records' in event:
            records.append(event['Records']['Payload'])
        elif 'Stats' in event:
            stats = event['Stats']['Details']

    file_str = ''.join(r.decode('utf-8') for r in records)

    df = pd.read_csv(io.StringIO(file_str))
    # df = df.filter(regex='Header')
    return df

To get this to work, ignore the headers (the first line of the file) and then search for them explicitly in the WHERE clause. Figuring out the column positions is the time-consuming part.

query = '''SELECT s._2, s._6, s._10, s._14 FROM S3Object s where s._6 = 'bv4' or s._6 = 'HeaderB' '''
query = '''SELECT s._2, s._6 FROM S3Object s where s._6 = 'bv4' or s._6 = 'HeaderB' '''

get_results(query)

Here are the outputs of the two queries

  HeaderA HeaderB  HeaderC  HeaderD
0  a_val2     bv4        1       20
1  a_val3     bv4        3       40

  HeaderA HeaderB
0  a_val2     bv4
1  a_val3     bv4
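The position numbers line up with where each header lands when the record is split on the `"` delimiter, so they can be worked out locally instead of by trial and error (a sketch, assuming S3 Select numbers fields `s._1, s._2, ...` in split order, which matches the queries above):

```python
# Header row exactly as it appears in the file
header_line = '"HeaderA""|""HeaderB""|""HeaderC""|""HeaderD"'

# Split on the '"' field delimiter; real headers land among empty
# strings and bare pipes. Positional names are 1-based, so index i
# maps to s._{i + 1}.
positions = {name: f's._{i + 1}'
             for i, name in enumerate(header_line.split('"'))
             if name and name != '|'}
print(positions)
```

This reproduces the `s._2, s._6, s._10, s._14` positions used in the first query.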

Upvotes: 4
