I am using awswrangler for my project in Glue. The project should be pretty straightforward. I have a dataframe with a single column "A". This is the content of the dataframe:
import pandas as pd

data = {'A': [['1A', '5A'], ['1A', '19A'], ['1A', '26A'], ['2A', '4A']]}
df = pd.DataFrame(data)
print(df)
#OUTPUT--------------------
A
0 [1A, 5A]
1 [1A, 19A]
2 [1A, 26A]
3 [2A, 4A]
I then converted the content of the dataframe into this format using apply():
df['formatted_data'] = df['A'].apply(to_dict)
my_df = df  # same frame; it is called my_df from here on
print(my_df)
#OUTPUT---------------------------------
A formatted_data
0 [1A, 5A] {"1A": True, "5A": True}
1 [1A, 19A] {"1A": True, "19A": True}
2 [1A, 26A] {"1A": True, "26A": True}
3 [2A, 4A] {"2A": True, "4A": True}
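The to_dict helper itself is not shown above. A minimal sketch of what it does, assuming it simply maps every element of the list to True (the function body here is illustrative, not copied from my project):

```python
def to_dict(values):
    """Hypothetical helper: map each element of a list to True."""
    return {value: True for value in values}


# Same rows as column "A" in the example dataframe.
rows = [['1A', '5A'], ['1A', '19A'], ['1A', '26A'], ['2A', '4A']]

# Equivalent to df['A'].apply(to_dict), but without needing pandas.
formatted = [to_dict(row) for row in rows]
print(formatted[0])
```

Each resulting dictionary contains only the keys from its own row.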
I achieve this by building a dictionary in which every value from the original column is set to True. So far so good: it is still a regular pandas dataframe. However, once I try to save this dataframe using awswrangler with plain old to_parquet():
import awswrangler as wr

print(my_df)
wr.s3.to_parquet(
    df=my_df,
    path=f"s3://{path}",
    dataset=True
)
I get the following result in the generated file:
A formatted_data
0 [1A, 5A] [{'19A': None, '1A': 'true', '26A': None, '2A': None, '3A': None, '4A': None, '5A': None},{'19A': None, '1A': None, '26A': None, '2A': None, '3A': None, '4A': None, '5A': 'true'}]
1 [1A, 19A] [{'19A': None, '1A': 'true', '26A': None, '2A': None, '3A': None, '4A': None, '5A': None},{'19A': 'true', '1A': None, '26A': None, '2A': None, '3A': None, '4A': None, '5A': None}]
2 [1A, 26A] [{'19A': None, '1A': 'true', '26A': None, '2A': None, '3A': None, '4A': None, '5A': None},{'19A': None, '1A': None, '26A': 'true', '2A': None, '3A': None, '4A': None, '5A': None}]
3 [2A, 4A] [{'19A': None, '1A': None, '26A': None, '2A': 'true', '3A': None, '4A': None, '5A': None},{'19A': None, '1A': None, '26A': None, '2A': None, '3A': None, '4A': 'true', '5A': None}]
For whatever reason, wrangler turns each value into a list of its own making. It collects every key that appears anywhere in the column, and for the keys that are not in that row's column "A" it adds entries set to None. Each row should contain only 2 key:value pairs, but wrangler adds all possible keys and sets the missing ones to None.
Do you know what might be the issue here? Am I losing it? Am I crazy? I have some other settings in my code, but I am not making any changes between the print(my_df) and wr.s3.to_parquet() lines.