Reputation: 317
I am reading a number of csv files in using a loop, all have 38 columns. I add them all to a list and then concatenate/create a dataframe. My issue is that despite all these csv files having 38 columns, my resultant dataframe somehow ends up with 105 columns.
Here is a screenshot:
How can I make the resultant dataframe have the correct 38 columns and stack all of rows on top of each other?
import boto3
import pandas as pd
import io
s3 = boto3.resource('s3')
client = boto3.client('s3')
bucket = s3.Bucket('alpha-enforcement-data-engineering')
appended_data = []
for obj in bucket.objects.filter(Prefix='closed/closed_processed/year_201'):
print(obj.key)
df = pd.read_csv(f's3://alpha-enforcement-data-engineering/{obj.key}', low_memory=False)
print(df.shape)
appended_data.append(df)
df_closed = pd.concat(appended_data, axis=0, sort=False)
print(df_closed.shape)
Upvotes: 0
Views: 421
Reputation: 402653
TLDR; check your column headers.
c = appended_data[0].columns
df_closed = pd.concat([df.set_axis(
c, axis=1, inplace=False) for df in appended_data], sort=False)
This happens because your column headers are different. Pandas will align your DataFrames on the headers when concatenating vertically, and will insert empty columns for DataFrames where that header is not present. Here's an illustrative example:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'C': [7, 8, 9], 'D': [10, 11, 12]})
df
A B
0 1 4
1 2 5
2 3 6
df2
C D
0 7 10
1 8 11
2 9 12
pd.concat([df, df2], axis=0, sort=False)
A B C D
0 1.0 4.0 NaN NaN
1 2.0 5.0 NaN NaN
2 3.0 6.0 NaN NaN
0 NaN NaN 7.0 10.0
1 NaN NaN 8.0 11.0
2 NaN NaN 9.0 12.0
Creates 4 columns. Whereas, you wanted only two. Try,
df2.columns = df.columns
pd.concat([df, df2], axis=0, sort=False)
A B
0 1 4
1 2 5
2 3 6
0 7 10
1 8 11
2 9 12
Which works as expected.
Upvotes: 2