Falc
Falc

Reputation: 317

Concatenating dataframes creates too many columns

I am reading a number of csv files in using a loop, all have 38 columns. I add them all to a list and then concatenate/create a dataframe. My issue is that despite all these csv files having 38 columns, my resultant dataframe somehow ends up with 105 columns.

Here is a screenshot:

image

How can I make the resultant dataframe have the correct 38 columns and stack all of rows on top of each other?

import boto3
import pandas as pd
import io

s3 = boto3.resource('s3')
client = boto3.client('s3')
bucket = s3.Bucket('alpha-enforcement-data-engineering')

appended_data = []

for obj in bucket.objects.filter(Prefix='closed/closed_processed/year_201'):
    print(obj.key)
    df = pd.read_csv(f's3://alpha-enforcement-data-engineering/{obj.key}', low_memory=False)
    print(df.shape)
    appended_data.append(df)

df_closed = pd.concat(appended_data, axis=0, sort=False)


print(df_closed.shape)

Upvotes: 0

Views: 421

Answers (1)

cs95
cs95

Reputation: 402653

TLDR; check your column headers.

c = appended_data[0].columns

df_closed = pd.concat([df.set_axis(
    c, axis=1, inplace=False) for df in appended_data], sort=False)

This happens because your column headers are different. Pandas will align your DataFrames on the headers when concatenating vertically, and will insert empty columns for DataFrames where that header is not present. Here's an illustrative example:

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'C': [7, 8, 9], 'D': [10, 11, 12]})
df
   A  B
0  1  4
1  2  5
2  3  6

df2
   C   D
0  7  10
1  8  11
2  9  12

pd.concat([df, df2], axis=0, sort=False)

     A    B    C     D
0  1.0  4.0  NaN   NaN
1  2.0  5.0  NaN   NaN
2  3.0  6.0  NaN   NaN
0  NaN  NaN  7.0  10.0
1  NaN  NaN  8.0  11.0
2  NaN  NaN  9.0  12.0

Creates 4 columns. Whereas, you wanted only two. Try,

df2.columns = df.columns
pd.concat([df, df2], axis=0, sort=False)

   A   B
0  1   4
1  2   5
2  3   6
0  7  10
1  8  11
2  9  12

Which works as expected.

Upvotes: 2

Related Questions