Reputation: 1259
I am concatenating multiple months of CSVs, where the newer, more recent versions have additional columns. As a result, putting them all together fills certain rows of certain columns with NaN.
The issue with this behavior is that it mixes these NaNs with true null entries from the data set, which need to be easily distinguishable.
My only solution as of now is to replace the original NaNs with a unique string, concatenate the CSVs, replace the new NaNs with a second unique string, and then replace the first unique string with NaN again.
Given the amount of data I am processing, this is a very inefficient solution. I thought there was some way to control how pandas fills these entries in the concatenated DataFrame, but I couldn't find anything on it.
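For reference, a minimal sketch of the placeholder workaround described above (the file names and sentinel strings are made up for illustration):
import numpy as np
import pandas as pd

# Hypothetical monthly files, oldest first
files = ['2018-01.csv', '2018-02.csv', '2018-03.csv']
TRUE_NULL = '__true_null__'   # marks NaNs already present in the data
PREDATED = '__predated__'     # marks NaNs created by the concatenation

frames = [pd.read_csv(f).fillna(TRUE_NULL) for f in files]   # protect real nulls
combined = pd.concat(frames, ignore_index=True)
combined = combined.fillna(PREDATED)            # label NaNs introduced by concat
combined = combined.replace(TRUE_NULL, np.nan)  # restore the real nulls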
Updated example:
A B
1 NaN
2 3
And append
A B C
1 2 3
Gives
A B C
1 NaN NaN
2 3 NaN
1 2 3
But I want
A B C
1 NaN 'predated'
2 3 'predated'
1 2 3
Upvotes: 1
Views: 48
Reputation: 42875
In case you have a core set of columns, as here represented by df1, you could apply .fillna() to the .difference() between the core set and any new columns in more recent DataFrames.
import numpy as np
import pandas as pd

df1 = pd.DataFrame(data={'A': [1, 2], 'B': [np.nan, 3]})
A B
0 1 NaN
1 2 3
df2 = pd.DataFrame(data={'A': 1, 'B': 2, 'C': 3}, index=[0])
A B C
0 1 2 3
df = pd.concat([df1, df2], ignore_index=True)
new_cols = df2.columns.difference(df1.columns).tolist()
df[new_cols] = df[new_cols].fillna(value='predated')
A B C
0 1 NaN predated
1 2 3 predated
2 1 2 3
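The same idea extends to a whole series of monthly CSVs; a sketch, assuming hypothetical file names and that the earliest file defines the core column set:
import pandas as pd

# Hypothetical monthly files, ordered oldest to newest
files = ['2018-01.csv', '2018-02.csv', '2018-03.csv']
frames = [pd.read_csv(f) for f in files]

core_cols = frames[0].columns           # earliest file defines the core set
df = pd.concat(frames, ignore_index=True)

# Fill only the columns that are not in the core set,
# leaving true nulls in the core columns untouched
new_cols = df.columns.difference(core_cols).tolist()
df[new_cols] = df[new_cols].fillna(value='predated')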
Upvotes: 2