Fonti

Reputation: 1259

Choosing a different value for NaN entries from appending DataFrames with different columns

I am concatenating multiple months of CSV files, where the more recent files have additional columns. As a result, concatenating them fills those columns with NaN in the rows that come from the older files.

The issue with this behavior is that it mixes these NaNs with true null entries from the data set, and the two need to be easily distinguishable.

My only solution so far is to replace the original NaNs with a unique string, concatenate the CSVs, replace the new NaNs with a second unique string, and then replace the first unique string with NaN again.

Given the amount of data I am processing, this is a very inefficient solution. I thought there would be some way to control how a pandas DataFrame fills these entries, but I couldn't find anything on it.

Updated example:

A B  
1 NaN  
2 3 

Appending

A B C  
1 2 3  

Gives

A B C  
1 NaN NaN  
2 3 NaN  
1 2 3  

But I want

A B C  
1 NaN 'predated'  
2 3 'predated'  
1 2 3  

Upvotes: 1

Views: 48

Answers (1)

Stefan

Reputation: 42875

If you have a core set of columns, represented here by df1, you can take the .difference() between the columns of a more recent DataFrame and that core set, and apply .fillna() only to those new columns.

import numpy as np
import pandas as pd

df1 = pd.DataFrame(data={'A': [1, 2], 'B': [np.nan, 3]})

   A   B
0  1 NaN
1  2   3

df2 = pd.DataFrame(data={'A': 1, 'B': 2, 'C': 3}, index=[0])

   A  B  C
0  1  2  3

df = pd.concat([df1, df2], ignore_index=True)

new_cols = df2.columns.difference(df1.columns).tolist()
df[new_cols] = df[new_cols].fillna(value='predated')

   A   B         C
0  1 NaN  predated
1  2   3  predated
2  1   2         3
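
Note that .fillna() on the concatenated frame also replaces any genuine NaN that a newer file contains in one of its new columns. A possible variation, sketched here under the assumption that all frames fit in memory as a list, is to align each frame to the union of columns before concatenating, using .reindex() with fill_value. That value is only written into columns the frame did not have, so NaNs already present in the data are left alone. The helper name concat_with_marker and the marker parameter are made up for illustration.

```python
import numpy as np
import pandas as pd


def concat_with_marker(frames, marker="predated"):
    """Concatenate frames, writing `marker` where a frame lacked a column."""
    # Union of all column names, in order of first appearance.
    all_cols = []
    for frame in frames:
        for col in frame.columns:
            if col not in all_cols:
                all_cols.append(col)
    # fill_value only applies to columns that reindex has to create;
    # NaNs that were already in a frame are not touched.
    aligned = [f.reindex(columns=all_cols, fill_value=marker) for f in frames]
    return pd.concat(aligned, ignore_index=True)


df1 = pd.DataFrame({"A": [1, 2], "B": [np.nan, 3]})
df2 = pd.DataFrame({"A": [1], "B": [2], "C": [3]})
result = concat_with_marker([df1, df2])
print(result)
```

Because each frame is marked individually, this should also cover the case where different months are missing different columns, without needing a fixed core set.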

Upvotes: 2
