Reputation: 2087
I have a dataframe with ~12,000 columns and a few hundred rows like the below:
Date id1 id2 id3 id4 ... id12000
1/4/16 100 78 103.5 15 35
1/5/16 100.5 78 104 16 37
1/6/16 99.5 78.5 104.5 16.5 37
1/7/16 100.1 77 104.2 13 37
1/8/16 100.3 76.5 104.2 13.5 33
1/11/16 99.8 77 103.7 13.5 34
...
5/1/17 101.1 82 103.6 12 33
Every day, I get a new dataframe that I need to add to this dataframe. So, for example, I may get a dataframe for May 2nd that looks like the below:
Date id1 id3 id4 id6 ... id12100
5/2/17 100 103 12.5 15 100
But that dataframe will likely contain a few new ID's (columns) every single day, and it may also have a few of the ID's omitted that were previously there.
I would like my new dataframe to add columns every time there is a new ID, and fill all previous values of that column with NaN
's. I'd also like any cells of ID's that were previously in the dataframe to be filled with NaN
's. So, for example, the joining of the two above example dataframes would look like:
Date id1 id2 id3 id4 ... id12000 id12100
1/4/16 100 78 103.5 15 35 NaN
1/5/16 100.5 78 104 16 37 NaN
1/6/16 99.5 78.5 104.5 16.5 37 NaN
1/7/16 100.1 77 104.2 13 37 NaN
1/8/16 100.3 76.5 104.2 13.5 33 NaN
1/11/16 99.8 77 103.7 13.5 34 NaN
...
5/1/17 101.1 82 103.6 12 33 NaN
5/2/17 100 NaN 103 12.5 NaN 100
However, if I run this on a date that's already in the dataframe, I'd like to simply replace the old value with the new value.
I've been wrangling with this problem for a while now and can't quite figure out how I need to do it, between properly formatting the dates so that I can check whether there is a duplicate in the dataframe, to properly appending/joining/merging/etc. the old values with the new ones. I have several different places where a generalized function would be really useful for me, so I'm hoping someone has had the same problem and has an elegant solution that I can use.
Thanks in advance!
Upvotes: 2
Views: 147
Reputation: 153500
I think you need combine_first
be sure Date is your index in both dataframes:
print(df)
id1 id2 id3 id4 id12000
Date
1/4/16 100.0 78.0 103.5 15.0 35
1/5/16 100.5 78.0 104.0 16.0 37
1/6/16 99.5 78.5 104.5 16.5 37
1/7/16 100.1 77.0 104.2 13.0 37
1/8/16 100.3 76.5 104.2 13.5 33
1/11/16 99.8 77.0 103.7 13.5 34
5/1/17 101.1 82.0 103.6 12.0 33
print(df_new)
id1 id3 id4 id6 id12100
Date
5/2/17 100 103 12.5 15 100
Use combine_first
with Date as index:
df_out = df_new.combine_first(df)
print(df_out)
Output:
id1 id12000 id12100 id2 id3 id4 id6
Date
1/11/16 99.8 34.0 NaN 77.0 103.7 13.5 NaN
1/4/16 100.0 35.0 NaN 78.0 103.5 15.0 NaN
1/5/16 100.5 37.0 NaN 78.0 104.0 16.0 NaN
1/6/16 99.5 37.0 NaN 78.5 104.5 16.5 NaN
1/7/16 100.1 37.0 NaN 77.0 104.2 13.0 NaN
1/8/16 100.3 33.0 NaN 76.5 104.2 13.5 NaN
5/1/17 101.1 33.0 NaN 82.0 103.6 12.0 NaN
5/2/17 100.0 NaN 100.0 NaN 103.0 12.5 15.0
Upvotes: 2
Reputation: 210882
In [43]: df
Out[43]:
Date id1 id2 id3 id4
0 1/4/16 100.0 78.0 103.5 15.0
1 1/5/16 100.5 78.0 104.0 16.0
2 1/6/16 99.5 78.5 104.5 16.5
In [44]: new
Out[44]:
Date id1 id3 id4 id6 id12100
0 1/6/16 11 11 11.0 11 111
1 5/2/17 100 103 12.5 15 100
In [45]: df.set_index('Date').append(new.set_index('Date'))
Out[45]:
id1 id12100 id2 id3 id4 id6
Date
1/4/16 100.0 NaN 78.0 103.5 15.0 NaN
1/5/16 100.5 NaN 78.0 104.0 16.0 NaN
1/6/16 99.5 NaN 78.5 104.5 16.5 NaN
1/6/16 11.0 111.0 NaN 11.0 11.0 11.0
5/2/17 100.0 100.0 NaN 103.0 12.5 15.0
Upvotes: 3