weskpga
weskpga

Reputation: 2087

Create new columns while appending to pandas dataframe

I have a dataframe with ~12,000 columns and a few hundred rows like the below:

Date     id1    id2    id3    id4    ...    id12000
1/4/16   100    78     103.5  15            35
1/5/16   100.5  78     104    16            37
1/6/16   99.5   78.5   104.5  16.5          37
1/7/16   100.1  77     104.2  13            37
1/8/16   100.3  76.5   104.2  13.5          33
1/11/16  99.8   77     103.7  13.5          34
...
5/1/17   101.1  82     103.6  12            33

Every day, I get a new dataframe that I need to add to this dataframe. So, for example, I may get a dataframe for May 2nd that looks like the below:

Date     id1    id3    id4    id6    ...    id12100
5/2/17   100    103    12.5   15            100

But that dataframe will likely contain a few new ID's (columns) every single day, and it may also have a few of the ID's omitted that were previously there.

I would like my new dataframe to add columns every time there is a new ID, and fill all previous values of that column with NaN's. I'd also like any cells of ID's that were previously in the dataframe to be filled with NaN's. So, for example, the joining of the two above example dataframes would look like:

Date     id1    id2    id3    id4    ...    id12000   id12100
1/4/16   100    78     103.5  15            35        NaN
1/5/16   100.5  78     104    16            37        NaN
1/6/16   99.5   78.5   104.5  16.5          37        NaN
1/7/16   100.1  77     104.2  13            37        NaN
1/8/16   100.3  76.5   104.2  13.5          33        NaN
1/11/16  99.8   77     103.7  13.5          34        NaN
...
5/1/17   101.1  82     103.6  12            33        NaN
5/2/17   100    NaN    103    12.5          NaN       100

However, if I run this on a date that's already in the dataframe, I'd like to simply replace the old value with the new value.

I've been wrangling with this problem for a while now and can't quite figure out how I need to do it, between properly formatting the dates so that I can check whether there is a duplicate in the dataframe, to properly appending/joining/merging/etc. the old values with the new ones. I have several different places where a generalized function would be really useful for me, so I'm hoping someone has had the same problem and has an elegant solution that I can use.

Thanks in advance!

Upvotes: 2

Views: 147

Answers (2)

Scott Boston
Scott Boston

Reputation: 153500

I think you need combine_first be sure Date is your index in both dataframes:

print(df)

           id1   id2    id3   id4  id12000
Date                                      
1/4/16   100.0  78.0  103.5  15.0       35
1/5/16   100.5  78.0  104.0  16.0       37
1/6/16    99.5  78.5  104.5  16.5       37
1/7/16   100.1  77.0  104.2  13.0       37
1/8/16   100.3  76.5  104.2  13.5       33
1/11/16   99.8  77.0  103.7  13.5       34
5/1/17   101.1  82.0  103.6  12.0       33

print(df_new)

        id1  id3   id4  id6  id12100
Date                                
5/2/17  100  103  12.5   15      100

Use combine_first with Date as index:

df_out = df_new.combine_first(df)
print(df_out)

Output:

           id1  id12000  id12100   id2    id3   id4   id6
Date                                                     
1/11/16   99.8     34.0      NaN  77.0  103.7  13.5   NaN
1/4/16   100.0     35.0      NaN  78.0  103.5  15.0   NaN
1/5/16   100.5     37.0      NaN  78.0  104.0  16.0   NaN
1/6/16    99.5     37.0      NaN  78.5  104.5  16.5   NaN
1/7/16   100.1     37.0      NaN  77.0  104.2  13.0   NaN
1/8/16   100.3     33.0      NaN  76.5  104.2  13.5   NaN
5/1/17   101.1     33.0      NaN  82.0  103.6  12.0   NaN
5/2/17   100.0      NaN    100.0   NaN  103.0  12.5  15.0

Upvotes: 2

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210882

In [43]: df
Out[43]:
     Date    id1   id2    id3   id4
0  1/4/16  100.0  78.0  103.5  15.0
1  1/5/16  100.5  78.0  104.0  16.0
2  1/6/16   99.5  78.5  104.5  16.5

In [44]: new
Out[44]:
     Date  id1  id3   id4  id6  id12100
0  1/6/16   11   11  11.0   11      111
1  5/2/17  100  103  12.5   15      100

In [45]: df.set_index('Date').append(new.set_index('Date'))
Out[45]:
          id1  id12100   id2    id3   id4   id6
Date
1/4/16  100.0      NaN  78.0  103.5  15.0   NaN
1/5/16  100.5      NaN  78.0  104.0  16.0   NaN
1/6/16   99.5      NaN  78.5  104.5  16.5   NaN
1/6/16   11.0    111.0   NaN   11.0  11.0  11.0
5/2/17  100.0    100.0   NaN  103.0  12.5  15.0

Upvotes: 3

Related Questions