gis20

Reputation: 1094

Fill an empty DataFrame with common index values from another DataFrame

I have a dataframe holding a time series that spans one month at a nominal frequency of one second.

The problem is that the time step between records is not always 1 second.

time                c1  c2
2013-01-01 00:00:01 5   3
2013-01-01 00:00:03 7   2
2013-01-01 00:00:04 1   5
2013-01-01 00:00:05 4   3
2013-01-01 00:00:06 5   6
2013-01-01 00:00:09 4   2
2013-01-01 00:00:10 7   8

I want to create an empty dataframe with the same columns that covers the whole period at a regular step, that is, with as many records as there are seconds in a month. This empty dataframe is initially filled with nan values:

time                c1  c2
2013-01-01 00:00:01 nan nan
2013-01-01 00:00:02 nan nan
2013-01-01 00:00:03 nan nan
2013-01-01 00:00:04 nan nan
2013-01-01 00:00:05 nan nan
2013-01-01 00:00:06 nan nan
2013-01-01 00:00:07 nan nan
2013-01-01 00:00:08 nan nan
2013-01-01 00:00:09 nan nan
2013-01-01 00:00:10 nan nan

Then I want to compare both and fill the empty one with the rows it has in common with my first dataframe. The rows that are not in common should keep their nan values.

time                c1  c2
2013-01-01 00:00:01 5   3
2013-01-01 00:00:02 nan nan
2013-01-01 00:00:03 7   2
2013-01-01 00:00:04 1   5
2013-01-01 00:00:05 4   3
2013-01-01 00:00:06 5   6
2013-01-01 00:00:07 nan nan
2013-01-01 00:00:08 nan nan
2013-01-01 00:00:09 4   2
2013-01-01 00:00:10 7   8

My attempt:

import pandas as pd

# Read the first dataframe from a file
df1 = pd.read_table(fin, parse_dates=[0], names=ch, index_col=0, header=0, decimal='.', skiprows=c)
# Create an empty dataframe
N = 86400 * 31  # seconds per month
index = pd.date_range(df1.index[0], periods=N-1, freq='1s')
df2 = pd.DataFrame(index=index, columns=df1.columns)

I then tried merge and concat, but neither gives the expected result:

df2.merge(df1, how='outer')
pd.concat([df2,df1], axis=0, join='outer')

Upvotes: 0


Answers (3)

ursan

Reputation: 2447

If you already set up the indexes in the "nan" data frame, I think you should be able to just use loc. Indexing is a really important thing to master when using Pandas. It will save you a whole lot of time, make your code a lot cleaner and can really improve your performance.

Careful though: for the trick below to work as is, every index label of df1 must already exist in df2, and the columns have to be the same.

>>> import pandas as pd
>>> import numpy as np

>>> df1 = pd.DataFrame(np.random.rand(10, 3), columns=['A', 'B', 'C'])
>>> df1
          A         B         C
0  0.171502  0.258416  0.118326
1  0.215456  0.462122  0.858173
2  0.373549  0.946400  0.579845
3  0.606289  0.289552  0.473658
4  0.885899  0.783747  0.089975
5  0.674208  0.639710  0.105642
6  0.404775  0.541389  0.268101
7  0.374609  0.693916  0.743575
8  0.074773  0.150072  0.135555
9  0.230431  0.202417  0.466538

>>> df2 = pd.DataFrame(np.nan, index=range(15), columns=['A', 'B', 'C'])
>>> df2
     A   B   C
0  NaN NaN NaN
1  NaN NaN NaN
2  NaN NaN NaN
3  NaN NaN NaN
4  NaN NaN NaN
5  NaN NaN NaN
6  NaN NaN NaN
7  NaN NaN NaN
8  NaN NaN NaN
9  NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN NaN
13 NaN NaN NaN
14 NaN NaN NaN

>>> df2.loc[df1.index] = df1    # This is where the magic happens
>>> df2
           A         B         C
0   0.171502  0.258416  0.118326
1   0.215456  0.462122  0.858173
2   0.373549  0.946400  0.579845
3   0.606289  0.289552  0.473658
4   0.885899  0.783747  0.089975
5   0.674208  0.639710  0.105642
6   0.404775  0.541389  0.268101
7   0.374609  0.693916  0.743575
8   0.074773  0.150072  0.135555
9   0.230431  0.202417  0.466538
10       NaN       NaN       NaN
11       NaN       NaN       NaN
12       NaN       NaN       NaN
13       NaN       NaN       NaN
14       NaN       NaN       NaN
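
The same assignment can be sketched on the timestamps from the question. This is only an illustration, assuming the index is a DatetimeIndex without duplicates and that the full grid runs from the first to the last record; the names times and full_index are made up for the example.

```python
import numpy as np
import pandas as pd

# Irregular samples copied from the question.
times = pd.to_datetime([
    "2013-01-01 00:00:01", "2013-01-01 00:00:03", "2013-01-01 00:00:04",
    "2013-01-01 00:00:05", "2013-01-01 00:00:06", "2013-01-01 00:00:09",
    "2013-01-01 00:00:10",
])
df1 = pd.DataFrame(
    {"c1": [5, 7, 1, 4, 5, 4, 7], "c2": [3, 2, 5, 3, 6, 2, 8]},
    index=times,
)

# Complete one-second grid, pre-filled with NaN (float dtype).
full_index = pd.date_range(df1.index[0], df1.index[-1], freq="s")
df2 = pd.DataFrame(np.nan, index=full_index, columns=df1.columns)

# Every label in df1.index exists in df2.index, so the rows align by label.
df2.loc[df1.index] = df1
print(df2)
```

Rows whose timestamps are missing from df1 keep their NaN values, and the columns come out as floats because NaN cannot be stored in an integer column.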

Upvotes: 0

user2285236


I don't think you need a second dataframe. If you call resample without a fill_method, it will store NaNs for the missing periods:

df.resample("s").max()
Out[62]: 
                      c1   c2
time                         
2013-01-01 00:00:01  5.0  3.0
2013-01-01 00:00:02  NaN  NaN
2013-01-01 00:00:03  7.0  2.0
2013-01-01 00:00:04  1.0  5.0
2013-01-01 00:00:05  4.0  3.0
2013-01-01 00:00:06  5.0  6.0
2013-01-01 00:00:07  NaN  NaN
2013-01-01 00:00:08  NaN  NaN
2013-01-01 00:00:09  4.0  2.0
2013-01-01 00:00:10  7.0  8.0

max() here is just an arbitrary method so that it returns a dataframe. You can replace it with mean, min etc. assuming you have no duplicates. If you have duplicates, they will be aggregated by that function.

As Paul H suggested in the comments, you can use df.resample("s").asfreq() without any aggregation. It skips an unnecessary step of aggregation so it is probably more efficient. It will raise an error if you have duplicate values in the index.
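
A minimal sketch of the asfreq variant on the question's data follows. The step that drops duplicate timestamps is my own assumption about how one might guard against the error mentioned above (it keeps the first record per timestamp); it is a no-op on this sample, which has no duplicates.

```python
import pandas as pd

# Irregular samples copied from the question.
times = pd.to_datetime([
    "2013-01-01 00:00:01", "2013-01-01 00:00:03", "2013-01-01 00:00:04",
    "2013-01-01 00:00:05", "2013-01-01 00:00:06", "2013-01-01 00:00:09",
    "2013-01-01 00:00:10",
])
df = pd.DataFrame(
    {"c1": [5, 7, 1, 4, 5, 4, 7], "c2": [3, 2, 5, 3, 6, 2, 8]},
    index=times,
)
df.index.name = "time"

# Assumption: if a timestamp occurs twice, keep only its first record.
deduped = df[~df.index.duplicated(keep="first")]

# Insert the missing seconds as NaN rows, without aggregating anything.
filled = deduped.resample("s").asfreq()
print(filled)
```

Whether keeping the first duplicate is appropriate depends on the data; aggregating with mean or max, as described above, is the alternative.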

Upvotes: 1

Paul H

Reputation: 68156

You need to reindex the dataframe.

import pandas
df = pandas.read_table(filename, **options)
N = 86400 * 31 #seconds per month
dates = pandas.date_range(df.index[0], periods=N-1, freq='1s')
df = df.reindex(dates)

Here's a reproducible demonstration:

df = pandas.DataFrame(
    data={'A': range(0, 10), 'B': range(0, 20, 2)},
    index=pandas.date_range('2012-01-01', freq='2s', periods=10)
).reindex(pandas.date_range('2012-01-01', freq='1s', periods=25))

print(df)

                       A     B
2012-01-01 00:00:00  0.0   0.0
2012-01-01 00:00:01  NaN   NaN
2012-01-01 00:00:02  1.0   2.0
2012-01-01 00:00:03  NaN   NaN
2012-01-01 00:00:04  2.0   4.0
2012-01-01 00:00:05  NaN   NaN
2012-01-01 00:00:06  3.0   6.0
2012-01-01 00:00:07  NaN   NaN
2012-01-01 00:00:08  4.0   8.0
2012-01-01 00:00:09  NaN   NaN
2012-01-01 00:00:10  5.0  10.0
2012-01-01 00:00:11  NaN   NaN
2012-01-01 00:00:12  6.0  12.0
2012-01-01 00:00:13  NaN   NaN
2012-01-01 00:00:14  7.0  14.0
2012-01-01 00:00:15  NaN   NaN
2012-01-01 00:00:16  8.0  16.0
2012-01-01 00:00:17  NaN   NaN
2012-01-01 00:00:18  9.0  18.0
2012-01-01 00:00:19  NaN   NaN
2012-01-01 00:00:20  NaN   NaN
2012-01-01 00:00:21  NaN   NaN
2012-01-01 00:00:22  NaN   NaN
2012-01-01 00:00:23  NaN   NaN
2012-01-01 00:00:24  NaN   NaN
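
For the data in the question, the same idea could look like the sketch below. Instead of hard-coding 31 days, it derives the end of the month from the first timestamp; that choice, and the names month_start, month_end and grid, are assumptions of this example rather than part of the code above.

```python
import pandas as pd

# Irregular samples copied from the question.
times = pd.to_datetime([
    "2013-01-01 00:00:01", "2013-01-01 00:00:03", "2013-01-01 00:00:04",
    "2013-01-01 00:00:05", "2013-01-01 00:00:06", "2013-01-01 00:00:09",
    "2013-01-01 00:00:10",
])
df = pd.DataFrame(
    {"c1": [5, 7, 1, 4, 5, 4, 7], "c2": [3, 2, 5, 3, 6, 2, 8]},
    index=times,
)

# Assumption: the grid should cover the whole calendar month of the first record.
month_start = df.index[0].normalize().replace(day=1)
month_end = month_start + pd.offsets.MonthBegin(1) - pd.Timedelta(seconds=1)
grid = pd.date_range(month_start, month_end, freq="s")

# Rows present in df keep their values; every other second becomes NaN.
full = df.reindex(grid)
print(len(full))
print(full.iloc[:11])
```

Building the grid from calendar boundaries avoids the off-by-one questions that come with a fixed periods count, and it also handles months that are shorter than 31 days.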

Upvotes: 0
