Reputation: 1094
I have a dataframe holding a time series that spans one month at a nominal frequency of one second.
The problem is that the time step between records is not always 1 second.
time c1 c2
2013-01-01 00:00:01 5 3
2013-01-01 00:00:03 7 2
2013-01-01 00:00:04 1 5
2013-01-01 00:00:05 4 3
2013-01-01 00:00:06 5 6
2013-01-01 00:00:09 4 2
2013-01-01 00:00:10 7 8
I then want to create an empty dataframe with the same columns that covers the whole period at the correct step, that is, with as many records as there are seconds in a month. This empty dataframe is initially filled with nan values:
time c1 c2
2013-01-01 00:00:01 nan nan
2013-01-01 00:00:02 nan nan
2013-01-01 00:00:03 nan nan
2013-01-01 00:00:04 nan nan
2013-01-01 00:00:05 nan nan
2013-01-01 00:00:06 nan nan
2013-01-01 00:00:07 nan nan
2013-01-01 00:00:08 nan nan
2013-01-01 00:00:09 nan nan
2013-01-01 00:00:10 nan nan
Then I want to compare both and fill the empty one with the rows it has in common with my first dataframe. The rows not in common should keep their nan values.
time c1 c2
2013-01-01 00:00:01 5 3
2013-01-01 00:00:02 nan nan
2013-01-01 00:00:03 7 2
2013-01-01 00:00:04 1 5
2013-01-01 00:00:05 4 3
2013-01-01 00:00:06 5 6
2013-01-01 00:00:07 nan nan
2013-01-01 00:00:08 nan nan
2013-01-01 00:00:09 4 2
2013-01-01 00:00:10 7 8
My try:
#Read from a file the first dataframe
df1=pd.read_table(fin,parse_dates=[0],names=ch,index_col=0,header=0,decimal='.',skiprows=c)
#create an empty dataframe
N=86400 * 31#seconds per month
index=pd.date_range(df1.index[0], periods=N-1, freq='1s')
df2=pd.DataFrame(index=index, columns=df1.columns)
I then tried merge and concat, but neither gives the expected result:
df2.merge(df1, how='outer')
pd.concat([df2,df1], axis=0, join='outer')
Upvotes: 0
Views: 382
Reputation: 2447
If you have already set up the index in the "nan" data frame, I think you should be able to just use loc. Indexing is a really important thing to master when using Pandas. It will save you a lot of time, make your code cleaner, and can really improve performance.
Careful though: the index labels and columns of df1 must all be present in df2 for the trick below to work as is.
>>> import pandas as pd
>>> import numpy as np
>>> df1 = pd.DataFrame(np.random.rand(10, 3), columns=['A', 'B', 'C'])
>>> df1
A B C
0 0.171502 0.258416 0.118326
1 0.215456 0.462122 0.858173
2 0.373549 0.946400 0.579845
3 0.606289 0.289552 0.473658
4 0.885899 0.783747 0.089975
5 0.674208 0.639710 0.105642
6 0.404775 0.541389 0.268101
7 0.374609 0.693916 0.743575
8 0.074773 0.150072 0.135555
9 0.230431 0.202417 0.466538
>>> df2 = pd.DataFrame(np.nan, index=range(15), columns=['A', 'B', 'C'])
>>> df2
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN NaN
13 NaN NaN NaN
14 NaN NaN NaN
>>> df2.loc[df1.index] = df1 # This is where the magic happens
>>> df2
A B C
0 0.171502 0.258416 0.118326
1 0.215456 0.462122 0.858173
2 0.373549 0.946400 0.579845
3 0.606289 0.289552 0.473658
4 0.885899 0.783747 0.089975
5 0.674208 0.639710 0.105642
6 0.404775 0.541389 0.268101
7 0.374609 0.693916 0.743575
8 0.074773 0.150072 0.135555
9 0.230431 0.202417 0.466538
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN NaN
13 NaN NaN NaN
14 NaN NaN NaN
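The same assignment should carry over to a DatetimeIndex like the one in the question. Below is a minimal sketch, assuming the question's sample rows are already indexed by timestamp; the names times and full_index are illustrative and not from the original post.

```python
import numpy as np
import pandas as pd

# Sample rows from the question, indexed by timestamp (gaps at :02, :07, :08).
times = pd.to_datetime([
    "2013-01-01 00:00:01", "2013-01-01 00:00:03", "2013-01-01 00:00:04",
    "2013-01-01 00:00:05", "2013-01-01 00:00:06", "2013-01-01 00:00:09",
    "2013-01-01 00:00:10",
])
df1 = pd.DataFrame(
    {"c1": [5, 7, 1, 4, 5, 4, 7], "c2": [3, 2, 5, 3, 6, 2, 8]},
    index=times,
)

# NaN-filled frame with one row per second over the same span.
full_index = pd.date_range(times[0], times[-1], freq="s")
df2 = pd.DataFrame(np.nan, index=full_index, columns=df1.columns)

# Copy the rows whose timestamps exist in df1; all other rows stay NaN.
# Every label in df1.index is present in df2.index here by construction.
df2.loc[df1.index] = df1
print(df2)
```

The values come back as floats because the target frame was created with NaN.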
Upvotes: 0
Reputation:
I don't think you need a second dataframe. If you call resample without a fill_method, it will store NaNs for the missing periods:
df.resample("s").max()
Out[62]:
c1 c2
time
2013-01-01 00:00:01 5.0 3.0
2013-01-01 00:00:02 NaN NaN
2013-01-01 00:00:03 7.0 2.0
2013-01-01 00:00:04 1.0 5.0
2013-01-01 00:00:05 4.0 3.0
2013-01-01 00:00:06 5.0 6.0
2013-01-01 00:00:07 NaN NaN
2013-01-01 00:00:08 NaN NaN
2013-01-01 00:00:09 4.0 2.0
2013-01-01 00:00:10 7.0 8.0
max() here is just an arbitrary method so that it returns a dataframe. You can replace it with mean, min, etc., assuming you have no duplicates. If you have duplicates, they will be aggregated by that function.
As Paul H suggested in the comments, you can use df.resample("s").asfreq() without any aggregation. It skips an unnecessary aggregation step, so it is probably more efficient. It will raise an error if you have duplicate values in the index.
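To make the difference concrete, here is a small sketch with one duplicated timestamp. The data is made up for illustration, and dropping duplicates with keep="first" is just one possible policy, not something the original answer prescribes.

```python
import pandas as pd

# Hypothetical data: the 00:00:03 timestamp appears twice.
times = pd.to_datetime([
    "2013-01-01 00:00:01",
    "2013-01-01 00:00:03",
    "2013-01-01 00:00:03",
    "2013-01-01 00:00:05",
])
df = pd.DataFrame({"c1": [5, 7, 9, 4], "c2": [3, 2, 1, 3]}, index=times)

# Aggregating: the two 00:00:03 rows collapse into their column-wise max.
aggregated = df.resample("s").max()

# No aggregation: drop duplicate timestamps first (keeping the first row),
# then let asfreq insert NaN rows for the missing seconds.
deduped = df[~df.index.duplicated(keep="first")]
filled = deduped.resample("s").asfreq()

print(aggregated)
print(filled)
```

Which duplicate to keep (or whether to aggregate instead) depends on what the duplicated records mean in your data.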
Upvotes: 1
Reputation: 68156
You need to reindex the dataframe.
import pandas
df = pandas.read_table(filename, **options)
N = 86400 * 31 #seconds per month
dates = pandas.date_range(df.index[0], periods=N-1, freq='1s')
df = df.reindex(dates)
Here's a reproducible demonstration:
df = pandas.DataFrame(
data={'A': range(0, 10), 'B': range(0, 20, 2)},
index=pandas.date_range('2012-01-01', freq='2s', periods=10)
).reindex(pandas.date_range('2012-01-01', freq='1s', periods=25))
print(df)
A B
2012-01-01 00:00:00 0.0 0.0
2012-01-01 00:00:01 NaN NaN
2012-01-01 00:00:02 1.0 2.0
2012-01-01 00:00:03 NaN NaN
2012-01-01 00:00:04 2.0 4.0
2012-01-01 00:00:05 NaN NaN
2012-01-01 00:00:06 3.0 6.0
2012-01-01 00:00:07 NaN NaN
2012-01-01 00:00:08 4.0 8.0
2012-01-01 00:00:09 NaN NaN
2012-01-01 00:00:10 5.0 10.0
2012-01-01 00:00:11 NaN NaN
2012-01-01 00:00:12 6.0 12.0
2012-01-01 00:00:13 NaN NaN
2012-01-01 00:00:14 7.0 14.0
2012-01-01 00:00:15 NaN NaN
2012-01-01 00:00:16 8.0 16.0
2012-01-01 00:00:17 NaN NaN
2012-01-01 00:00:18 9.0 18.0
2012-01-01 00:00:19 NaN NaN
2012-01-01 00:00:20 NaN NaN
2012-01-01 00:00:21 NaN NaN
2012-01-01 00:00:22 NaN NaN
2012-01-01 00:00:23 NaN NaN
2012-01-01 00:00:24 NaN NaN
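If the goal is to cover a calendar month exactly, the number of seconds can be derived from the data rather than hard-coded as 31 days. The following is a sketch under the assumption that the target range should start at midnight on the first day of the month of the first record; month_start, n_seconds and full_month are illustrative names.

```python
import pandas as pd

# Sample rows from the question, indexed by timestamp.
times = pd.to_datetime([
    "2013-01-01 00:00:01", "2013-01-01 00:00:03", "2013-01-01 00:00:04",
    "2013-01-01 00:00:05", "2013-01-01 00:00:06", "2013-01-01 00:00:09",
    "2013-01-01 00:00:10",
])
df = pd.DataFrame(
    {"c1": [5, 7, 1, 4, 5, 4, 7], "c2": [3, 2, 5, 3, 6, 2, 8]},
    index=times,
)

# Assumption: the full range begins at 00:00:00 on day 1 of the first record's month.
month_start = df.index[0].normalize().replace(day=1)

# days_in_month accounts for 28-, 29-, 30- and 31-day months.
n_seconds = month_start.days_in_month * 86400
full_month = pd.date_range(month_start, periods=n_seconds, freq="s")

# Rows present in df keep their values; every other second becomes NaN.
result = df.reindex(full_month)
print(result.head(12))
```

For January 2013 this produces one row per second from 2013-01-01 00:00:00 through 2013-01-31 23:59:59.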
Upvotes: 0