Reputation: 163
I have a time series where one of the columns contains mostly NaNs. The dataframe is quite large, so handling all these NaNs is becoming a burden. If I simply drop all of them, plots get completely messed up by interpolation between the remaining data points.
Here's a short example of what I have:
v x
0.0000 0.000000 NaN
0.0002 0.062791 NaN
0.0004 0.125333 NaN
0.0006 0.187381 95.0
0.0008 0.248690 NaN
0.0010 0.309017 NaN
0.0012 0.368125 NaN
0.0014 0.425779 88.0
0.0016 0.481754 85.0
0.0018 0.535827 91.0
0.0020 0.587785 NaN
0.0022 0.637424 NaN
0.0024 0.684547 NaN
0.0026 0.728969 99.0
...
and what I want to achieve:
v x
0.0004 0.125333 NaN
0.0006 0.187381 95.0
0.0012 0.368125 NaN
0.0014 0.425779 88.0
0.0016 0.481754 85.0
0.0018 0.535827 91.0
0.0024 0.684547 NaN
0.0026 0.728969 99.0
...
Iterating through rows is not an option, as it will be way too slow, but I can't figure out any reasonable approach. Any ideas?
A dataset example (rather short) to work with:
import pandas as pd
import numpy as np

f = 50      # signal frequency [Hz]
Tmax = 1    # duration [s]
fs = 5000   # sampling frequency [Hz]
df = pd.DataFrame(index=np.arange(0, Tmax, 1/fs),
                  data={'x': np.random.randint(0, 100, size=int(fs*Tmax))})
df['v'] = np.sin(2*np.pi*f*df.index)
# Most of "x" is NaN
df.loc[df['x'] < 75, 'x'] = np.nan
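For reference, the interpolation problem shows up if the NaNs are simply dropped before plotting (a minimal sketch, assuming matplotlib is available):

import matplotlib.pyplot as plt

# Dropping every NaN joins distant surviving points with straight lines,
# which is the distorted interpolation described above.
df['x'].dropna().plot(marker='o')
plt.show()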
Upvotes: 2
Views: 196
Reputation: 862481
Use boolean indexing: test the column with Series.notna, do the same for its Series.shift-ed version, and chain both masks with | for bitwise OR:
df = df[df.x.shift(-1).notna() | df.x.notna()]
print(df)
v x
0.0004 0.125333 NaN
0.0006 0.187381 95.0
0.0012 0.368125 NaN
0.0014 0.425779 88.0
0.0016 0.481754 85.0
0.0018 0.535827 91.0
0.0024 0.684547 NaN
0.0026 0.728969 99.0
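If you also need to keep the NaN row right after each run of values (so plots get a break on both sides), you can chain one more shifted mask; a small sketch of the same idea:

# keep rows with a value, plus the NaN rows immediately before and after them
mask = df.x.notna() | df.x.shift(-1).notna() | df.x.shift(1).notna()
df = df[mask]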
Upvotes: 3
Reputation: 989
Create a new column temp by using the shift method on x.
df['temp'] = df.x.shift(-1)
Then filter the rows where either x or temp is not null.
df[(~df.x.isnull()) | (~df.temp.isnull())]
Since this approach uses built-in vectorized functions and filtering rather than loops, it should be much faster.
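If the helper column should not stay in the result, it can be dropped afterwards, for example:

# filter, then discard the temporary helper column
out = df[(~df.x.isnull()) | (~df.temp.isnull())].drop(columns='temp')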
Upvotes: 2