Sarah

Reputation: 413

DataFrame in pandas with time series data

I just started learning pandas. I came across this:

rng = date_range('1/1/2011', periods=72, freq='H')
s = Series(randn(len(rng)), index=rng)

I understand what the code above does, and I tried it in IPython:

import numpy as np
from numpy.random import randn
from pandas import date_range, Series, DataFrame
r = date_range('1/1/2011', periods=72, freq='H')
r
len(r)
[r[i] for i in range(len(r))]
s = Series(randn(len(r)), index=r)
s
s.plot()
df_new = DataFrame(data = s, columns=['Random Number Generated'])

Is this the correct way to create a DataFrame?

The next step given is: return a series where the absolute difference between a number and the next number in the series is less than 0.5.

Do I need to find the difference between each pair of consecutive random numbers and keep only those where the absolute difference is < 0.5? Can someone explain how to do that in pandas?

I also tried to plot the series as a histogram with:

 df_new.diff().hist()

The graph shows the random numbers on the x-axis and a y-axis running from 0 to 18, which I don't understand. Can someone explain this to me as well?

Upvotes: 0

Views: 133

Answers (1)

Stefan

Reputation: 42885

To give you some pointers in addition to @Dthal's comments:

import pandas as pd
r = pd.date_range('1/1/2011', periods=72, freq='H')

As @Dthal commented, you can create the DataFrame of values sampled from the normal distribution in a single step:

df = pd.DataFrame(index=r, data=randn(len(r)), columns=['Random Number Generated'])
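For comparison, here is a self-contained sketch of two ways to build the same one-column frame. The seed, the variable names, and the use of `pd.Timedelta(hours=1)` as the frequency (chosen so the hourly alias does not depend on the pandas version) are assumptions made for illustration, not part of the original code.

```python
import numpy as np
import pandas as pd

# Hourly index, 72 periods; a Timedelta is used instead of the 'H' alias.
idx = pd.date_range("2011-01-01", periods=72, freq=pd.Timedelta(hours=1))
values = np.random.default_rng(0).standard_normal(len(idx))  # seed is arbitrary

# Route 1: build the frame directly from the array.
df = pd.DataFrame(values, index=idx, columns=["Random Number Generated"])

# Route 2: start from a Series and promote it to a one-column frame.
s = pd.Series(values, index=idx)
df_from_series = s.to_frame(name="Random Number Generated")

print(df.shape)
print(df.equals(df_from_series))
```

Either route should work; `to_frame` is handy when you already hold a Series, as in the question.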

To show only the rows whose value differs by less than 0.5 from the preceding value (the output lists the differences themselves):

diff = df.diff()
diff[abs(diff['Random Number Generated']) < 0.5]

                     Random Number Generated
2011-01-01 02:00:00                 0.061821
2011-01-01 05:00:00                 0.463712
2011-01-01 09:00:00                -0.402802
2011-01-01 11:00:00                -0.000434
2011-01-01 22:00:00                 0.295019
2011-01-02 03:00:00                 0.215095
2011-01-02 05:00:00                 0.424368
2011-01-02 08:00:00                -0.452416
2011-01-02 09:00:00                -0.474999
2011-01-02 11:00:00                 0.385204
2011-01-02 12:00:00                -0.248396
2011-01-02 14:00:00                 0.081890
2011-01-02 17:00:00                 0.421897
2011-01-02 18:00:00                 0.104898
2011-01-03 05:00:00                -0.071969
2011-01-03 15:00:00                 0.101156
2011-01-03 18:00:00                -0.175296
2011-01-03 20:00:00                -0.371812

You can also chain .dropna() onto df.diff() to get rid of the missing value in the first row, which has no preceding value to compare against.
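The exercise as quoted asks about a number and the next number, and for a series rather than a frame. One possible reading, sketched here under those assumptions, compares each value with its successor using `shift(-1)` and returns the original values instead of the differences. The names `step_to_next` and `close_to_next` and the fixed seed are illustrative only.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2011-01-01", periods=72, freq=pd.Timedelta(hours=1))
s = pd.Series(np.random.default_rng(0).standard_normal(len(idx)),
              index=idx, name="Random Number Generated")

# Difference between each value and the NEXT one, aligned to the current row.
step_to_next = s.shift(-1) - s

# Keep the original values whose next neighbour is less than 0.5 away.
# The last row has no next value (NaN), and NaN < 0.5 is False, so it drops out.
close_to_next = s[step_to_next.abs() < 0.5]

print(close_to_next.head())
```

If the intended comparison is with the preceding value instead, `s.diff()` in place of `s.shift(-1) - s` gives the variant used in this answer.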

The pandas.Series.hist() docs state that the default number of bins is 10, so you should expect 10 bars. The y-axis shows how many differences fall into each bin, which is why it runs from 0 to about 18. In this case the histogram is roughly symmetric around zero and spans roughly [-4, +4].

Series.hist(by=None, ax=None, grid=True, xlabelsize=None, xrot=None, ylabelsize=None, yrot=None, figsize=None, bins=10, **kwds)

diff.hist()
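To see where the bar heights come from without plotting, the bin counts can be computed directly. This is a sketch that assumes the plot uses 10 equal-width bins between the smallest and largest difference, which is what `np.histogram` does with `bins=10`; the seed and variable names are illustrative.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2011-01-01", periods=72, freq=pd.Timedelta(hours=1))
s = pd.Series(np.random.default_rng(0).standard_normal(len(idx)), index=idx)

# 71 differences remain once the first row (no predecessor) is dropped.
diffs = s.diff().dropna()

# counts[i] is the number of differences between edges[i] and edges[i + 1];
# the last bin also includes its right edge.
counts, edges = np.histogram(diffs, bins=10)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:+.2f} to {right:+.2f}: {n}")
```

Each printed count corresponds to the height of one bar, and the counts add up to the number of differences.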


Upvotes: 1
