AJG519
AJG519

Reputation: 3379

Time lag of variables in Pandas

What is the best way in Pandas to calculate a time lag of a list of variables within spatial units?

Let's say I have a Pandas dataframe that looks like this:

>>> MyData=pd.DataFrame(columns=['Year','Location','Var1','Var2'], data=[[2001, 'A',1,2],[2002, 'A',3,4],[2003, 'A',5,6],[2001, 'B',7,8],[2002, 'B',9,10],[2003, 'B',11,12]])

>>> print MyData
   Year Location  Var1  Var2
0  2001        A     1     2
1  2002        A     3     4
2  2003        A     5     6
3  2001        B     7     8
4  2002        B     9    10
5  2003        B    11    12
>>> 

I want a new dataframe with each location taking on the Var1 and Var2 values of the previous year. The dataframe will include 1 less year because you can't lag the first year.

I wrote a function that does this, but it feels a little clunky and I'm wondering if there are ways to improve it or do it more efficiently. Any suggestions on how my code can be improved would be greatly appreciated.

>>> def TimeLag(df, TimeVar, LocationVar, LagVars):
...     AllPeriods=sorted(df[TimeVar].unique().tolist())
...     ResultList=[]
...     for Period in AllPeriods[1:]:      
...         d=df[df[TimeVar]==Period][[TimeVar, LocationVar]]        
...         dtlag=df[df[TimeVar]==Period-1][[LocationVar]+LagVars]
...         d=pd.merge(d, dtlag, on='Location')        
...         d=d.rename(columns={Var:Var+'_tlag' for Var in LagVars})
...         ResultList.append(d)        
...     Final=pd.concat(ResultList).reset_index(drop=True)     
...     return Final        
... 
>>> 
>>> print TimeLag(MyData, 'Year','Location', ['Var1','Var2'])
   Year Location  Var1_tlag  Var2_tlag
0  2002        A          1          2
1  2002        B          7          8
2  2003        A          3          4
3  2003        B          9         10
>>> 

Upvotes: 0

Views: 1869

Answers (2)

Nader Hisham
Nader Hisham

Reputation: 5414

def lag_year(df):
    shifted_years = df.shift().dropna(how = 'all')
    years = df.iloc[1:]
    final = pd.DataFrame({'Year' : years.Year  , 'Var1' : shifted_years.Var1 , 'Var2' : shifted_years.Var2})
    return final

MyData.groupby('Location').apply(lag_year).reset_index().drop('level_1' , axis = 1)

Location    Var1    Var2    Year
    A         1        2    2002
    A         3        4    2003
    B         7        8    2002
    B         9       10    2003

if you want you can sort by year

MyData.groupby('Location').apply(lag_year).reset_index().drop('level_1' , axis = 1).sort_index(by = 'Year')

Upvotes: 1

EdChum
EdChum

Reputation: 394051

OK, after looking at what you want it'll be quicker to take the first rows and then shift the rest of the df after performing a groupby and then concat:

In [77]:
first = MyData[MyData['Year']==2001]
first

Out[77]:
   Year Location  Var1  Var2
0  2001        A     1     2
3  2001        B     7     8

In [80]:    
rest = MyData[MyData['Year']!=2001]
rest = rest.groupby('Location',as_index=False).shift().dropna()
pd.concat([first,rest])
Out[80]:
   Year Location  Var1  Var2
0  2001        A     1     2
3  2001        B     7     8
2  2002        A     3     4
5  2002        B     9    10

Upvotes: 2

Related Questions