Reputation: 3379
What is the best way in Pandas to calculate a time lag of a list of variables within spatial units?
Let's say I have a Pandas dataframe that looks like this:
>>> MyData=pd.DataFrame(columns=['Year','Location','Var1','Var2'], data=[[2001, 'A',1,2],[2002, 'A',3,4],[2003, 'A',5,6],[2001, 'B',7,8],[2002, 'B',9,10],[2003, 'B',11,12]])
>>> print MyData
Year Location Var1 Var2
0 2001 A 1 2
1 2002 A 3 4
2 2003 A 5 6
3 2001 B 7 8
4 2002 B 9 10
5 2003 B 11 12
>>>
I want a new dataframe with each location taking on the Var1 and Var2 values of the previous year. The dataframe will include 1 less year because you can't lag the first year.
I wrote a function that does this, but it feels a little clunky and I'm wondering if there are ways to improve it or do it more efficiently. Any suggestions on how my code can be improved would be greatly appreciated.
>>> def TimeLag(df, TimeVar, LocationVar, LagVars):
... AllPeriods=sorted(df[TimeVar].unique().tolist())
... ResultList=[]
... for Period in AllPeriods[1:]:
... d=df[df[TimeVar]==Period][[TimeVar, LocationVar]]
... dtlag=df[df[TimeVar]==Period-1][[LocationVar]+LagVars]
... d=pd.merge(d, dtlag, on='Location')
... d=d.rename(columns={Var:Var+'_tlag' for Var in LagVars})
... ResultList.append(d)
... Final=pd.concat(ResultList).reset_index(drop=True)
... return Final
...
>>>
>>> print TimeLag(MyData, 'Year','Location', ['Var1','Var2'])
Year Location Var1_tlag Var2_tlag
0 2002 A 1 2
1 2002 B 7 8
2 2003 A 3 4
3 2003 B 9 10
>>>
Upvotes: 0
Views: 1869
Reputation: 5414
def lag_year(df):
shifted_years = df.shift().dropna(how = 'all')
years = df.iloc[1:]
final = pd.DataFrame({'Year' : years.Year , 'Var1' : shifted_years.Var1 , 'Var2' : shifted_years.Var2})
return final
MyData.groupby('Location').apply(lag_year).reset_index().drop('level_1' , axis = 1)
Location Var1 Var2 Year
A 1 2 2002
A 3 4 2003
B 7 8 2002
B 9 10 2003
if you want you can sort by year
MyData.groupby('Location').apply(lag_year).reset_index().drop('level_1' , axis = 1).sort_index(by = 'Year')
Upvotes: 1
Reputation: 394051
OK, after looking at what you want it'll be quicker to take the first rows and then shift
the rest of the df after performing a groupby
and then concat
:
In [77]:
first = MyData[MyData['Year']==2001]
first
Out[77]:
Year Location Var1 Var2
0 2001 A 1 2
3 2001 B 7 8
In [80]:
rest = MyData[MyData['Year']!=2001]
rest = rest.groupby('Location',as_index=False).shift().dropna()
pd.concat([first,rest])
Out[80]:
Year Location Var1 Var2
0 2001 A 1 2
3 2001 B 7 8
2 2002 A 3 4
5 2002 B 9 10
Upvotes: 2