Liedakkala
Liedakkala

Reputation: 428

Extract values from a column based on the equality of two other columns in Pandas

I'm new to Pandas and coming from a SQL/SAS background. I'm trying to take financial information from a prior period and merge it onto my DataFrame. Here's a data example:

np.random.seed(0)
df = pd.DataFrame({
    'company': pd.Series(['A','A','A','B','B','B','C','C','C']),
    'year': pd.Series([1,2,3,1,3,4,2,3,4]),
    'quarter': pd.Series([2,2,2,3,4,4,1,1,3]),
    'amt': np.random.randn(9)
})

What I want is to take amt from the prior year when company and quarter match. Here's what I would do in SQL:

SELECT a.*, b.amt as amt_prior
FROM df as a
LEFT JOIN df as b
ON a.company = b.company
    AND a.quarter = b.quarter
    AND a.year = (b.year + 1);

What I have come up with is the following. It works but it seems like I'm not doing something the Pandas way. Is there a better way for me to do this?

df2 = pd.merge(
    df, df, 
    how='left', 
    left_on=['company', 'quarter', 'year'], 
    right_on=['company', 'quarter', np.array(df['year']) + 1], 
    suffixes=('', '_prior')
)

Upvotes: 0

Views: 77

Answers (2)

cs95
cs95

Reputation: 402483

You can use a groupby and transform + shift:

np.random.seed(0)

df
        amt company  quarter  year
0  1.764052       A        2     1
1  0.400157       A        2     2
2  0.978738       A        2     3
3  2.240893       B        3     1
4  1.867558       B        4     3
5 -0.977278       B        4     4
6  0.950088       C        1     2
7 -0.151357       C        1     3
8 -0.103219       C        3     4

priors = df.groupby(['company', 'quarter'])\
                       .shift().add_suffix('_prior')

priors
   amt_prior  year_prior
0        NaN         NaN
1   1.764052         1.0
2   0.400157         2.0
3        NaN         NaN
4        NaN         NaN
5   1.867558         3.0
6        NaN         NaN
7   0.950088         2.0
8        NaN         NaN

Now, concatenate priors with the original:

df = pd.concat([df, priors], 1)

df
        amt company  quarter  year  amt_prior  year_prior
0  1.764052       A        2     1        NaN         NaN
1  0.400157       A        2     2   1.764052         1.0
2  0.978738       A        2     3   0.400157         2.0
3  2.240893       B        3     1        NaN         NaN
4  1.867558       B        4     3        NaN         NaN
5 -0.977278       B        4     4   1.867558         3.0
6  0.950088       C        1     2        NaN         NaN
7 -0.151357       C        1     3   0.950088         2.0
8 -0.103219       C        3     4        NaN         NaN

This matches with your expected output df2.

You can also get the same answer using Vaishali's solution, but using 2 groupbys for the same operation is inefficient.

Upvotes: 2

Vaishali
Vaishali

Reputation: 38415

You can use

df['amt_prior'] = df.groupby(['company', 'quarter']).amt.shift()
df['year_prior'] = df.groupby(['company', 'quarter']).year.shift()


    amt company quarter year    amt_prior   year_prior
0   -1.309558   A   2   1       NaN         NaN
1   0.579060    A   2   2       -1.309558   1.0
2   -0.373324   A   2   3       0.579060    2.0
3   -0.515989   B   3   1       NaN         NaN
4   1.013687    B   4   3       NaN         NaN
5   1.419462    B   4   4       1.013687    3.0
6   -0.918983   C   1   2       NaN         NaN
7   0.121057    C   1   3       -0.918983   2.0
8   -0.188670   C   3   4       NaN         NaN

Upvotes: 3

Related Questions