Tom
Tom

Reputation: 2341

Attributing the difference of a variable in different years in a new column based on a certain condition

I have a dataframe which is called df, which looks like the following:

country     year    taxrevenue  VAT  CIT
Angola      2006    1000        750  250
Albania     1998    1000        750  250
Albania     2002    1500        1125 375
Albania     2005    1200         900 300
Argentina   1984    1900        1425 475
Argentina   1991    1400        1050 350
Argentina   1995    1600        1200 400

What I would like to do is create an additional column for each (tax) column which holds the difference between the tax in one year and the tax in the next, conditional on it being the same country, like so:

country     year    dtaxrevenue dvat    dcit
Angola      2006      NA          NA      NA
Albania     1998      NA          NA      NA
Albania     2002     500         375     125
Albania     2005    -300        -225     -75
Argentina   1984      NA         525     175
Argentina   1991    -500        -375    -125
Argentina   1995     200         150      50

I had the following in mind:

#for each column in df
for(i in 1:ncol(TRcomplete)){
#add a new column
TRcomplete <- cbind(TRcomplete[,i])
# When condition holds..
if(TRcomplete$year[,i]- TRcomplete$year[,i+1] < 0) & TRcomplete$country[,i] == TRcomplete$country[,i+1]{
# try to subtract two values and put them in a new column
try(TRcomplete[,i+1] <- TRcomplete[,i]- TRcomplete[,i+1]
} else {
TRcomplete[,i+1]<-NA
}

I am mostly confused as to how to refer to each column. Can anybody offer some help?

Upvotes: 1

Views: 71

Answers (3)

DanY
DanY

Reputation: 6073

One version with data.table.

# make data
library(data.table)
dt <- data.table(
    country = c("Angola", "Albania", "Albania", "Albania", "Argentina", "Argentina", "Argentina"),
    year    = c(2006, 1998, 2002, 2005, 1984, 1991, 1995),
    tax     = c(1000, 1000, 1500, 1200, 1900, 1400, 1600),
    vat     = c(750,   750, 1125,  900, 1425, 1050, 1200),
    cit     = c(250,   250,  375,  300,  475,  350,  400)
)

# get lagged taxes
dt <- dt[order(country, year)]
dt[ , lag_tax := shift(tax), by=country]
dt[ , lag_vat := shift(vat), by=country]
dt[ , lag_cit := shift(cit), by=country]

# calculate tax differences
dt[ , dtax := tax - lag_tax]
dt[ , dvat := vat - lag_vat]
dt[ , dcit := cit - lag_cit]

If you needed to do this for many columns, use lapply and .SD:

# get lagged cols
dt[ , paste0(names(dt)[3:5], "_lag") := lapply(.SD, shift), by=country, .SDcols=3:5]

# or go straight to differences
dt[ , paste0(names(dt)[3:5], "_diff") := lapply(.SD, function(x) x - shift(x)), by=country, .SDcols=3:5]

Upvotes: 1

A. Suliman
A. Suliman

Reputation: 13125

Using dplyr we can try

library(dplyr)
df %>% group_by(country) %>% 
       mutate(Tax = taxrevenue-lag(taxrevenue))


  # A tibble: 7 x 6
  # Groups:   country [3]
  country    year taxrevenue   VAT   CIT   Tax
  <chr>     <int>      <int> <int> <int> <int>
1 Angola     2006       1000   750   250    NA
2 Albania    1998       1000   750   250    NA
3 Albania    2002       1500  1125   375   500
4 Albania    2005       1200   900   300  -300
5 Argentina  1984       1900  1425   475    NA
6 Argentina  1991       1400  1050   350  -500
7 Argentina  1995       1600  1200   400   200

Update

df %>%
 group_by(country) %>%
 mutate_if(is.numeric, funs(d = . - lag(.))) %>% select(-year_d) %>% 
 rename_at(vars(ends_with("_d")), funs(paste("d", gsub("_d", "", .), sep = "") )) %>% 
 head(n=5)

    # A tibble: 5 x 8
    # Groups:   country [3]
    country    year taxrevenue   VAT   CIT dtaxrevenue  dVAT  dCIT
    <fct>     <int>      <int> <int> <int>       <int> <int> <int>
      1 Angola     2006       1000   750   250          NA    NA    NA
    2 Albania    1998       1000   750   250          NA    NA    NA
    3 Albania    2002       1500  1125   375         500   375   125
    4 Albania    2005       1200   900   300        -300  -225   -75
    5 Argentina  1984       1900  1425   475          NA    NA    NA

Upvotes: 1

Prem
Prem

Reputation: 11955

mutate_at is the candidate

library(dplyr)

df %>%
  arrange(country, year) %>%  #sort data
  group_by(country) %>%
  mutate_at(vars(taxrevenue:CIT), funs(d = . - lag(.)))

which gives

  country    year taxrevenue   VAT   CIT taxrevenue_d VAT_d CIT_d
1 Albania    1998       1000   750   250           NA    NA    NA
2 Albania    2002       1500  1125   375          500   375   125
3 Albania    2005       1200   900   300         -300  -225  - 75
4 Angola     2006       1000   750   250           NA    NA    NA
5 Argentina  1984       1900  1425   475           NA    NA    NA
6 Argentina  1991       1400  1050   350         -500  -375  -125
7 Argentina  1995       1600  1200   400          200   150    50


Sample data:

df <- structure(list(country = c("Angola", "Albania", "Albania", "Albania", 
"Argentina", "Argentina", "Argentina"), year = c(2006L, 1998L, 
2002L, 2005L, 1984L, 1991L, 1995L), taxrevenue = c(1000L, 1000L, 
1500L, 1200L, 1900L, 1400L, 1600L), VAT = c(750L, 750L, 1125L, 
900L, 1425L, 1050L, 1200L), CIT = c(250L, 250L, 375L, 300L, 475L, 
350L, 400L)), .Names = c("country", "year", "taxrevenue", "VAT", 
"CIT"), class = "data.frame", row.names = c(NA, -7L))

Upvotes: 1

Related Questions