stochastiq
stochastiq

Reputation: 269

Create new ratio indicator in long data

I have a long data frame

mydf <- data.frame(
+     date=c("2016-01-01","2016-02-01","2016-03-01","2016-04-01","2016-05-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01", "2016-06-01"),
+     value=c(1,2,3,4,5,1,2,3,4,5),
+     country=c("US", "US", "US", "US", "US", "US", "US", "US", "US", "US"),
+     indicator=c("gdp", "gdp", "gdp", "gdp", "gdp", "population", "population", "population", "population", "population"))

         date value country  indicator
1  2016-01-01     1      US        gdp
2  2016-02-01     2      US        gdp
3  2016-03-01     3      US        gdp
4  2016-04-01     4      US        gdp
5  2016-05-01     5      US        gdp
6  2016-02-01     1      US population
7  2016-03-01     2      US population
8  2016-04-01     3      US population
9  2016-05-01     4      US population
10 2016-06-01     5      US population

I want to create specific new indicators that come from ratios, e.g. GDP/population*1000

It would look something like this, it has to match the right dates for each respective indicator

mydf <- data.frame(
+     date=c("2016-01-01","2016-02-01","2016-03-01","2016-04-01","2016-05-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01", "2016-06-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01"),
+     value=c(1,2,3,4,5,1,2,3,4,5,2,1.5,1.33,1.2),
+     country=c("US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US"),
+     indicator=c("gdp", "gdp", "gdp", "gdp", "gdp", "population", "population", "population", "population", "population", "gdp per capita", "gdp per capita", "gdp per capita", "gdp per capita"))

         date value country      indicator
1  2016-01-01  1.00      US            gdp
2  2016-02-01  2.00      US            gdp
3  2016-03-01  3.00      US            gdp
4  2016-04-01  4.00      US            gdp
5  2016-05-01  5.00      US            gdp
6  2016-02-01  1.00      US     population
7  2016-03-01  2.00      US     population
8  2016-04-01  3.00      US     population
9  2016-05-01  4.00      US     population
10 2016-06-01  5.00      US     population
11 2016-02-01  2.00      US gdp per capita
12 2016-03-01  1.50      US gdp per capita
13 2016-04-01  1.33      US gdp per capita
14 2016-05-01  1.20      US gdp per capita

Is there an easy way to do this in R?

Upvotes: 3

Views: 55

Answers (2)

Hansang
Hansang

Reputation: 1622

Personally I find the reshape package easier to work with, and it automatically handles having multiple countries/however many types of labels/data types you have.

library(reshape)
mydf <- data.frame(
date=c("2016-01-01","2016-02-01","2016-03-01","2016-04-01","2016-05-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01", 
       "2016-06-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01","2016-05-01"),
value=c(1,2,3,4,5,1,2,3,4,5,2,1.5,1.33,1.2, 2),
country=c("US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", 'AU'),
indicator=c("gdp", "gdp", "gdp", "gdp", "gdp", "population", "population", "population",
            "population", "population", "gdp per capita", "gdp per capita", "gdp per capita", "gdp per capita", 'gdp'))

To get the new indicator, first make your data into wide format so that the relevant columns are next to each other. This is so you can do simple column wise operations

df_wide = cast(mydf, date+country~indicator, sum)

You want the countries and dates as columns that uniquely define rows (left side of the formula), with differnet indicators as columns (right side of formula)

        date country gdp gdp per capita population
1 2016-01-01      US   1           0.00          0
2 2016-02-01      US   2           2.00          1
3 2016-03-01      US   3           1.50          2
4 2016-04-01      US   4           1.33          3
5 2016-05-01      AU   2           0.00          0
6 2016-05-01      US   5           1.20          4
7 2016-06-01      US   0           0.00          5

Now make a new column and set this to whatever you want

df_wide['g_p_ratio'] = df_wide['gdp'] / df_wide['population'] 

Then use Melt to get this back to your long format

df_new = melt(df_wide, id=c('date'))

Voila!

                       date country value      indicator
gdp              2016-01-01      US  1.00            gdp
gdp.1            2016-02-01      US  2.00            gdp
gdp.2            2016-03-01      US  3.00            gdp
gdp.3            2016-04-01      US  4.00            gdp
gdp.4            2016-05-01      AU  2.00            gdp
gdp.5            2016-05-01      US  5.00            gdp
gdp.6            2016-06-01      US  0.00            gdp
gdp.per.capita   2016-01-01      US  0.00 gdp per capita
gdp.per.capita.1 2016-02-01      US  2.00 gdp per capita
gdp.per.capita.2 2016-03-01      US  1.50 gdp per capita
gdp.per.capita.3 2016-04-01      US  1.33 gdp per capita
gdp.per.capita.4 2016-05-01      AU  0.00 gdp per capita
gdp.per.capita.5 2016-05-01      US  1.20 gdp per capita
gdp.per.capita.6 2016-06-01      US  0.00 gdp per capita
population       2016-01-01      US  0.00     population
population.1     2016-02-01      US  1.00     population
population.2     2016-03-01      US  2.00     population
population.3     2016-04-01      US  3.00     population
population.4     2016-05-01      AU  0.00     population
population.5     2016-05-01      US  4.00     population
population.6     2016-06-01      US  5.00     population

You may or may not want your new row labels, but you can fix that

rownames(df_new) <- 1:nrow(df_new)

Upvotes: 2

Kevin Arseneau
Kevin Arseneau

Reputation: 6264

Yes, I think it is easier to make the changes you want with a tidy approach using tidyr and dplyr.

library(dplyr)
library(tidyr)

df <- tribble(
         ~date, ~value, ~country,   ~indicator,
  "2016-01-01",      1,     "US",        "gdp",
  "2016-02-01",      2,     "US",        "gdp",
  "2016-03-01",      3,     "AU",        "gdp",
  "2016-04-01",      4,     "US",        "gdp",
  "2016-05-01",      5,     "US",        "gdp",
  "2016-02-01",      1,     "US", "population",
  "2016-03-01",      2,     "AU", "population",
  "2016-04-01",      3,     "US", "population",
  "2016-05-01",      4,     "US", "population",
  "2016-06-01",      5,     "US", "population"
)

df %>%
  group_by(country) %>%
  spread(indicator, value) %>%
  mutate(`gdp per capita` = gdp / population) %>%
  gather(indicator, value, -c(date, country)) %>%
  drop_na(value)

# # A tibble: 14 x 4
# # Groups:   country [2]
#          date country      indicator    value
#         <chr>   <chr>          <chr>    <dbl>
#  1 2016-01-01      US            gdp 1.000000
#  2 2016-02-01      US            gdp 2.000000
#  3 2016-03-01      AU            gdp 3.000000
#  4 2016-04-01      US            gdp 4.000000
#  5 2016-05-01      US            gdp 5.000000
#  6 2016-02-01      US     population 1.000000
#  7 2016-03-01      AU     population 2.000000
#  8 2016-04-01      US     population 3.000000
#  9 2016-05-01      US     population 4.000000
# 10 2016-06-01      US     population 5.000000
# 11 2016-02-01      US gdp per capita 2.000000
# 12 2016-03-01      AU gdp per capita 1.500000
# 13 2016-04-01      US gdp per capita 1.333333
# 14 2016-05-01      US gdp per capita 1.250000

N.B. I've modified the data and added a group_by statement to demonstrate the solution with multiple values for country.

Upvotes: 1

Related Questions