Reputation: 269
I have a long data frame
mydf <- data.frame(
+ date=c("2016-01-01","2016-02-01","2016-03-01","2016-04-01","2016-05-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01", "2016-06-01"),
+ value=c(1,2,3,4,5,1,2,3,4,5),
+ country=c("US", "US", "US", "US", "US", "US", "US", "US", "US", "US"),
+ indicator=c("gdp", "gdp", "gdp", "gdp", "gdp", "population", "population", "population", "population", "population"))
date value country indicator
1 2016-01-01 1 US gdp
2 2016-02-01 2 US gdp
3 2016-03-01 3 US gdp
4 2016-04-01 4 US gdp
5 2016-05-01 5 US gdp
6 2016-02-01 1 US population
7 2016-03-01 2 US population
8 2016-04-01 3 US population
9 2016-05-01 4 US population
10 2016-06-01 5 US population
I want to create specific new indicators that come from ratios, e.g. GDP/population*1000
It would look something like this, it has to match the right dates for each respective indicator
mydf <- data.frame(
+ date=c("2016-01-01","2016-02-01","2016-03-01","2016-04-01","2016-05-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01", "2016-06-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01"),
+ value=c(1,2,3,4,5,1,2,3,4,5,2,1.5,1.33,1.2),
+ country=c("US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US"),
+ indicator=c("gdp", "gdp", "gdp", "gdp", "gdp", "population", "population", "population", "population", "population", "gdp per capita", "gdp per capita", "gdp per capita", "gdp per capita"))
date value country indicator
1 2016-01-01 1.00 US gdp
2 2016-02-01 2.00 US gdp
3 2016-03-01 3.00 US gdp
4 2016-04-01 4.00 US gdp
5 2016-05-01 5.00 US gdp
6 2016-02-01 1.00 US population
7 2016-03-01 2.00 US population
8 2016-04-01 3.00 US population
9 2016-05-01 4.00 US population
10 2016-06-01 5.00 US population
11 2016-02-01 2.00 US gdp per capita
12 2016-03-01 1.50 US gdp per capita
13 2016-04-01 1.33 US gdp per capita
14 2016-05-01 1.20 US gdp per capita
Is there an easy way to do this in R?
Upvotes: 3
Views: 55
Reputation: 1622
Personally I find the reshape package easier to work with, and it automatically handles having multiple countries/however many types of labels/data types you have.
library(reshape)
mydf <- data.frame(
date=c("2016-01-01","2016-02-01","2016-03-01","2016-04-01","2016-05-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01",
"2016-06-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01","2016-05-01"),
value=c(1,2,3,4,5,1,2,3,4,5,2,1.5,1.33,1.2, 2),
country=c("US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", 'AU'),
indicator=c("gdp", "gdp", "gdp", "gdp", "gdp", "population", "population", "population",
"population", "population", "gdp per capita", "gdp per capita", "gdp per capita", "gdp per capita", 'gdp'))
To get the new indicator, first make your data into wide format so that the relevant columns are next to each other. This is so you can do simple column wise operations
df_wide = cast(mydf, date+country~indicator, sum)
You want the countries and dates as columns that uniquely define rows (left side of the formula), with differnet indicators as columns (right side of formula)
date country gdp gdp per capita population
1 2016-01-01 US 1 0.00 0
2 2016-02-01 US 2 2.00 1
3 2016-03-01 US 3 1.50 2
4 2016-04-01 US 4 1.33 3
5 2016-05-01 AU 2 0.00 0
6 2016-05-01 US 5 1.20 4
7 2016-06-01 US 0 0.00 5
Now make a new column and set this to whatever you want
df_wide['g_p_ratio'] = df_wide['gdp'] / df_wide['population']
Then use Melt to get this back to your long format
df_new = melt(df_wide, id=c('date'))
Voila!
date country value indicator
gdp 2016-01-01 US 1.00 gdp
gdp.1 2016-02-01 US 2.00 gdp
gdp.2 2016-03-01 US 3.00 gdp
gdp.3 2016-04-01 US 4.00 gdp
gdp.4 2016-05-01 AU 2.00 gdp
gdp.5 2016-05-01 US 5.00 gdp
gdp.6 2016-06-01 US 0.00 gdp
gdp.per.capita 2016-01-01 US 0.00 gdp per capita
gdp.per.capita.1 2016-02-01 US 2.00 gdp per capita
gdp.per.capita.2 2016-03-01 US 1.50 gdp per capita
gdp.per.capita.3 2016-04-01 US 1.33 gdp per capita
gdp.per.capita.4 2016-05-01 AU 0.00 gdp per capita
gdp.per.capita.5 2016-05-01 US 1.20 gdp per capita
gdp.per.capita.6 2016-06-01 US 0.00 gdp per capita
population 2016-01-01 US 0.00 population
population.1 2016-02-01 US 1.00 population
population.2 2016-03-01 US 2.00 population
population.3 2016-04-01 US 3.00 population
population.4 2016-05-01 AU 0.00 population
population.5 2016-05-01 US 4.00 population
population.6 2016-06-01 US 5.00 population
You may or may not want your new row labels, but you can fix that
rownames(df_new) <- 1:nrow(df_new)
Upvotes: 2
Reputation: 6264
Yes, I think it is easier to make the changes you want with a tidy approach using tidyr
and dplyr
.
library(dplyr)
library(tidyr)
df <- tribble(
~date, ~value, ~country, ~indicator,
"2016-01-01", 1, "US", "gdp",
"2016-02-01", 2, "US", "gdp",
"2016-03-01", 3, "AU", "gdp",
"2016-04-01", 4, "US", "gdp",
"2016-05-01", 5, "US", "gdp",
"2016-02-01", 1, "US", "population",
"2016-03-01", 2, "AU", "population",
"2016-04-01", 3, "US", "population",
"2016-05-01", 4, "US", "population",
"2016-06-01", 5, "US", "population"
)
df %>%
group_by(country) %>%
spread(indicator, value) %>%
mutate(`gdp per capita` = gdp / population) %>%
gather(indicator, value, -c(date, country)) %>%
drop_na(value)
# # A tibble: 14 x 4
# # Groups: country [2]
# date country indicator value
# <chr> <chr> <chr> <dbl>
# 1 2016-01-01 US gdp 1.000000
# 2 2016-02-01 US gdp 2.000000
# 3 2016-03-01 AU gdp 3.000000
# 4 2016-04-01 US gdp 4.000000
# 5 2016-05-01 US gdp 5.000000
# 6 2016-02-01 US population 1.000000
# 7 2016-03-01 AU population 2.000000
# 8 2016-04-01 US population 3.000000
# 9 2016-05-01 US population 4.000000
# 10 2016-06-01 US population 5.000000
# 11 2016-02-01 US gdp per capita 2.000000
# 12 2016-03-01 AU gdp per capita 1.500000
# 13 2016-04-01 US gdp per capita 1.333333
# 14 2016-05-01 US gdp per capita 1.250000
N.B. I've modified the data and added a
group_by
statement to demonstrate the solution with multiple values forcountry
.
Upvotes: 1