Reputation: 1336
I have time series for different groups in which some values are missing:
library(tidyverse)
df <- tribble(
~year, ~country, ~variable,
#--|--|----
2003, "USA", 44,
2004, "USA", 40,
2005, "USA", 30,
# 2006 for USA is missing!
# 2007 for USA is missing!
# 2008 for USA is missing!
2009, "USA", 39,
2010, "USA", 55,
2011, "USA", 53,
2012, "USA", 71,
# 2003 for FRA is missing!
# 2004 for FRA is missing!
2005, "FRA", 10,
2006, "FRA", 8,
2007, "FRA", 13,
2008, "FRA", 12,
2009, "FRA", 18,
2010, "FRA", 39
# 2011 for FRA is missing!
# 2012 for FRA is missing!
)
When I plot my series, geom_line() connects the points even when I have no observations in a year:
ggplot(df, aes(year, variable, color = country)) +
geom_line()
It looks fine for "FRA", as the missing data is at the beginning and end, but for "USA" I don't want the line to connect across 2006 to 2008.
What I would like instead is the following:
df <- tribble(
~year, ~country, ~variable,
#--|--|----
2003, "USA", 44,
2004, "USA", 40,
2005, "USA", 30,
2006, "USA", NA, # explicitly missing!
2007, "USA", NA, # explicitly missing!
2008, "USA", NA, # explicitly missing!
2009, "USA", 39,
2010, "USA", 55,
2011, "USA", 53,
2012, "USA", 71,
2003, "FRA", NA, # explicitly missing!
2004, "FRA", NA, # explicitly missing!
2005, "FRA", 10,
2006, "FRA", 8,
2007, "FRA", 13,
2008, "FRA", 12,
2009, "FRA", 18,
2010, "FRA", 39,
2011, "FRA", NA, # explicitly missing!
2012, "FRA", NA # explicitly missing!
)
ggplot(df, aes(year, variable, color = country)) +
geom_line()
Which makes:
In my real-life dataset I have many groups and dates, so just plugging in the NAs manually at the right place is not an option.
I tried doing some merge with the correct list of dates, but that doesn't solve it:
df %>%
right_join(tibble(year = seq(2003, 2012)))
Any ideas?
Upvotes: 1
Views: 670
Reputation: 70623
This worked for me:
set.seed(357)
xy <- data.frame(year = c(2003:2005, 2009:2012, 2005:2010),
country = c(rep("USA", 7), rep("FR", 6)),
vrbl = rnorm(7+6))
sxy <- split(xy, f = xy$country)
mxy <- data.frame(year = 2003:2012)
out <- sapply(sxy, FUN = function(x, mxy) {
out <- merge(x = mxy, y = x, all = TRUE)
out$country <- unique(x$country)
out
}, mxy = mxy, simplify = FALSE)
out <- do.call(rbind, out)
library(ggplot2)
ggplot(out, aes(x = year, y = vrbl, color = country)) +
theme_bw() +
geom_line()
year country vrbl
FR.1 2003 FR NA
FR.2 2004 FR NA
FR.3 2005 FR 0.22703071
FR.4 2006 FR -0.46901506
FR.5 2007 FR 0.47652129
FR.6 2008 FR -0.91164798
FR.7 2009 FR -0.34177516
FR.8 2010 FR 0.54674134
FR.9 2011 FR NA
FR.10 2012 FR NA
USA.1 2003 USA -1.24111731
USA.2 2004 USA -0.58320499
USA.3 2005 USA 0.39474705
USA.4 2006 USA NA
USA.5 2007 USA NA
USA.6 2008 USA NA
USA.7 2009 USA 1.50421107
USA.8 2010 USA 0.76679974
USA.9 2011 USA 0.31746044
USA.10 2012 USA -0.09997594
Upvotes: 0
Reputation: 3621
The problem is not with ggplot but with your data. The solution is to do a merge before plotting the data. Create a data set with all the combinations of years and countries.
E.g. all_yr <- expand.grid(year = 2000:2010, country = c("CountryA", "CountryB", "CountryZ"))
Then, do a merge between your real data set and this complete data set (all_yr). The merge should include all the years and countries included in the all_yr data set. Those missing in your real_data set will be populated with NA.
E.g. merge(all_yr, real_data, by = c("year", "country"), all.x = TRUE)
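A minimal sketch of this merge approach, assuming the question's values are held in a data frame called real_data with columns year, country and variable. The name full_data is only illustrative, and the grid is built with expand.grid so that every year is paired with every country:

```r
# Data from the question, with the gaps left implicit (rows simply absent)
real_data <- data.frame(
  year     = c(2003:2005, 2009:2012, 2005:2010),
  country  = c(rep("USA", 7), rep("FRA", 6)),
  variable = c(44, 40, 30, 39, 55, 53, 71, 10, 8, 13, 12, 18, 39)
)

# Every year paired with every country
all_yr <- expand.grid(
  year    = 2003:2012,
  country = unique(real_data$country),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every grid row; combinations absent from real_data
# get NA in `variable`
full_data <- merge(all_yr, real_data, by = c("year", "country"), all.x = TRUE)

# Sort so that each country's years are in order
full_data <- full_data[order(full_data$country, full_data$year), ]
```

Passing full_data to the ggplot call from the question should then leave a gap for USA in 2006 to 2008, since geom_line() breaks the line at NA values.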
Upvotes: 0
Reputation: 25375
You could use expand.grid to automatically create the missing values in your dataframe:
df2 = expand.grid(year=unique(df$year),country=unique(df$country)) %>% left_join(df)
ggplot(df2, aes(year, variable, color = country)) +
geom_line()
df2 will then look as follows:
year country variable
1 2003 USA 44
2 2004 USA 40
3 2005 USA 30
4 2009 USA 39
5 2010 USA 55
6 2011 USA 53
7 2012 USA 71
8 2006 USA NA
9 2007 USA NA
10 2008 USA NA
11 2003 FRA NA
12 2004 FRA NA
13 2005 FRA 10
14 2009 FRA 18
15 2010 FRA 39
16 2011 FRA NA
17 2012 FRA NA
18 2006 FRA 8
19 2007 FRA 13
20 2008 FRA 12
and the resulting plot:
Hope this helps!
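As a possible alternative (not part of the answer above), tidyr's complete() can fill in the missing combinations in one step. This is only a sketch, assuming tidyr is installed and that df holds the question's data. Using full_seq() for the year is a choice made here so that a year missing from every country would still be added:

```r
library(tidyr)

# Data from the question, with the gaps left implicit (rows simply absent)
df <- data.frame(
  year     = as.numeric(c(2003:2005, 2009:2012, 2005:2010)),
  country  = c(rep("USA", 7), rep("FRA", 6)),
  variable = c(44, 40, 30, 39, 55, 53, 71, 10, 8, 13, 12, 18, 39)
)

# One row per country and per year in the full observed range;
# combinations that were absent get NA in `variable`
df2 <- complete(df, country, year = full_seq(year, period = 1))
```

The resulting df2 can be plotted with the same ggplot call as above.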
Upvotes: 4