ulima2_

Reputation: 1336

Inserting NA for missing observation in time series for correct line plot

I have time series for several groups (countries), where some observations are missing:

library(tidyverse)

df <- tribble(
  ~year, ~country, ~variable, 
  #--|--|----
  2003, "USA", 44,
  2004, "USA", 40,
  2005, "USA", 30,
  # 2006 for USA is missing!
  # 2007 for USA is missing!
  # 2008 for USA is missing!
  2009, "USA", 39,
  2010, "USA", 55,
  2011, "USA", 53,
  2012, "USA", 71,
  # 2003 for FRA is missing!
  # 2004 for FRA is missing!
  2005, "FRA", 10,
  2006, "FRA", 8,
  2007, "FRA", 13,
  2008, "FRA", 12,
  2009, "FRA", 18,
  2010, "FRA", 39
  # 2011 for FRA is missing!
  # 2012 for FRA is missing!
)

When I plot the series, geom_line() connects the points even across years in which I have no observations:

ggplot(df, aes(year, variable, color = country)) +
  geom_line()

It looks fine for "FRA", as the missing data is at the beginning and end, but for "USA" I don't want the line to be drawn across 2006 to 2008.

What I would like instead is the following:

df <- tribble(
  ~year, ~country, ~variable, 
  #--|--|----
  2003, "USA", 44,
  2004, "USA", 40,
  2005, "USA", 30,
  2006, "USA", NA, # explicitly missing!
  2007, "USA", NA, # explicitly missing!
  2008, "USA", NA, # explicitly missing!
  2009, "USA", 39,
  2010, "USA", 55,
  2011, "USA", 53,
  2012, "USA", 71,
  2003, "FRA", NA, # explicitly missing!
  2004, "FRA", NA, # explicitly missing!
  2005, "FRA", 10,
  2006, "FRA", 8,
  2007, "FRA", 13,
  2008, "FRA", 12,
  2009, "FRA", 18,
  2010, "FRA", 39,
  2011, "FRA", NA, # explicitly missing!
  2012, "FRA", NA # explicitly missing!
)

ggplot(df, aes(year, variable, color = country)) +
  geom_line()

This produces the plot I want, with a gap in the USA line between 2005 and 2009.

In my real-life dataset I have many groups and dates, so manually plugging in the NAs at the right places is not an option.

I tried merging with the full list of years, but that doesn't solve it:

df %>% 
  right_join(tibble(year = seq(2003, 2012)))

Any ideas?

Upvotes: 1

Views: 670

Answers (3)

Roman Luštrik

Reputation: 70623

This worked for me:

set.seed(357)
xy <- data.frame(year = c(2003:2005, 2009:2012, 2005:2010),
                 country = c(rep("USA", 7), rep("FR", 6)),
                 vrbl = rnorm(7+6))

sxy <- split(xy, f = xy$country)
mxy <- data.frame(year = 2003:2012)

out <- sapply(sxy, FUN = function(x, mxy) {
  out <- merge(x = mxy, y = x, all = TRUE)
  out$country <- unique(x$country)
  out
}, mxy = mxy, simplify = FALSE)
out <- do.call(rbind, out)

library(ggplot2)

ggplot(out, aes(x = year, y = vrbl, color = country)) +
  theme_bw() +
  geom_line()

The combined data frame out then looks like this:

       year country        vrbl
FR.1   2003      FR          NA
FR.2   2004      FR          NA
FR.3   2005      FR  0.22703071
FR.4   2006      FR -0.46901506
FR.5   2007      FR  0.47652129
FR.6   2008      FR -0.91164798
FR.7   2009      FR -0.34177516
FR.8   2010      FR  0.54674134
FR.9   2011      FR          NA
FR.10  2012      FR          NA
USA.1  2003     USA -1.24111731
USA.2  2004     USA -0.58320499
USA.3  2005     USA  0.39474705
USA.4  2006     USA          NA
USA.5  2007     USA          NA
USA.6  2008     USA          NA
USA.7  2009     USA  1.50421107
USA.8  2010     USA  0.76679974
USA.9  2011     USA  0.31746044
USA.10 2012     USA -0.09997594

Upvotes: 0

User981636

Reputation: 3621

The problem is not with ggplot but with your data: the missing years are simply absent rather than recorded as NA. The solution is to do a merge before plotting. First create a data set with every combination of year and country.

E.g. all_yr <- expand.grid(year = 2000:2010, country = c("CountryA", "CountryB", "CountryZ"))

Then merge this complete data set (all_yr) with your real data set, keeping every row of all_yr. Combinations that are missing from your real data set will be filled with NA.

E.g. merge(all_yr, real_data, by = c("year", "country"), all.x = TRUE)
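
A minimal sketch of this merge approach applied to the data from the question, assuming the column names year, country and variable used there. The names full_grid and filled are made up for illustration. The merge has to match on both year and country: matching on year alone, as in the question's right_join attempt, adds no rows, because every year from 2003 to 2012 already appears for at least one country.

```r
# Data from the question (gaps: USA 2006-2008, FRA 2003-2004 and 2011-2012)
df <- data.frame(
  year     = c(2003:2005, 2009:2012, 2005:2010),
  country  = c(rep("USA", 7), rep("FRA", 6)),
  variable = c(44, 40, 30, 39, 55, 53, 71, 10, 8, 13, 12, 18, 39)
)

# Every year/country combination (full_grid is a hypothetical name)
full_grid <- expand.grid(
  year    = seq(min(df$year), max(df$year)),
  country = unique(df$country),
  stringsAsFactors = FALSE
)

# Keep all rows of the grid; combinations absent from df get NA in variable
filled <- merge(full_grid, df, by = c("year", "country"), all.x = TRUE)

# Sort so that each country's years are in order
filled <- filled[order(filled$country, filled$year), ]
```

Passing filled to the same ggplot() call should then leave the missing years unconnected, since geom_line() breaks the line at NA values.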

Upvotes: 0

Florian

Reputation: 25375

You could use expand.grid to automatically create the missing rows in your dataframe and then join the original data onto it:

df2 <- expand.grid(year = unique(df$year), country = unique(df$country), stringsAsFactors = FALSE) %>%
  left_join(df, by = c("year", "country"))

ggplot(df2, aes(year, variable, color = country)) +
  geom_line()

df2 will then look as follows:

   year country variable
1  2003     USA       44
2  2004     USA       40
3  2005     USA       30
4  2009     USA       39
5  2010     USA       55
6  2011     USA       53
7  2012     USA       71
8  2006     USA       NA
9  2007     USA       NA
10 2008     USA       NA
11 2003     FRA       NA
12 2004     FRA       NA
13 2005     FRA       10
14 2009     FRA       18
15 2010     FRA       39
16 2011     FRA       NA
17 2012     FRA       NA
18 2006     FRA        8
19 2007     FRA       13
20 2008     FRA       12

In the resulting plot, the USA line is no longer connected across 2006 to 2008.
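
As an alternative sketch (not part of the approach above), tidyr::complete() together with full_seq() can do the same in one step, assuming the tidyr package is installed. The name df_complete is made up for illustration. Unlike unique(df$year), full_seq() builds the whole range of years, so it would also cover a year that is missing for every country.

```r
library(tidyr)

# Data from the question (gaps: USA 2006-2008, FRA 2003-2004 and 2011-2012)
df <- data.frame(
  year     = c(2003:2005, 2009:2012, 2005:2010),
  country  = c(rep("USA", 7), rep("FRA", 6)),
  variable = c(44, 40, 30, 39, 55, 53, 71, 10, 8, 13, 12, 18, 39)
)

# One row per year/country combination; newly created rows get NA in variable
df_complete <- complete(df, year = full_seq(year, period = 1), country)
```

df_complete can then be passed to the same ggplot() call as df2.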

Hope this helps!

Upvotes: 4
