user6131384
user6131384

Reputation:

Add lines with NA values

I have a data frame like this:

   indx country year  death value
1     1   Italy 2000    hiv     1
2     1   Italy 2001    hiv     2
3     1   Italy 2005    hiv     3
4     1   Italy 2000 cancer     4
5     1   Italy 2001 cancer     5
6     1   Italy 2002 cancer     6
7     1   Italy 2003 cancer     7
8     1   Italy 2004 cancer     8
9     1   Italy 2005 cancer     9
10    4  France 2000    hiv    10
11    4  France 2004    hiv    11
12    4  France 2005    hiv    12
13    4  France 2001 cancer    13
14    4  France 2002 cancer    14
15    4  France 2003 cancer    15
16    4  France 2004 cancer    16
17    2   Spain 2000    hiv    17
18    2   Spain 2001    hiv    18
19    2   Spain 2002    hiv    19
20    2   Spain 2003    hiv    20
21    2   Spain 2004    hiv    21
22    2   Spain 2005    hiv    22
23    2   Spain  ...    ...    ...

indx is a value linked to the country (same country = same indx).

In this example I used only 3 countries (country) and 2 disease (death), in the original data frame are many more.

I would like to have one row for each country for each disease from 2000 to 2005.

What I would like to get is:

    indx  country  year  death  value
1      1    Italy  2000    hiv      1
2      1    Italy  2001    hiv      2
3      1    Italy  2002    hiv     NA
4      1    Italy  2003    hiv     NA
5      1    Italy  2004    hiv     NA
6      1    Italy  2005    hiv      3
7      1    Italy  2000 cancer      4
8      1    Italy  2001 cancer      5
9      1    Italy  2002 cancer      6
10     1    Italy  2003 cancer      7
11     1    Italy  2004 cancer      8
12     1    Italy  2005 cancer      9
13     4   France  2000    hiv     10
14     4   France  2001    hiv     NA
15     4   France  2002    hiv     NA
16     4   France  2003    hiv     NA
17     4   France  2004    hiv     11
18     4   France  2005    hiv     12
19     4   France  2000 cancer     NA
20     4   France  2001 cancer     13
21     4   France  2002 cancer     14
22     4   France  2003 cancer     15
23     4   France  2004 cancer     16
24     4   France  2005 cancer     NA
25     2    Spain  2000    hiv     17
26     2    Spain  2001    hiv     18
27     2    Spain  2002    hiv     19
28     2    Spain  2003    hiv     20
29     2    Spain  2004    hiv     21
30     2    Spain  2005    hiv     22
31     2    Spain  ...     ...     ...

I.e. I would like to add lines with value = NA at the missing years for each country for each disease.

For example, it lacks data of HIV in Italy between 2002 and 2004 and then I add this lines with value = NA.

How can I do that?

For a reproducible example:

indx <- c(rep(1, times=9), rep(4, times=7), rep(2, times=6))
country <- c(rep("Italy", times=9), rep("France", times=7), rep("Spain", times=6))
year <- c(2000, 2001, 2005, 2000:2005, 2000, 2004, 2005, 2001:2004, 2000:2005)
death <- c(rep("hiv", times=3), rep("cancer", times=6), rep("hiv", times=3), rep("cancer", times=4), rep("hiv", times=6))
value <- c(1:22)
dfl <- data.frame(indx, country, year, death, value)

Upvotes: 2

Views: 255

Answers (3)

lmo
lmo

Reputation: 38500

Here is a longer base R method. You create two new data.frames, one that contains all combinations of the country, year, and death, and a second that contains an index key.

# get data.frame with every combination of country, year, and death
dfNew <- with(df, expand.grid("country"=unique(country), "year"=unique(year),
                              "death"=unique(death)))

# get index key
indexKey <- unique(df[, c("indx", "country")])

# merge these together
dfNew <- merge(indexKey, dfNew, by="country")

# merge onto original data set
dfNew <- merge(df, dfNew, by=c("indx", "country", "year", "death"), all=TRUE)

This returns

dfNew
   indx country year  death value
1     1   Italy 2000 cancer     4
2     1   Italy 2000    hiv     1
3     1   Italy 2001 cancer     5
4     1   Italy 2001    hiv     2
5     1   Italy 2002 cancer     6
6     1   Italy 2002    hiv    NA
7     1   Italy 2003 cancer     7
8     1   Italy 2003    hiv    NA
9     1   Italy 2004 cancer     8
10    1   Italy 2004    hiv    NA
11    1   Italy 2005 cancer     9
12    1   Italy 2005    hiv     3
13    2   Spain 2000 cancer    NA
14    2   Spain 2000    hiv    17
15    2   Spain 2001 cancer    NA
...

If df is a data.table, here are the corresponding lines of code:

# CJ is a cross-join
setkey(df, country, year, death)
dfNew <- df[CJ(country, year, death, unique=TRUE),
            .(country, year, death, value)]

indexKey <- unique(df[, .(indx, country)])

dfNew <- merge(indexKey, dfNew, by="country")

dfNew <- merge(df, dfNew, by=c("indx", "country", "year", "death"), all=TRUE)

Note that it rather than using CJ, it is also possible to use expand.grid as in the data.frame version:

dfNew <- df[, expand.grid("country"=unique(country), "year"=unique(year),
                          "death"=unique(death))]

Upvotes: 1

talat
talat

Reputation: 70266

Using base R, you could do:

# setDF(dfl) # run this first if you have a data.table
merge(expand.grid(lapply(dfl[c("country", "death", "year")], unique)), dfl, all.x = TRUE)

This first creates all combinations of the unique values in country, death, and year and then merges it to the original data, to add the values and where combinations were not in the original data, it adds NAs.

In the package tidyr, there's a special function that does this for you with a a single command:

library(tidyr)
complete(dfl, country, year, death)

Upvotes: 3

alistaire
alistaire

Reputation: 43334

tidyr::complete helps create all combinations of the variables you pass it, but if you have two columns that are identical, it will over-expand or leave NAs where you don't want. As a workaround you can use dplyr grouping (df %>% group_by(indx, country) %>% complete(death, year)) or just merge the two columns into one temporarily:

library(tidyr)

       # merge indx and country into a single column so they won't over-expand
df %>% unite(indx_country, indx, country) %>% 
    # fill in missing combinations of new column, death, and year
    complete(indx_country, death, year) %>% 
    # separate indx and country back to how they were
    separate(indx_country, c('indx', 'country'))

# Source: local data frame [36 x 5]
# 
#     indx country  death  year value
#    (chr)   (chr) (fctr) (int) (int)
# 1      1   Italy cancer  2000     4
# 2      1   Italy cancer  2001     5
# 3      1   Italy cancer  2002     6
# 4      1   Italy cancer  2003     7
# 5      1   Italy cancer  2004     8
# 6      1   Italy cancer  2005     9
# 7      1   Italy    hiv  2000     1
# 8      1   Italy    hiv  2001     2
# 9      1   Italy    hiv  2002    NA
# 10     1   Italy    hiv  2003    NA
# ..   ...     ...    ...   ...   ...

Upvotes: 0

Related Questions