Logan
Logan

Reputation: 41

How do i merge two dataframes in R but keep all missing values.

I need to combine to data frames that have different lengths, and keep all the "missing values". The problem is that there are not really missing values, but rather just less of one value than another.

Example: df1 looks like this:

Shrub value period
1     0.5   1  
2     0.6   1
3     0.7   1
4     0.8   1 
5     0.9   1
10    0.9   1
1     0.4   2
5     0.4   2
6     0.5   2
7     0.3   2
2     0.4   3
3     0.1   3
8     0.5   3
9     0.2   3

df2 looks like this:

Shrub   x    y
1       5    8
2       6    7
3       3    2
4       1    2
5       4    6 
6       5    9
7       9    4 
8       2    1   
9       4    3
10      3    6

and i want the combined dataframe to look like:

Shrub   x    y    value   period
1       5    8    0.5        1
2       6    7    0.6        1
3       3    2    0.7        1
4       1    2    0.8        1
5       4    6    0.9        1 
6       5    9    NA         1
7       9    4    NA         1 
8       2    1    NA         1   
9       4    3    NA         1
10      3    6    0.9        1
1       5    8    0.4        2
2       6    7    NA         2
3       3    2    NA         2
4       1    2    NA         2
5       4    6    0.4        2
6       5    9    0.5        2
7       9    4    0.3        2
8       2    1    NA         2
9       4    3    NA         2
10      3    6    NA         2
1       5    8    NA         3
2       6    7    0.4        3
3       3    2    0.1        3
4       1    2    NA         3
5       4    6    NA         3
6       5    9    NA         3
7       9    4    NA         3 
8       2    1    0.5        3   
9       4    3    0.2        3
10      3    6    NA         3

I have tried the merge command using all = TRUE, but this does not give me what i want. I haven't been able to find this anywhere so any help is appreciated!

Upvotes: 2

Views: 7570

Answers (3)

aosmith
aosmith

Reputation: 36114

This is a situation where complete from package tidyr is useful (this is in tidyr_0.3.0, which is currently available on on github). You can use this function to expand df1 to include all period/Shrub combinations, filling the other variables in with NA by default. Once you do that you can simply join the two datasets together - I'll use inner_join from dplyr.

library(dplyr) library(tidyr)

First, using complete on df1, showing the first 10 lines of output:

complete(df1, period, Shrub)

Source: local data frame [30 x 3]

   period Shrub value
1       1     1   0.5
2       1     2   0.6
3       1     3   0.7
4       1     4   0.8
5       1     5   0.9
6       1     6    NA
7       1     7    NA
8       1     8    NA
9       1     9    NA
10      1    10   0.9
..    ...   ...   ...

Then all you need to do is join this expanded dataset with df2:

complete(df1, period, Shrub) %>%
    inner_join(., df2)
Source: local data frame [30 x 5]

   period Shrub value x y
1       1     1   0.5 5 8
2       1     2   0.6 6 7
3       1     3   0.7 3 2
4       1     4   0.8 1 2
5       1     5   0.9 4 6
6       1     6    NA 5 9
7       1     7    NA 9 4
8       1     8    NA 2 1
9       1     9    NA 4 3
10      1    10   0.9 3 6
..    ...   ...   ... . .

Upvotes: 8

jeremycg
jeremycg

Reputation: 24945

You can use dplyr. This works by taking each period in a seperate frame, and merging with all=TRUE to force all values, then putting it all back together. The cbind(df2,.. part adds on the period to the missing values so we don't get extra NA.:

library(dplyr)
df1 %>% group_by(period) %>%
        do(merge(., cbind(df2, period = .[["period"]][1]), by = c("Shrub", "period"), all = TRUE))

   Shrub period value x y
1      1      1   0.5 5 8
2      2      1   0.6 6 7
3      3      1   0.7 3 2
4      4      1   0.8 1 2
5      5      1   0.9 4 6
6      6      1    NA 5 9
7      7      1    NA 9 4
8      8      1    NA 2 1
9      9      1    NA 4 3
10    10      1   0.9 3 6
11     1      2   0.4 5 8
12     2      2    NA 6 7
13     3      2    NA 3 2
14     4      2    NA 1 2
15     5      2   0.4 4 6
16     6      2   0.5 5 9
17     7      2   0.3 9 4
18     8      2    NA 2 1
19     9      2    NA 4 3
20    10      2    NA 3 6
21     1      3    NA 5 8
22     2      3   0.4 6 7
23     3      3   0.1 3 2
24     4      3    NA 1 2
25     5      3    NA 4 6
26     6      3    NA 5 9
27     7      3    NA 9 4
28     8      3   0.5 2 1
29     9      3   0.2 4 3
30    10      3    NA 3 6

Upvotes: 1

Thomas
Thomas

Reputation: 44565

Start by repeating the rows of df2 to create a "full" dataset (i.e., 30 rows, one for each shrub-period observation), then merge:

tmp <- df2[rep(seq_len(nrow(df2)), times=3),]
tmp$period <- rep(1:3, each = nrow(df2))
out <- merge(tmp, df1, all = TRUE)
rm(tmp) # remove `tmp` data.frame

The result:

> head(out)
  Shrub period x y value
1     1      1 5 8   0.5
2     1      2 5 8   0.4
3     1      3 5 8    NA
4     2      1 6 7   0.6
5     2      2 6 7    NA
6     2      3 6 7   0.4

> str(out)
'data.frame':   30 obs. of  5 variables:
 $ Shrub : int  1 1 1 2 2 2 3 3 3 4 ...
 $ period: int  1 2 3 1 2 3 1 2 3 1 ...
 $ x     : int  5 5 5 6 6 6 3 3 3 1 ...
 $ y     : int  8 8 8 7 7 7 2 2 2 2 ...
 $ value : num  0.5 0.4 NA 0.6 NA 0.4 0.7 NA 0.1 0.8 ...

Upvotes: 6

Related Questions