Reputation: 41
I need to combine to data frames that have different lengths, and keep all the "missing values". The problem is that there are not really missing values, but rather just less of one value than another.
Example: df1 looks like this:
Shrub value period
1 0.5 1
2 0.6 1
3 0.7 1
4 0.8 1
5 0.9 1
10 0.9 1
1 0.4 2
5 0.4 2
6 0.5 2
7 0.3 2
2 0.4 3
3 0.1 3
8 0.5 3
9 0.2 3
df2 looks like this:
Shrub x y
1 5 8
2 6 7
3 3 2
4 1 2
5 4 6
6 5 9
7 9 4
8 2 1
9 4 3
10 3 6
and i want the combined dataframe to look like:
Shrub x y value period
1 5 8 0.5 1
2 6 7 0.6 1
3 3 2 0.7 1
4 1 2 0.8 1
5 4 6 0.9 1
6 5 9 NA 1
7 9 4 NA 1
8 2 1 NA 1
9 4 3 NA 1
10 3 6 0.9 1
1 5 8 0.4 2
2 6 7 NA 2
3 3 2 NA 2
4 1 2 NA 2
5 4 6 0.4 2
6 5 9 0.5 2
7 9 4 0.3 2
8 2 1 NA 2
9 4 3 NA 2
10 3 6 NA 2
1 5 8 NA 3
2 6 7 0.4 3
3 3 2 0.1 3
4 1 2 NA 3
5 4 6 NA 3
6 5 9 NA 3
7 9 4 NA 3
8 2 1 0.5 3
9 4 3 0.2 3
10 3 6 NA 3
I have tried the merge command using all = TRUE, but this does not give me what i want. I haven't been able to find this anywhere so any help is appreciated!
Upvotes: 2
Views: 7570
Reputation: 36114
This is a situation where complete
from package tidyr is useful (this is in tidyr_0.3.0, which is currently available on on github). You can use this function to expand df1
to include all period/Shrub combinations, filling the other variables in with NA
by default. Once you do that you can simply join the two datasets together - I'll use inner_join
from dplyr.
library(dplyr) library(tidyr)
First, using complete
on df1
, showing the first 10 lines of output:
complete(df1, period, Shrub)
Source: local data frame [30 x 3]
period Shrub value
1 1 1 0.5
2 1 2 0.6
3 1 3 0.7
4 1 4 0.8
5 1 5 0.9
6 1 6 NA
7 1 7 NA
8 1 8 NA
9 1 9 NA
10 1 10 0.9
.. ... ... ...
Then all you need to do is join this expanded dataset with df2
:
complete(df1, period, Shrub) %>%
inner_join(., df2)
Source: local data frame [30 x 5]
period Shrub value x y
1 1 1 0.5 5 8
2 1 2 0.6 6 7
3 1 3 0.7 3 2
4 1 4 0.8 1 2
5 1 5 0.9 4 6
6 1 6 NA 5 9
7 1 7 NA 9 4
8 1 8 NA 2 1
9 1 9 NA 4 3
10 1 10 0.9 3 6
.. ... ... ... . .
Upvotes: 8
Reputation: 24945
You can use dplyr
. This works by taking each period in a seperate frame, and merging with all=TRUE
to force all values, then putting it all back together. The cbind(df2,..
part adds on the period to the missing values so we don't get extra NA
.:
library(dplyr)
df1 %>% group_by(period) %>%
do(merge(., cbind(df2, period = .[["period"]][1]), by = c("Shrub", "period"), all = TRUE))
Shrub period value x y
1 1 1 0.5 5 8
2 2 1 0.6 6 7
3 3 1 0.7 3 2
4 4 1 0.8 1 2
5 5 1 0.9 4 6
6 6 1 NA 5 9
7 7 1 NA 9 4
8 8 1 NA 2 1
9 9 1 NA 4 3
10 10 1 0.9 3 6
11 1 2 0.4 5 8
12 2 2 NA 6 7
13 3 2 NA 3 2
14 4 2 NA 1 2
15 5 2 0.4 4 6
16 6 2 0.5 5 9
17 7 2 0.3 9 4
18 8 2 NA 2 1
19 9 2 NA 4 3
20 10 2 NA 3 6
21 1 3 NA 5 8
22 2 3 0.4 6 7
23 3 3 0.1 3 2
24 4 3 NA 1 2
25 5 3 NA 4 6
26 6 3 NA 5 9
27 7 3 NA 9 4
28 8 3 0.5 2 1
29 9 3 0.2 4 3
30 10 3 NA 3 6
Upvotes: 1
Reputation: 44565
Start by repeating the rows of df2
to create a "full" dataset (i.e., 30 rows, one for each shrub-period observation), then merge:
tmp <- df2[rep(seq_len(nrow(df2)), times=3),]
tmp$period <- rep(1:3, each = nrow(df2))
out <- merge(tmp, df1, all = TRUE)
rm(tmp) # remove `tmp` data.frame
The result:
> head(out)
Shrub period x y value
1 1 1 5 8 0.5
2 1 2 5 8 0.4
3 1 3 5 8 NA
4 2 1 6 7 0.6
5 2 2 6 7 NA
6 2 3 6 7 0.4
> str(out)
'data.frame': 30 obs. of 5 variables:
$ Shrub : int 1 1 1 2 2 2 3 3 3 4 ...
$ period: int 1 2 3 1 2 3 1 2 3 1 ...
$ x : int 5 5 5 6 6 6 3 3 3 1 ...
$ y : int 8 8 8 7 7 7 2 2 2 2 ...
$ value : num 0.5 0.4 NA 0.6 NA 0.4 0.7 NA 0.1 0.8 ...
Upvotes: 6