Reputation: 7
I am trying to merge two datasets, tablea and tableb. It is for longitudinal purposes, where I have created time-invariant variables such as sex and education type.
joinedaandb <- full_join(tbl_df(tablea), tbl_df(tableb), by = "pidp")
The unique identifier is pidp. There are a bunch of variables I was expecting to match up, such as sex, edtype and age, into one variable each, but instead it has created a sex.x variable and a sex.y variable.
$ pidp <dbl> 280165, 541285, 541965, 665045, 956765, 987365, 1558565, 1833965, 229…
$ sex.x <fct> female, male, female, male, male, female, female, male, male, female,…
$ edtype.x <fct> proxy, proxy, proxy, inapplicable, proxy, proxy, at higher education …
$ age.x <fct> 32, 25, 23, 29, 56, 21, 18, 46, 36, 17, 29, 22, 28, 57, 20, 33, 27, 6..
A bit further down these y variables appear.
$ sex.y : Factor w/ 7 levels "missing","inapplicable",..: 7 6 NA 6 NA NA NA 6 6 NA ...
$ edtype.y : Factor w/ 10 levels "missing","inapplicable",..: 3 3 NA 2 NA NA NA 3 3 NA ...
$ age.y : Factor w/ 91 levels "missing","inapplicable",..: 23 16 NA 20 NA NA NA 37 26 NA ...
What does this mean? And how do I get it to combine variables such as sex from tablea and tableb into a single variable in the new data frame?
Cheers
Upvotes: 0
Views: 172
Reputation: 145755
With joins (merges) you have two options for columns that appear in both tables:
- Put the column in the by argument, and the values in each table will be compared and must be equal for the row to be joined.
- Leave the column out of the by argument, and no assumptions are made about whether the values are equal. Columns from each data frame will be included separately, and it's up to you to handle them.

> There are a bunch of variables I was expecting to match up
If you want them to be matched, you need to say so in your full_join call. This means putting them in the by argument to your full_join. If you leave the by argument blank, dplyr's defaults will assume you want to match all columns with the same names. When you do specify a by argument, no more assumptions will be made, and any columns that appear in both data frames but do not appear in the by argument will be kept separately, with .x and .y appended so you can tell the sources apart.
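As a rough illustration, here is a minimal sketch on toy data. The values and the wave_a and wave_b columns are made up for the example, and sex is stored as character rather than factor to keep things simple. The last step, combining sex.x and sex.y with coalesce(), is not part of the explanation above. It is one possible alternative, assuming you would rather join on pidp alone and take whichever value is not missing.

```r
library(dplyr)

# Toy stand-ins for tablea and tableb (hypothetical values, not the real data)
tablea <- tibble(
  pidp   = c(1, 2, 3),
  sex    = c("female", "male", "female"),
  wave_a = c(10, 20, 30)
)
tableb <- tibble(
  pidp   = c(1, 2, 4),
  sex    = c("female", "male", "male"),
  wave_b = c(5, 6, 7)
)

# Only pidp in by: sex exists in both tables but is not a join key,
# so it is kept twice, as sex.x and sex.y
by_id <- full_join(tablea, tableb, by = "pidp")
names(by_id)
# "pidp" "sex.x" "wave_a" "sex.y" "wave_b"

# pidp and sex both in by: rows must agree on both, and sex appears once
by_both <- full_join(tablea, tableb, by = c("pidp", "sex"))
names(by_both)
# "pidp" "sex" "wave_a" "wave_b"

# Alternative (an assumption about what is wanted): join on pidp only,
# then take the first non-missing value of sex.x and sex.y
combined <- by_id %>%
  mutate(sex = coalesce(sex.x, sex.y)) %>%
  select(-sex.x, -sex.y)
```

Note that putting sex in by means a person whose sex is recorded differently in the two tables will not match and will end up on two separate rows, so it is worth checking the shared columns for disagreements before choosing between the two approaches.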
Upvotes: 2