alias.123

Reputation: 7

Merging datasets in R

I am trying to merge two datasets, tablea and tableb. It is for longitudinal purposes, where I have created time-invariant variables such as sex and education type.

joinedaandb <- full_join(tbl_df(tablea), tbl_df(tableb), by = "pidp") 

The unique identifier is pidp. There are a bunch of variables I was expecting to match up such as sex, edtype and age into one variable but instead it has created a sex.x variable and a sex.y variable.

$ pidp        <dbl> 280165, 541285, 541965, 665045, 956765, 987365, 1558565, 1833965, 229…
$ sex.x       <fct> female, male, female, male, male, female, female, male, male, female,…
$ edtype.x    <fct> proxy, proxy, proxy, inapplicable, proxy, proxy, at higher education …
$ age.x       <fct> 32, 25, 23, 29, 56, 21, 18, 46, 36, 17, 29, 22, 28, 57, 20, 33, 27, 6..

A bit further down these y variables appear.

 $ sex.y      : Factor w/ 7 levels "missing","inapplicable",..: 7 6 NA 6 NA NA NA 6 6 NA ...
 $ edtype.y   : Factor w/ 10 levels "missing","inapplicable",..: 3 3 NA 2 NA NA NA 3 3 NA ...
 $ age.y      : Factor w/ 91 levels "missing","inapplicable",..: 23 16 NA 20 NA NA NA 37 26 NA ...

What does this mean? And how do I get variables such as sex from tablea and tableb combined into a single variable in the new data frame?

Cheers

Upvotes: 0

Views: 172

Answers (1)

Gregor Thomas

Reputation: 145755

With joins (merges) you have two options for columns:

  • (a) you join on the column (include it in the by argument), and the values in each table will be compared and must be equal for the rows to be joined
  • (b) you don't join on the column (leave it out of the by argument), and no assumptions are made about whether the values are equal. Columns from each data frame will be included separately, and it's up to you to handle them.
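A minimal sketch of the two options, using small made-up tables. The data here are hypothetical stand-ins for tablea and tableb, not the asker's real data, and only sex is shown:

```r
library(dplyr)

# Hypothetical stand-ins for tablea and tableb (illustrative data only)
tablea <- tibble(pidp = c(1, 2, 3), sex = c("female", "male", "female"))
tableb <- tibble(pidp = c(1, 2, 4), sex = c("female", "male", "male"))

# (b) sex is not in `by`, so both copies are kept, as sex.x and sex.y
by_id <- full_join(tablea, tableb, by = "pidp")
names(by_id)

# (a) sex is in `by`, so it is matched and appears once in the result
by_id_sex <- full_join(tablea, tableb, by = c("pidp", "sex"))
names(by_id_sex)
```

With option (a), a pidp whose sex differed between the two tables would not match and would end up on two separate rows, so this assumes the shared columns really do agree.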

There are a bunch of variables I was expecting to match up

If you want them to be matched, you need to tell your full_join call. This means putting them in the by argument of your full_join, for example by = c("pidp", "sex", "edtype"). If you leave the by argument blank, dplyr's defaults will assume you want to match all columns with the same names. When you do specify a by argument, no more assumptions will be made, and any columns that appear in both data frames but do not appear in the by argument will be kept separately, with .x and .y appended so you can tell the sources apart.
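As a rough illustration, again with hypothetical toy tables rather than the real data, leaving by out joins on every shared column name. One other way to handle the .x and .y pair yourself, which is a suggestion added here and not something the answer prescribes, is to join on pidp alone and collapse the two copies with coalesce(), which takes the first non-missing value:

```r
library(dplyr)

# Hypothetical stand-ins for tablea and tableb (illustrative data only)
tablea <- tibble(pidp = c(1, 2, 3), sex = c("female", "male", "female"))
tableb <- tibble(pidp = c(1, 2, 4), sex = c("female", "male", "male"))

# No `by`: dplyr joins on all columns with the same names (pidp and sex)
# and prints a message reporting which columns it used
natural <- full_join(tablea, tableb)

# Join on pidp only, then merge the two copies of sex by hand:
# coalesce() keeps sex.x where present and falls back to sex.y otherwise
combined <- full_join(tablea, tableb, by = "pidp") %>%
  mutate(sex = coalesce(sex.x, sex.y)) %>%
  select(-sex.x, -sex.y)
```

If the two tables disagree for some pidp, the coalesce() version silently keeps the tablea value, so it is worth checking for mismatches first. The columns in the question are factors; how factors with different level sets are combined may depend on the dplyr version, so converting them with as.character() beforehand is one way to avoid surprises.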

Upvotes: 2
