Reputation: 213
i need to merge two dataframes but i can't get output that makes sense to me. i have tried using left and full joins but both have problems.
one of these dataframes, main
, contains information about provider characteristics and the other, parquet
, contains information about whether a provider was affected by a certain policy. i need to merge them in order to do some analytical work.
i have a stata file called main
that looks like this after reading it into r using foreign
.
year prov_id tech_1 teach beds nonprof govt
1 2007 11Z111 0 0 35 0 1
2 2010 11Z111 0 0 35 0 1
3 2001 11Z111 0 0 35 0 1
4 2005 11Z111 0 0 35 0 1
5 2002 11Z111 0 0 35 0 1
6 2004 11Z111 0 0 35 0 1
and i have a parquet file called parquet
which contains information about whether prov_id
received treat
that looks like this after reading it into r using arrow
.
prov_id treat `__index_level_0__`
<chr> <dbl> <int>
1 11z111 0 0
2 11z113 0 1
3 11z132 0 2
4 11z135 0 3
5 11z13z 0 4
6 11z142 0 5
after doing a left join (left_join(main, parquet, by = 'prov_id')
), all values for treat
are NA
instead of 0
or 1
. i don't understand why this is happening. could it be related to the types of data that i'm working with? i have never worked with a parquet file before. this seems like it may not be relevant because r has turned both files into dataframes and i think that should make them compatible.
year prov_id tech_1 teach beds nonprof govt treat __index_level_0__
1 2007 11Z111 0 0 35 0 1 NA NA
2 2010 11Z111 0 0 35 0 1 NA NA
3 2001 11Z111 0 0 35 0 1 NA NA
4 2005 11Z111 0 0 35 0 1 NA NA
5 2002 11Z111 0 0 35 0 1 NA NA
6 2004 11Z111 0 0 35 0 1 NA NA
i think one of the issues may be that there are different numbers of unique prov_id
values in the two datasets, so producing some NA
values may be inevitable, but i don't think the whole treat
variable should consist only of NA
s.
> length(unique(parquet$prov_id))
[1] 1305
> length(unique(main$prov_id))
[1] 2132
i have also tried using a full join (i called the file i created using this type of join p
for simplicity) and the results are a bit different.
not all of the treat
values are NA
, but most of them are, and the number of NA
s is the same as the number of observations in main
. also, the numbers of 0
s and 1
s in p
are the same as the numbers of these values in parquet
.
another thing i should add that is kind of hard for me to show here is that when treat
isn't NA
in p
, the values for all other variables except prov_id
are. this means i can't use p
to estimate the relationships between treat
and any of the other variables.
> table(p$treat, useNA = 'ifany')
0 1 <NA>
1278 27 13050
> table(parquet$treat)
0 1
1278 27
> nrow(main)
[1] 13050
one last issue: i am not sure what __index_level_0__
in parquet
is supposed to represent. it's a 0 indexed sequence of numbers that counts the rows in this dataset, so maybe it was just included by mistake. but maybe it's supposed to represent a year variable? i'm not sure.
any advice for how i can get this working? i have tried other types of joins, but the results are even worse (0 observations). i need to get this data merged properly so that i can use it to estimate some statistical models, but my output so far is unusable.
thanks in advance for any advice.
Upvotes: 1
Views: 182
Reputation: 79246
Use parquet$prov_id <- toupper(parquet$prov_id)
to get uppercase Z in parquet
as commented by William. Here is how:
library(tidyverse)
main <- tribble(
~year, ~prov_id, ~tech_1, ~teach, ~beds, ~nonprof, ~govt,
"2007", "11Z111", 0, 0, 35, 0, 1,
"2010", "11Z111", 0, 0, 35, 0, 1,
"2001", "11Z111", 0, 0, 35, 0, 1,
"2005", "11Z111", 0, 0, 35, 0, 1,
"2002", "11Z111", 0, 0, 35, 0, 1,
"2004", "11Z111", 0, 0, 35, 0, 1)
parquet <- tribble(
~prov_id, ~treat, ~`__index_level_0__`,
"11z111", 0, 0,
"11z113", 0, 1,
"11z132", 0, 2,
"11z135", 0, 3,
"11z13z", 0, 4,
"11z142", 0, 5)
parquet$prov_id <- toupper(parquet$prov_id)
df_joined <- main %>%
full_join(parquet, by="prov_id")
Upvotes: 1