qwerty
qwerty

Reputation: 213

trouble merging a couple of dataframes

i need to merge two dataframes but i can't get output that makes sense to me. i have tried using left and full joins but both have problems.

one of these dataframes, main, contains information about provider characteristics and the other, parquet, contains information about whether a provider was affected by a certain policy. i need to merge them in order to do some analytical work.

i have a stata file called main that looks like this after reading it into r using foreign.

year    prov_id tech_1 teach beds nonprof govt
1 2007  11Z111      0     0   35       0    1
2 2010  11Z111      0     0   35       0    1
3 2001  11Z111      0     0   35       0    1
4 2005  11Z111      0     0   35       0    1
5 2002  11Z111      0     0   35       0    1
6 2004  11Z111      0     0   35       0    1

and i have a parquet file called parquet which contains information about whether prov_id received treat that looks like this after reading it into r using arrow.

prov_id    treat `__index_level_0__`
  <chr>   <dbl>               <int>
1 11z111      0                   0
2 11z113      0                   1
3 11z132      0                   2
4 11z135      0                   3
5 11z13z      0                   4
6 11z142      0                   5

after doing a left join (left_join(main, parquet, by = 'prov_id')), all values for treat are NA instead of 0 or 1. i don't understand why this is happening. could it be related to the types of data that i'm working with? i have never worked with a parquet file before. this seems like it may not be relevant because r has turned both files into dataframes and i think that should make them compatible.

year    prov_id tech_1 teach beds nonprof govt treat __index_level_0__
1 2007  11Z111      0     0   35       0    1    NA                NA
2 2010  11Z111      0     0   35       0    1    NA                NA
3 2001  11Z111      0     0   35       0    1    NA                NA
4 2005  11Z111      0     0   35       0    1    NA                NA
5 2002  11Z111      0     0   35       0    1    NA                NA
6 2004  11Z111      0     0   35       0    1    NA                NA

i think one of the issues may be that there are different numbers of unique prov_id values in the two datasets, so producing some NA values may be inevitable, but i don't think the whole treat variable should consist only of NAs.

> length(unique(parquet$prov_id))
[1] 1305
> length(unique(main$prov_id))
[1] 2132

i have also tried using a full join (i called the file i created using this type of join p for simplicity) and the results are a bit different.

not all of the treat values are NA, but most of them are, and the number of NAs is the same as the number of observations in main. also, the numbers of 0s and 1s in p are the same as the numbers of these values in parquet.

another thing i should add that is kind of hard for me to show here is that when treat isn't NA in p, the values for all other variables except prov_id are. this means i can't use p to estimate the relationships between treat and any of the other variables.

> table(p$treat, useNA = 'ifany')

    0     1  <NA> 
 1278    27 13050 

> table(parquet$treat)

   0    1 
1278   27 

> nrow(main)
[1] 13050

one last issue: i am not sure what __index_level_0__ in parquet is supposed to represent. it's a 0 indexed sequence of numbers that counts the rows in this dataset, so maybe it was just included by mistake. but maybe it's supposed to represent a year variable? i'm not sure.

any advice for how i can get this working? i have tried other types of joins, but the results are even worse (0 observations). i need to get this data merged properly so that i can use it to estimate some statistical models, but my output so far is unusable.

thanks in advance for any advice.

Upvotes: 1

Views: 182

Answers (1)

TarJae
TarJae

Reputation: 79246

Use parquet$prov_id <- toupper(parquet$prov_id) to get uppercase Z in parquet as commented by William. Here is how:

library(tidyverse)

main <- tribble(
~year,    ~prov_id, ~tech_1, ~teach, ~beds, ~nonprof, ~govt,
"2007", "11Z111", 0, 0, 35, 0, 1, 
"2010", "11Z111", 0, 0, 35, 0, 1, 
"2001", "11Z111", 0, 0, 35, 0, 1, 
"2005", "11Z111", 0, 0, 35, 0, 1, 
"2002", "11Z111", 0, 0, 35, 0, 1, 
"2004", "11Z111", 0, 0, 35, 0, 1)


parquet <- tribble(
  ~prov_id,   ~treat, ~`__index_level_0__`,
"11z111", 0, 0,
"11z113", 0, 1, 
"11z132", 0, 2,
"11z135", 0, 3, 
"11z13z", 0, 4, 
"11z142", 0, 5) 

parquet$prov_id <- toupper(parquet$prov_id)

df_joined <- main %>% 
  full_join(parquet, by="prov_id")

enter image description here

Upvotes: 1

Related Questions