Bhail
Bhail

Reputation: 407

merge two data frames based on matching rows of multiple columns

Below is the summary and structure of the two data sets I tried to merge claimants and unemp, they can me found here claims.csv and unemp.csv

 > tbl_df(claimants)
# A tibble: 6,960 × 5
       X       County  Month  Year Claimants
   <int>       <fctr> <fctr> <int>     <int>
1      1      ALAMEDA    Jan  2007     13034
2      2       ALPINE    Jan  2007        12
3      3       AMADOR    Jan  2007       487
4      4        BUTTE    Jan  2007      3496
5      5    CALAVERAS    Jan  2007       644
6      6       COLUSA    Jan  2007      1244
7      7 CONTRA COSTA    Jan  2007      8475
8      8    DEL NORTE    Jan  2007       328
9      9    EL DORADO    Jan  2007      2120
10    10       FRESNO    Jan  2007     19974
# ... with 6,950 more rows


> tbl_df(unemp)
# A tibble: 6,960 × 7
    County  Year Month laborforce emplab unemp unemprate
*    <chr> <int> <chr>      <int>  <int> <int>     <dbl>
1  Alameda  2007   Jan     743100 708300 34800       4.7
2  Alameda  2007   Feb     744800 711000 33800       4.5
3  Alameda  2007   Mar     746600 713200 33300       4.5
4  Alameda  2007   Apr     738200 705800 32400       4.4
5  Alameda  2007   May     739100 707300 31800       4.3
6  Alameda  2007   Jun     744900 709100 35800       4.8
7  Alameda  2007   Jul     749600 710900 38700       5.2
8  Alameda  2007   Aug     746700 709600 37000       5.0
9  Alameda  2007   Sep     748200 712100 36000       4.8
10 Alameda  2007   Oct     749000 713000 36100       4.8
# ... with 6,950 more rows

I thought first I should change all the factor columns to character columns.

unemp[sapply(unemp, is.factor)] <- lapply(unemp[sapply(unemp, is.factor)], as.character)

claimants[sapply(claimants, is.factor)] <- lapply(claimants[sapply(claimants, is.factor)], as.character)

m <-merge(unemp, claimants, by = c("County", "Month", "Year"))
dim(m)
[1]  0 10

In the output of dim(m), 0 rows are in the resulting dataframe. All the 6960 rows should match each other uniquely.

To verify that the two data frames have unique combination of the the 3 columns 'County', 'Month', and 'Year' I reorder and rearrange these columns within the dataframes as below:

a <- claimants[ order(claimants[,"County"], claimants[,"Month"], claimants[,"Year"]), ]

b <- unemp[ order(unemp[,"County"], unemp[,"Month"], unemp[,"Year"]), ]

b[2:4] <- b[c(2,4,3)]
a[2:4] %in% b[2:4]
[1] TRUE TRUE TRUE

This last output confirms that all 'County', 'Month', and 'Year' columns match each other in these two dataframes.

I have tried looking into the documentation for merge and could not gather where do I go wrong, I have also tried the inner_join function from dplyr:

> m <- inner_join(unemp[2:8], claimants[2:5])
Joining, by = c("County", "Year", "Month")
> dim(m)
[1] 0 8 

I am missing something and don't know what, would appreciate the help with understanding this, I know I should not have to rearrange the rows by the three columns to run merge R should identify the matching rows and merge the non-matching columns.

Upvotes: 1

Views: 2057

Answers (1)

msubbaiah
msubbaiah

Reputation: 350

The claimants df has the counties in all uppercase, the unemp df has them in lower case.

I used the options(stringsAsFactors = FALSE) when reading in your data. A few suggestions drop the X column in both, it doesn't seem useful.

options(stringsAsFactors = FALSE)
claims <- read.csv("claims.csv",header=TRUE)
claims$X <- NULL
unemp <- read.csv("unemp.csv",header=TRUE)
unemp$X <- NULL
unemp$County <- toupper(unemp$County)

m <- inner_join(unemp, claims)
dim(m)

# [1] 6960    8

Upvotes: 2

Related Questions