Reputation: 407
Below is the summary and structure of the two data sets I tried to merge claimants
and unemp
, they can me found here claims.csv and unemp.csv
> tbl_df(claimants)
# A tibble: 6,960 × 5
X County Month Year Claimants
<int> <fctr> <fctr> <int> <int>
1 1 ALAMEDA Jan 2007 13034
2 2 ALPINE Jan 2007 12
3 3 AMADOR Jan 2007 487
4 4 BUTTE Jan 2007 3496
5 5 CALAVERAS Jan 2007 644
6 6 COLUSA Jan 2007 1244
7 7 CONTRA COSTA Jan 2007 8475
8 8 DEL NORTE Jan 2007 328
9 9 EL DORADO Jan 2007 2120
10 10 FRESNO Jan 2007 19974
# ... with 6,950 more rows
> tbl_df(unemp)
# A tibble: 6,960 × 7
County Year Month laborforce emplab unemp unemprate
* <chr> <int> <chr> <int> <int> <int> <dbl>
1 Alameda 2007 Jan 743100 708300 34800 4.7
2 Alameda 2007 Feb 744800 711000 33800 4.5
3 Alameda 2007 Mar 746600 713200 33300 4.5
4 Alameda 2007 Apr 738200 705800 32400 4.4
5 Alameda 2007 May 739100 707300 31800 4.3
6 Alameda 2007 Jun 744900 709100 35800 4.8
7 Alameda 2007 Jul 749600 710900 38700 5.2
8 Alameda 2007 Aug 746700 709600 37000 5.0
9 Alameda 2007 Sep 748200 712100 36000 4.8
10 Alameda 2007 Oct 749000 713000 36100 4.8
# ... with 6,950 more rows
I thought first I should change all the factor
columns to character
columns.
unemp[sapply(unemp, is.factor)] <- lapply(unemp[sapply(unemp, is.factor)], as.character)
claimants[sapply(claimants, is.factor)] <- lapply(claimants[sapply(claimants, is.factor)], as.character)
m <-merge(unemp, claimants, by = c("County", "Month", "Year"))
dim(m)
[1] 0 10
In the output of dim(m)
, 0 rows are in the resulting dataframe. All the 6960 rows should match each other uniquely.
To verify that the two data frames have unique combination of the the 3 columns 'County', 'Month', and 'Year' I reorder and rearrange these columns within the dataframes as below:
a <- claimants[ order(claimants[,"County"], claimants[,"Month"], claimants[,"Year"]), ]
b <- unemp[ order(unemp[,"County"], unemp[,"Month"], unemp[,"Year"]), ]
b[2:4] <- b[c(2,4,3)]
a[2:4] %in% b[2:4]
[1] TRUE TRUE TRUE
This last output confirms that all 'County', 'Month', and 'Year' columns match each other in these two dataframes.
I have tried looking into the documentation for merge
and could not gather where do I go wrong, I have also tried the inner_join
function from dplyr
:
> m <- inner_join(unemp[2:8], claimants[2:5])
Joining, by = c("County", "Year", "Month")
> dim(m)
[1] 0 8
I am missing something and don't know what, would appreciate the help with understanding this, I know I should not have to rearrange the rows by the three columns to run merge
R should identify the matching rows and merge the non-matching columns.
Upvotes: 1
Views: 2057
Reputation: 350
The claimants df has the counties in all uppercase, the unemp df has them in lower case.
I used the options(stringsAsFactors = FALSE) when reading in your data. A few suggestions drop the X column in both, it doesn't seem useful.
options(stringsAsFactors = FALSE)
claims <- read.csv("claims.csv",header=TRUE)
claims$X <- NULL
unemp <- read.csv("unemp.csv",header=TRUE)
unemp$X <- NULL
unemp$County <- toupper(unemp$County)
m <- inner_join(unemp, claims)
dim(m)
# [1] 6960 8
Upvotes: 2