Reputation: 123
I want to merge two data frames. The first one has less rows (2354), one unique ID, and two repeated ID's (CVE_MUN and CVE_ENT), these two are geographical identifiers.
# A tibble: 2,354 x 6
CLAVE NOMBRE CVE_ENT `ENTIDAD FEDERATIVA~ CVE_MUN `MUNICIPIO (INEG~
<chr> <chr> <dbl> <chr> <dbl> <chr>
1 03AGN-~ Aguascalien~ 1 AGUASCALIENTES 1 AGUASCALIENTES
2 03AGO-~ Agostaderito 1 AGUASCALIENTES 1 AGUASCALIENTES
3 03AGP-~ Aguascalien~ 1 AGUASCALIENTES 1 AGUASCALIENTES
4 03AGS-~ Aguascalien~ 1 AGUASCALIENTES 1 AGUASCALIENTES
5 03AIS-~ Alfa Nissan 1 AGUASCALIENTES 1 AGUASCALIENTES
6 03ALF-~ Nissan Mexi~ 1 AGUASCALIENTES 1 AGUASCALIENTES
7 03APT-~ Aguascalien~ 1 AGUASCALIENTES 1 AGUASCALIENTES
8 03ASS-~ Aguascalien~ 1 AGUASCALIENTES 1 AGUASCALIENTES
9 03ASU-~ Asuncion 1 AGUASCALIENTES 1 AGUASCALIENTES
10 03AYU-~ Ayuntamiento 1 AGUASCALIENTES 1 AGUASCALIENTES
So I have 26 different ID's (CLAVE) in a certain area (CVE_ENT:1, CVE_MUN: 1), 3 ID's in an other area (CVE_ENT:1, CVE_MUN:2), etc
CVE_ENT CVE_MUN Freq
1 1 26
1 2 3
1 3 2
1 5 3
1 6 1
1 7 1
1 9 1
1 10 2
1 11 2
I want to map all the ID's, so I am trying to merge it to a very detailed geographical data frame. This data frame has 299615 rows, each "MAPA" row has a unique long,lat coordinates.
MAPA CVE_ENT NOM_ENT NOM_ABR CVE_MUN NOM_MUN CVE_LOC
1 10010001 1 AGUASCALIENTES AGS. 1 AGUASCALIENTES 1
2 10010094 1 AGUASCALIENTES AGS. 1 AGUASCALIENTES 94
3 10010096 1 AGUASCALIENTES AGS. 1 AGUASCALIENTES 96
4 10010100 1 AGUASCALIENTES AGS. 1 AGUASCALIENTES 100
5 10010102 1 AGUASCALIENTES AGS. 1 AGUASCALIENTES 102
6 10010104 1 AGUASCALIENTES AGS. 1 AGUASCALIENTES 104
7 10010106 1 AGUASCALIENTES AGS. 1 AGUASCALIENTES 106
8 10010112 1 AGUASCALIENTES AGS. 1 AGUASCALIENTES 112
9 10010113 1 AGUASCALIENTES AGS. 1 AGUASCALIENTES 113
10 10010120 1 AGUASCALIENTES AGS. 1 AGUASCALIENTES 120
11 10010121 1 AGUASCALIENTES AGS. 1 AGUASCALIENTES 121
12 10010125 1 AGUASCALIENTES AGS. 1 AGUASCALIENTES 125
13 10010126 1 AGUASCALIENTES AGS. 1 AGUASCALIENTES 126
14 10010127 1 AGUASCALIENTES AGS. 1 AGUASCALIENTES 127
15 10010128 1 AGUASCALIENTES AGS. 1 AGUASCALIENTES 128
16 10010135 1 AGUASCALIENTES AGS. 1 AGUASCALIENTES 135
17 10010138 1 AGUASCALIENTES AGS. 1 AGUASCALIENTES 138
18 10010139 1 AGUASCALIENTES AGS. 1 AGUASCALIENTES 139
19 10010141 1 AGUASCALIENTES AGS. 1 AGUASCALIENTES 141
20 10010144 1 AGUASCALIENTES AGS. 1 AGUASCALIENTES 144
Again, I have CVE_ENT and CVE_MUN but this time they are repeated many more times than with my first data frame.
CVE_ENT CVE_MUN Freq
1 1 725
1 2 242
1 3 293
1 4 94
What I want is to merge df1 given CVE_ENT and CVE_MUN with df2 without repeating MAPA. That is, that "CLAVE" gets merge to a different "MAPA" given CVE_ENT and CVE_MUN. Ideally, df1 will merge the first 26 CVE_ENT=1 and CVE_MUN=1 to the first 26 CVE_ENT=1 and CVE_MUN=1 of df2.
So far I have tried left_join and filtering by "CLAVE" but I'm getting "MAPA" repeated so again, I don't have a unique geographical long, lat.
Coord<-left_join(df1,df2,by=c("CVE_ENT","CVE_MUN"))
DT <- Coord[!duplicated(Coord$CLAVE),]
My ideal final data frame looks like this:
CLAVE NOMBRE CVE_ENT `ENTIDAD FEDERATIVA~ CVE_MUN `MUNICIPIO (INEG~ MAPA
1 03AGN-~ Aguascalien~ 1 AGUASCALIENTES 1 AGUASCALIENTES 10010001
2 03AGO-~ Agostaderito 1 AGUASCALIENTES 1 AGUASCALIENTES 10010094
3 03AGP-~ Aguascalien~ 1 AGUASCALIENTES 1 AGUASCALIENTES 10010096
4 03AGS-~ Aguascalien~ 1 AGUASCALIENTES 1 AGUASCALIENTES 10010100
5 03AIS-~ Alfa Nissan 1 AGUASCALIENTES 1 AGUASCALIENTES 10010102
6 03ALF-~ Nissan Mexi~ 1 AGUASCALIENTES 1 AGUASCALIENTES 10010104
7 03APT-~ Aguascalien~ 1 AGUASCALIENTES 1 AGUASCALIENTES 10010106
8 03ASS-~ Aguascalien~ 1 AGUASCALIENTES 1 AGUASCALIENTES 10010112
9 03ASU-~ Asuncion 1 AGUASCALIENTES 1 AGUASCALIENTES 10010113
10 03AYU-~ Ayuntamiento 1 AGUASCALIENTES 1 AGUASCALIENTES 10010120
Upvotes: 0
Views: 243
Reputation: 3134
You can do that in two steps:
df1
and df2
(what a join
typically does)cbind()
or bind_cols()
, since you actually don't want to do a joinTo keep the matching subsets together, it's practical to use nesting.
library(tidyverse)
# Dummy data
df1 <- tibble(uniq_id = letters[1:10],
id1 = rep(1:3, times = c(5,3,2)),
id2 = rep(1:5, each = 2))
df2 <- tibble(id1 = rep(1:3, each = 15),
id2 = rep(1:5, times = 9),
coord = paste("coord",id1, id2, c(LETTERS, LETTERS[1:19]), sep="-"))
df1 %>%
nest_join(df2, by = c("id1", "id2")) %>%
nest(df1 = uniq_id) %>%
rowwise() %>%
mutate(cbound = list(bind_cols(df1, df2[1:nrow(df1), ]))) %>%
select(-df1, -df2) %>%
unnest(cbound)
But I do have doubts about why doing that: normally a join is well-defined since you have a particular criterion to match rows in df1 to df2. Here, why should AGN
have MAPA=10010001 and AGO
have MAPA=10010094, and not the opposite?
You can obtain the same result using a base R solution:
# Dummy data
df1 <- data.frame(uniq_id = letters[1:10],
id1 = rep(1:3, times = c(5,3,2)),
id2 = rep(1:5, each = 2))
df2 <- data.frame(id1 = rep(1:3, each = 15),
id2 = rep(1:5, times = 9))
df2$coord <- paste("coord",df2$id1, df2$id2, c(LETTERS, LETTERS[1:19]), sep="-")
parallel_merge <- function(xy){
x <- xy[1]
y <- xy[2]
df1_rows <- which(df1$id1 == x & df1$id2 == y)
if(length(df1_rows) == 0) return(NULL)
df2_rows <- which(df2$id1 == x & df2$id2 == y)
df2_rows <- df2_rows[1:length(df1_rows)]
cbind(df1[df1_rows,],
df2[df2_rows,])
}
combinations <- expand.grid(unique(df1$id1), unique(df2$id2))
coords_by_comb <- apply(combinations, 1, parallel_merge)
do.call(rbind, coords_by_comb)
The logic is slightly different: we first use expand.grid()
to generate every combination of id1
and id2
, then we can use apply()
to loop on these combinations (note that if it some combinations do not exist in df1
, this is not an optimal solution). For each combination, in the function parallel_merge()
, we find the subset of df1
to keep, the subset of df2
to keep and truncate it to match df1
. Then we just need to assemble everything.
Upvotes: 1