Reputation: 19
tableA$area = c("太原市尖草坪区","安庆市宜秀区","北京市大兴区")
tableB$locate = c("尖草坪区", "宜秀区", "大兴区")
I need to combine tableA and tableB by c("area" = "locate"). How can I realize it?
Upvotes: 0
Views: 88
Reputation: 6769
Alternatively, you could also try something like this using @starja's tableA
and tableB
.
First, generate a unique identifiers (id
) using locate
in TableB
by removing 区 and 县. Then, find id
strings in tableA
and add a variable id
too. Finally, join tableA
with tableB
with by = "id"
.
library(tidyverse)
tableB$id = str_remove_all(tableB$locate, "区|县")
find <- paste(unique(tableB$id), collapse="|")
tableA$id <- str_extract_all(tableA$area, find, simplify = TRUE)[,1]
left_join(tableA, tableB, by = "id")
Upvotes: 1
Reputation: 10375
Assuming that tableB$locate
always contains the part of tableA$area
after 市, you can use a regex to match everything after 市 and use this for the join:
tableA <- data.frame(
other_col = 1:3,
area = c("太原市尖草坪区","安庆市宜秀区","北京市大兴区")
)
tableB <- data.frame(
type_col = c("A", "B", "C"),
locate = c("尖草坪区", "宜秀区", "大兴区")
)
library(dplyr)
library(stringr)
merged_table <- tableA %>%
mutate(area_wo_city = str_extract(area, "(?<=市).*$")) %>%
left_join(tableB, by = c("area_wo_city" = "locate"))
merged_table
other_col area area_wo_city type_col
1 1 太原市尖草坪区 尖草坪区 A
2 2 安庆市宜秀区 宜秀区 B
3 3 北京市大兴区 大兴区 C
Your new data makes it harder, because now the regex has to allow to match other characters in area
that are not contained in locate
. Therefore, I create a regex pattern for every entry in locate
that matches other characters in between. Then, I use these patterns to see in which row of tableA
it matches and join based on this information. The solution works for your provided data, but I think it is quite hacky and could break for more complex situations. Also, I assume that 桓/恒 is a typo, otherwise I don't know any solutions in R to account for it.
tableA <- data.frame(
other_col = 11:15,
area = c("太原市尖草坪区","安庆市宜秀区","北京市大兴区","桓仁满族自治县","银川市西夏区")
)
tableB <- data.frame(
type_col = c("A", "B", "C", "D", "E"),
locate = c("尖草坪区", "宜秀区", "大兴区","桓仁县","银川市西夏区")
)
library(dplyr)
library(stringr)
generate_pattern <- function(x) {
pattern <- str_extract_all(x, boundary("character"))[[1]]
pattern <- paste0(pattern, ".*")
pattern <- paste0(pattern, collapse = "")
pattern <- substr(pattern, start = 1, stop = nchar(pattern) - 2)
pattern
}
# generate the regex pattern and search in which row in tableA it matches
tableB <- tableB %>%
mutate(pattern = purrr::map_chr(locate, generate_pattern),
match_in_A = str_which(tableA$area, pattern))
tableB
type_col locate pattern match_in_A
1 A 尖草坪区 <U+00BC>.*a.*2.*Y.*<U+00C6>.*o.*<U+00C7>.*<U+00F8> 1
2 B 宜秀区 ò.*<U+00CB>.*D.*<U+00E3>.*<U+00C7>.*<U+00F8> 2
3 C 大兴区 ′.*ó.*D.*<U+00CB>.*<U+00C7>.*<U+00F8> 3
4 D 桓仁县 <U+00BB>.*<U+00B8>.*è.*ê.*<U+00CF>.*<U+00D8> 4
5 E 银川市西夏区 ò.*<U+00F8>.*′.*¨.*ê.*D.*<U+00CE>.*÷.*<U+00CF>.*<U+00C4>.*<U+00C7>.*<U+00F8> 5
# merge based on the matched row numbers
merged_table <- tableA %>%
tibble::rowid_to_column() %>%
left_join(tableB, by = c("rowid" = "match_in_A")) %>%
select(-c(rowid, pattern))
merged_table
other_col area type_col locate
1 11 太原市尖草坪区 A 尖草坪区
2 12 安庆市宜秀区 B 宜秀区
3 13 北京市大兴区 C 大兴区
4 14 桓仁满族自治县 D 桓仁县
5 15 银川市西夏区 E 银川市西夏区
Upvotes: 3