Reputation: 982
I want to split institution names from addresses in a character vector. My heuristic: the address begins at the first comma-separated field that contains a digit and runs to the end of the string; everything before that field is the name.
The raw data looks like this:
a <- c("CATHOLIC UNIV KOREA, COLL MED, DEPT LAB MED, SEOUL, SOUTH KOREA",
"UNIV ULSAN, DEPT LAB MED, COLL MED, 88 OLYMP RO 43 GIL, SEOUL 05505, SOUTH KOREA",
"UNIV ULSAN, DEPT INTERNAL MED, COLL MED, SEOUL, SOUTH KOREA",
"ASAN MED CTR, 88 OLYMP RO 43 GIL, SEOUL 05505, SOUTH KOREA",
"YONSEI UNIV, COLL MED, SEVERANCE HOSP, DEPT LAB MED, 50 YONSEI RO, SEOUL 03722, SOUTH KOREA",
"KWANGWOON UNIV, DEPT ELECT MAT ENGN, SEOUL 139701, SOUTH KOREA",
"YG 1 CO LTD, 68 CHONGCHON DONG, INCHEON 430030, SOUTH KOREA")
Desired output (name first, then address):
"CATHOLIC UNIV KOREA, COLL MED, DEPT LAB MED, SEOUL, SOUTH KOREA" ""
"UNIV ULSAN, DEPT LAB MED, COLL MED" "88 OLYMP RO 43 GIL, SEOUL 05505, SOUTH KOREA"
"UNIV ULSAN, DEPT INTERNAL MED, COLL MED, SEOUL, SOUTH KOREA" ""
"ASAN MED CTR" "88 OLYMP RO 43 GIL, SEOUL 05505, SOUTH KOREA"
"YONSEI UNIV, COLL MED, SEVERANCE HOSP, DEPT LAB MED" "50 YONSEI RO, SEOUL 03722, SOUTH KOREA"
"KWANGWOON UNIV, DEPT ELECT MAT ENGN" "SEOUL 139701, SOUTH KOREA"
"" "YG 1 CO LTD, 68 CHONGCHON DONG, INCHEON 430030, SOUTH KOREA"
EDIT 1: I've reworded my question in a more systematic way.
EDIT 2: A digit may also occur before the first comma; I have added such a value at the end of the example data.
Upvotes: 0
Views: 72
Reputation: 982
#extract left-hand side (name)
al <- trimws(sub("[[:punct:]]+$", "", sub("(^|[^0-9]+,)([^,]+[0-9].*)$", "\\1", a)))
[1] "CATHOLIC UNIV KOREA, COLL MED, DEPT LAB MED, SEOUL, SOUTH KOREA"
[2] "UNIV ULSAN, DEPT LAB MED, COLL MED"
[3] "UNIV ULSAN, DEPT INTERNAL MED, COLL MED, SEOUL, SOUTH KOREA"
[4] "ASAN MED CTR"
[5] "YONSEI UNIV, COLL MED, SEVERANCE HOSP, DEPT LAB MED"
[6] "KWANGWOON UNIV, DEPT ELECT MAT ENGN"
[7] ""
#extract right-hand side (address)
ar <- ifelse(grepl("[0-9]", a), trimws(sub("(^|[^0-9]+,)([^,]+[0-9].*)$", "\\2", a)), "")
[1] ""
[2] "88 OLYMP RO 43 GIL, SEOUL 05505, SOUTH KOREA"
[3] ""
[4] "88 OLYMP RO 43 GIL, SEOUL 05505, SOUTH KOREA"
[5] "50 YONSEI RO, SEOUL 03722, SOUTH KOREA"
[6] "SEOUL 139701, SOUTH KOREA"
[7] "YG 1 CO LTD, 68 CHONGCHON DONG, INCHEON 430030, SOUTH KOREA"
#Combine names and addresses in a dataframe
data.frame(al, ar, stringsAsFactors = FALSE)
EDIT 1: This now also works when a digit appears before the first comma, using the same regex for both the left-hand and right-hand sides.
EDIT 2: Removed the trailing comma from the left-hand side.
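For readers who find the regex hard to follow, the same heuristic can be sketched field by field: split each string on commas, find the first field containing a digit, and paste the pieces back together. This is only an illustrative alternative, not the method used above; `split_name_address` is a hypothetical helper name, and the sketch assumes that normalizing the separators to ", " is acceptable.

```r
a <- c("UNIV ULSAN, DEPT LAB MED, COLL MED, 88 OLYMP RO 43 GIL, SEOUL 05505, SOUTH KOREA",
       "UNIV ULSAN, DEPT INTERNAL MED, COLL MED, SEOUL, SOUTH KOREA",
       "YG 1 CO LTD, 68 CHONGCHON DONG, INCHEON 430030, SOUTH KOREA")

# Hypothetical helper: split each string at the first comma-separated
# field that contains a digit.
split_name_address <- function(x) {
  parts <- lapply(strsplit(x, ",\\s*"), function(fields) {
    hit <- which(grepl("[0-9]", fields))
    if (length(hit) == 0) {
      # No digit anywhere: the whole string is treated as the name
      return(c(name = paste(fields, collapse = ", "), address = ""))
    }
    first <- hit[1]
    # Fields before the first digit-bearing field form the name
    # (empty when that field is the very first one); the rest is the address
    c(name    = paste(fields[seq_len(first - 1)], collapse = ", "),
      address = paste(fields[first:length(fields)], collapse = ", "))
  })
  as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
}

res <- split_name_address(a)
```

Working on fields rather than on one long pattern makes the rule easy to adjust, for example if the address should only start at a field that begins with a digit.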
Upvotes: 1
Reputation: 5788
Base R regex solution:
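b <- within(
  data.frame(
    # lhs: strip from the first field that starts with a digit,
    # or whose second word starts with one, to the end of the string
    lhs = gsub("\\,\\s+\\d.*|\\,\\s+\\w+\\s+\\d+.*|^\\w+\\s+\\d+.*", "", a),
    stringsAsFactors = FALSE
  ), {
    rhs <- sapply(seq_along(lhs), function(i) {
      if (!grepl("\\d", a[i])) return("")   # no digit: no address
      if (lhs[i] == "") return(a[i])         # empty name: the whole string is the address
      # otherwise remove the literal name and its ", " separator
      sub(paste0(lhs[i], ", "), "", a[i], fixed = TRUE)
    })
  })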
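A split like this can be sanity-checked by rejoining the two halves and comparing them with the input. The sketch below assumes plain character vectors; `rejoins` is a hypothetical helper, and the small `lhs` and `rhs` vectors are typed in by hand from the desired output in the question rather than computed.

```r
# Hypothetical helper: TRUE where name and address rejoin to the original string
rejoins <- function(original, name, address) {
  # Only insert the ", " separator when both halves are non-empty
  sep <- ifelse(nzchar(name) & nzchar(address), ", ", "")
  paste0(name, sep, address) == original
}

a <- c("ASAN MED CTR, 88 OLYMP RO 43 GIL, SEOUL 05505, SOUTH KOREA",
       "UNIV ULSAN, DEPT INTERNAL MED, COLL MED, SEOUL, SOUTH KOREA",
       "YG 1 CO LTD, 68 CHONGCHON DONG, INCHEON 430030, SOUTH KOREA")
lhs <- c("ASAN MED CTR",
         "UNIV ULSAN, DEPT INTERNAL MED, COLL MED, SEOUL, SOUTH KOREA",
         "")
rhs <- c("88 OLYMP RO 43 GIL, SEOUL 05505, SOUTH KOREA",
         "",
         "YG 1 CO LTD, 68 CHONGCHON DONG, INCHEON 430030, SOUTH KOREA")

ok <- rejoins(a, lhs, rhs)
```

Any `FALSE` in `ok` points at a row where the pattern cut in the wrong place or dropped characters.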
Data:
a <- c("CATHOLIC UNIV KOREA, COLL MED, DEPT LAB MED, SEOUL, SOUTH KOREA",
"UNIV ULSAN, DEPT LAB MED, COLL MED, 88 OLYMP RO 43 GIL, SEOUL 05505, SOUTH KOREA",
"UNIV ULSAN, DEPT INTERNAL MED, COLL MED, SEOUL, SOUTH KOREA",
"ASAN MED CTR, 88 OLYMP RO 43 GIL, SEOUL 05505, SOUTH KOREA",
"YONSEI UNIV, COLL MED, SEVERANCE HOSP, DEPT LAB MED, 50 YONSEI RO, SEOUL 03722, SOUTH KOREA",
"KWANGWOON UNIV, DEPT ELECT MAT ENGN, SEOUL 139701, SOUTH KOREA",
"YG 1 CO LTD, 68 CHONGCHON DONG, INCHEON 430030, SOUTH KOREA")
Upvotes: 0