Reputation: 23
I am trying to use a regex to split each observation on a space, but only when that space is preceded by one of a set of strings (LLC, SYSTEM INC, LIMITED PARTNERSHIP, SYSTEM, or SYSTEM PARTNERSHIP) and followed, somewhere later in the string, by a set string (LLC).
Data:
library(tidyverse)
data <- tibble(c(
  "600 W JAX LLC 600 WJAX LLC CT CORPORATION SYSTEM INC 600 WJAX LLC",
  "BRICK MORTAR LIMITED PARTNERSHIP BRICK & MORTAR PROPERTY, LLC C T CORPORATION SYSTEM BRICK & MORTAR PROPERTY, LLC",
  "C T CORPORATION SYSTEM JM FITZGERALD GP LLC C T CORPORATION SYSTEM PARTNERSHIP EHDOC J. MICHAEL FITZGERALD APARTMENTS LIMITED LLC"
))
What I have done:
Attempt 1:
(?<=(?= (LLC|LIMITED PARTNERSHIP|SYSTEM(?: INC| PARTNERSHIP)?))) (?=.*(LLC))
https://regex101.com/r/3j5Jzx/1
Problem: matches on the space preceding the strings
I have also attempted to shift the space in the preceding argument for attempt 1.
(?<= (?=(LLC|LIMITED PARTNERSHIP|SYSTEM(?: INC| PARTNERSHIP)?))) (?=.*(LLC))
The regex returns no matches.
https://regex101.com/r/yc73MY/1
Attempt 2:
(?:(LLC|LIMITED PARTNERSHIP|SYSTEM(?: INC| PARTNERSHIP)?)) (?=.*(LLC))
https://regex101.com/r/jIJeEw/1
Problem: the match consumes the preceding strings, so splitting on it removes them, although it does match the space following those strings.
Desired Output:
600 W JAX LLC
600 WJAX LLC
CT CORPORATION SYSTEM INC
600 WJAX LLC
BRICK MORTAR LIMITED PARTNERSHIP
BRICK & MORTAR PROPERTY, LLC
C T CORPORATION SYSTEM
BRICK & MORTAR PROPERTY, LLC
C T CORPORATION SYSTEM
JM FITZGERALD GP LLC
C T CORPORATION SYSTEM PARTNERSHIP
EHDOC J. MICHAEL FITZGERALD APARTMENTS LIMITED LLC
Any help would be much appreciated!
Upvotes: 0
Views: 60
Reputation: 2829
You can use:
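stringr::str_split(
  # Mark each split point: keep the captured string, replace the spaces with "###"
  gsub(
    "(LLC|SYSTEM INC|LIMITED PARTNERSHIP|SYSTEM|SYSTEM PARTNERSHIP) +",
    "\\1###",
    unlist(data)
  ),
  # Split on the marker
  "###"
)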
This works by matching your list of preceding strings, (LLC|SYSTEM INC|LIMITED PARTNERSHIP|SYSTEM|SYSTEM PARTNERSHIP), followed by one or more spaces ( +), and capturing the matched string in group 1. This is quite handy, as you can simply expand the list as needed.
The match is then replaced with \\1###, which keeps the preceding string, removes the spaces, and inserts a marker that does not occur in the data (###). That marker is then used to split the string in str_split().
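One caveat when expanding the list: in engines that try alternatives left to right (PCRE via perl = TRUE, and the ICU engine behind stringr), a short alternative such as SYSTEM listed before SYSTEM PARTNERSHIP would match first and split in the wrong place. As far as I know, the default gsub() engine prefers the longest match, which is why the order above does not matter there. The following is a minimal sketch of the same marker-and-split idea that does not depend on that behaviour, using base R only. The variable names (x, suffixes, pattern, marked, parts, result) are illustrative, the \\b word boundary is my own addition, and ### is assumed not to occur in the data.

```r
# Input strings from the question (each observation on a single line).
x <- c(
  "600 W JAX LLC 600 WJAX LLC CT CORPORATION SYSTEM INC 600 WJAX LLC",
  "BRICK MORTAR LIMITED PARTNERSHIP BRICK & MORTAR PROPERTY, LLC C T CORPORATION SYSTEM BRICK & MORTAR PROPERTY, LLC",
  "C T CORPORATION SYSTEM JM FITZGERALD GP LLC C T CORPORATION SYSTEM PARTNERSHIP EHDOC J. MICHAEL FITZGERALD APARTMENTS LIMITED LLC"
)

# Longer alternatives come first, so that with ordered alternation (PCRE)
# "SYSTEM PARTNERSHIP" and "SYSTEM INC" are tried before plain "SYSTEM".
suffixes <- c("SYSTEM PARTNERSHIP", "SYSTEM INC", "LIMITED PARTNERSHIP", "SYSTEM", "LLC")

# \\b keeps the suffix from matching in the middle of a longer word;
# " +" requires at least one following space, so a trailing "LLC" is left alone.
pattern <- paste0("\\b(", paste(suffixes, collapse = "|"), ") +")

# Keep the suffix (group 1) and replace the spaces after it with the marker.
marked <- gsub(pattern, "\\1###", x, perl = TRUE)

# Split on the literal marker; one list element per input string.
parts <- strsplit(marked, "###", fixed = TRUE)

# Flatten to a single character vector, one entity per element.
result <- unlist(parts)
print(result)
```

On the three example strings this should give the twelve lines listed under Desired Output. Building the pattern from a vector keeps the list easy to extend, as long as any suffix that begins with another suffix is placed before it.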
Upvotes: 0