Shaq

Reputation: 23

regex for separating by delimiter with multiple lookahead conditions and a single lookbehind condition

I am trying to use a regex to split observations on a space delimiter, but only when that space is preceded by one of a set of strings (LLC, SYSTEM INC, LIMITED PARTNERSHIP, SYSTEM, or SYSTEM PARTNERSHIP) and followed, somewhere later in the text, by a set string (LLC).

Data:

    library(tidyverse)

    data <- tibble(c(
      "600 W JAX LLC 600 WJAX LLC CT CORPORATION SYSTEM INC 600 WJAX LLC",
      "BRICK MORTAR LIMITED PARTNERSHIP BRICK & MORTAR PROPERTY, LLC C T CORPORATION SYSTEM BRICK & MORTAR PROPERTY, LLC",
      "C T CORPORATION SYSTEM JM FITZGERALD GP LLC C T CORPORATION SYSTEM PARTNERSHIP EHDOC J. MICHAEL FITZGERALD APARTMENTS LIMITED LLC"
    ))

What I have done:

Attempt 1:

    (?<=(?= (LLC|LIMITED PARTNERSHIP|SYSTEM(?: INC| PARTNERSHIP)?))) (?=.*(LLC))

https://regex101.com/r/3j5Jzx/1

Problem: this matches the space preceding the strings rather than the space following them.

I have also tried moving the space inside the lookbehind of attempt 1:

    (?<= (?=(LLC|LIMITED PARTNERSHIP|SYSTEM(?: INC| PARTNERSHIP)?))) (?=.*(LLC))

The regex returns no matches.

https://regex101.com/r/yc73MY/1

Attempt 2:

    (?:(LLC|LIMITED PARTNERSHIP|SYSTEM(?: INC| PARTNERSHIP)?)) (?=.*(LLC))

https://regex101.com/r/jIJeEw/1

Problem: this does match the space following those strings, but it also consumes the strings themselves, so they are removed when splitting.

Desired Output:

600 W JAX LLC

600 WJAX LLC

CT CORPORATION SYSTEM INC

600 WJAX LLC

BRICK MORTAR LIMITED PARTNERSHIP

BRICK & MORTAR PROPERTY, LLC

C T CORPORATION SYSTEM

BRICK & MORTAR PROPERTY, LLC

C T CORPORATION SYSTEM

JM FITZGERALD GP LLC

C T CORPORATION SYSTEM PARTNERSHIP

EHDOC J. MICHAEL FITZGERALD APARTMENTS LIMITED LLC

Any help would be much appreciated!

Upvotes: 0

Views: 60

Answers (1)

DuesserBaest

Reputation: 2829

You can use:

    stringr::str_split(
      gsub(
        "(LLC|SYSTEM INC|LIMITED PARTNERSHIP|SYSTEM|SYSTEM PARTNERSHIP) +",
        "\\1###",
        unlist(data)
      ),
      "###"
    )

This works by matching one of your preceding strings, (LLC|SYSTEM INC|LIMITED PARTNERSHIP|SYSTEM|SYSTEM PARTNERSHIP), followed by one or more spaces ( +), and capturing the preceding string in group 1. This is quite handy, as you can simply extend the list as needed.

The match is then replaced with \\1###, which keeps the preceding string, removes the space after it, and inserts a marker that does not occur in the data (###). That marker is then used to split the string in str_split().
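To make the two steps easier to follow, here is a minimal sketch that runs them separately on the third string from the question. The variable names (x, suffixes, marked, pieces) are illustrative only, and base strsplit() with fixed = TRUE is swapped in for str_split() so the snippet has no package dependencies. As a precaution, and as an assumption on my part rather than something the data requires, the longer alternatives are listed before SYSTEM.

```r
# One observation from the question's data.
x <- "C T CORPORATION SYSTEM JM FITZGERALD GP LLC C T CORPORATION SYSTEM PARTNERSHIP EHDOC J. MICHAEL FITZGERALD APARTMENTS LIMITED LLC"

# Strings that may end a name. Longer alternatives come first so that
# "SYSTEM" cannot win over "SYSTEM PARTNERSHIP" in engines that try
# alternatives from left to right.
suffixes <- "SYSTEM PARTNERSHIP|SYSTEM INC|LIMITED PARTNERSHIP|SYSTEM|LLC"

# Step 1: keep the captured suffix (\\1) and swap the spaces that
# follow it for a marker that does not occur in the data.
marked <- gsub(paste0("(", suffixes, ") +"), "\\1###", x)

# Step 2: split on the literal marker (fixed = TRUE disables regex).
pieces <- strsplit(marked, "###", fixed = TRUE)[[1]]

print(pieces)
```

The final LLC is left alone because no space follows it, so the last name stays in one piece. If ### could ever appear in your data, pick a different marker.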

Upvotes: 0
