XinDou

Reputation: 37

Extract text between parentheses with suffix

Here is the example:

t <- 'Hui Wan (Shanghai Maritime University); Mingqiang Xu (Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*; Yingjie Xiao ( Shanghai Maritime University)'

I want the output to be the text inside '()*'. In this example that is: Shanghai Chart Center, Donghai Navigation Safety Administration of MOT

Upvotes: 1

Views: 206

Answers (3)

Chris Ruehlemann

Reputation: 21400

Define a pattern that starts with (, is followed by any characters except ( or ) (expressed as a negated character class [^)(]+), and is closed by )*:

library(stringr)
str_extract_all(t, "\\([^)(]+\\)\\*")
[[1]]
[1] "(Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*"

You can get rid of the list structure with unlist().
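The pattern above returns the parentheses and the star along with the text, whereas the question asks only for what is inside. A minimal sketch of one way to get that, assuming lookarounds are acceptable here (the variable name `inner` is illustrative only): a lookbehind for ( and a lookahead for )* keep the delimiters out of the match.

```r
library(stringr)

t <- 'Hui Wan (Shanghai Maritime University); Mingqiang Xu (Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*; Yingjie Xiao ( Shanghai Maritime University)'

# (?<=\\()   : the match must be preceded by "("
# [^)(]+     : one or more characters that are not parentheses
# (?=\\)\\*) : the match must be followed by ")*"
inner <- unlist(str_extract_all(t, "(?<=\\()[^)(]+(?=\\)\\*)"))
inner
#> [1] "Shanghai Chart Center, Donghai Navigation Safety Administration of MOT"
```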

Upvotes: 0

Konrad Rudolph

Reputation: 545528

To match only the contents of (…)*, the tricky part is to avoid matching two unrelated parenthetical groups (i.e. something like (…) … (…)*). The easiest way to accomplish this is to disallow closing parentheses inside the match:

stringr::str_match_all(t, r'{\(([^)]*)\)\*}')
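str_match_all() returns a list with one character matrix per input string: the first column holds the full match and each later column holds a capture group. A minimal sketch of pulling out just the group, assuming R 4.0 or later for the raw-string syntax (the names `m` and `inside` are illustrative only):

```r
library(stringr)

t <- 'Hui Wan (Shanghai Maritime University); Mingqiang Xu (Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*; Yingjie Xiao ( Shanghai Maritime University)'

# One matrix per input string; take the matrix for the single string in t
m <- str_match_all(t, r'{\(([^)]*)\)\*}')[[1]]

# Column 1 is the whole match "(...)*", column 2 is the text inside the parentheses
inside <- m[, 2]
inside
```

For the example string this gives the single value Shanghai Chart Center, Donghai Navigation Safety Administration of MOT.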

Do note that this will fail for nested parentheses (( … ( … ) …)*). Regular expressions are fundamentally unsuited to parsing nested content, so if you need to handle such a case you’ll need a context-free parser instead (which is a lot more complicated).
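For illustration only, here is a rough sketch of how nesting could be handled without a regular expression. It is a simple depth counter rather than a full parser, the helper name `extract_starred` is hypothetical, and it assumes the parentheses in the input are balanced.

```r
# Hypothetical helper: return the contents of every outermost "( ... )" group
# that is immediately followed by "*", allowing nested parentheses inside it.
extract_starred <- function(x) {
  chars <- strsplit(x, "")[[1]]
  depth <- 0L            # current nesting depth
  start <- NA_integer_   # position of the "(" that opened the outermost group
  out <- character(0)
  for (i in seq_along(chars)) {
    ch <- chars[i]
    if (ch == "(") {
      if (depth == 0L) start <- i
      depth <- depth + 1L
    } else if (ch == ")" && depth > 0L) {
      depth <- depth - 1L
      # The outermost group just closed: keep it only if "*" comes next
      if (depth == 0L && i < length(chars) && chars[i + 1L] == "*") {
        out <- c(out, substr(x, start + 1L, i - 1L))
      }
    }
  }
  out
}

extract_starred("A (x); B (outer (inner) text)*; C (y)")
#> [1] "outer (inner) text"
```

On the string from the question, which has no nesting, the same function returns the single starred affiliation.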

Upvotes: 2

Dan Chaltiel

Reputation: 8484

The key here is to use the non-greedy wildcard .*?; otherwise everything between the first ( and the last ) would be caught:

library(stringr)
t <- 'Hui Wan (Shanghai Maritime University); Mingqiang Xu (Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*; Yingjie Xiao ( Shanghai Maritime University)'
str_extract_all(t, "(\\(.*?\\)\\*?)")[[1]] %>% str_subset("\\*$")
#> [1] "(Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*"

Created on 2021-03-03 by the reprex package (v1.0.0)

If you want the matches in right-to-left order, wrap the result in rev().

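This is far less elegant than I would like, but "(\\(.*?\\)\\*)" does not do what I expected: the match starts at the first ( and the lazy .*? keeps extending until it reaches )*, so it swallows the earlier parenthetical group as well. That is why I match every group with an optional star and then keep only the ones ending in a star. You can add %>% str_remove_all("\\*$") if you want to drop the trailing star from the result.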
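A quick check of what the starred lazy pattern matches on the example string, followed by the filter-and-strip steps written out one at a time. This is only a sketch; the variable names `spanning`, `groups`, `starred` and `cleaned` are illustrative.

```r
library(stringr)

t <- 'Hui Wan (Shanghai Maritime University); Mingqiang Xu (Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*; Yingjie Xiao ( Shanghai Maritime University)'

# Laziness does not move the starting point: the match begins at the first "("
# and grows until ")*" is found, so it spans two parenthetical groups.
spanning <- str_extract(t, "\\(.*?\\)\\*")
spanning
#> [1] "(Shanghai Maritime University); Mingqiang Xu (Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*"

# Match every group with an optional star, keep the starred ones, drop the star
groups  <- str_extract_all(t, "\\(.*?\\)\\*?")[[1]]
starred <- str_subset(groups, "\\*$")
cleaned <- str_remove_all(starred, "\\*$")
cleaned
#> [1] "(Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)"
```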

Upvotes: 1
