Reputation: 37
Here is the example:
t <- 'Hui Wan (Shanghai Maritime University); Mingqiang Xu (Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*; Yingjie Xiao ( Shanghai Maritime University)'
I want the output to be the text inside the '()*' group.
In this example, that is:
Shanghai Chart Center, Donghai Navigation Safety Administration of MOT
Upvotes: 1
Views: 206
Reputation: 21400
Define a pattern that starts with ( , is followed by any characters except ( or ) (expressed as a negated character class [^)(]+ ), and is closed by )* :
library(stringr)
str_extract_all(t, "\\([^)(]+\\)\\*")
[[1]]
[1] "(Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*"
You can get rid of the list structure with unlist().
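If only the text inside the parentheses is wanted, one possible variation (a sketch, not part of the answer above) is to wrap the same character class in lookarounds, so the delimiters are required but not returned. The variable name `inside` is purely illustrative.

```r
library(stringr)

t <- 'Hui Wan (Shanghai Maritime University); Mingqiang Xu (Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*; Yingjie Xiao ( Shanghai Maritime University)'

# (?<=\\()   lookbehind: a "(" must come right before the match
# [^)(]+     the contents: one or more characters that are not parentheses
# (?=\\)\\*) lookahead: ")*" must follow, but is not part of the match
inside <- unlist(str_extract_all(t, "(?<=\\()[^)(]+(?=\\)\\*)"))
inside
#> [1] "Shanghai Chart Center, Donghai Navigation Safety Administration of MOT"
```

Because unlist() is applied directly, `inside` is a plain character vector rather than a list.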
Upvotes: 0
Reputation: 545528
To match only the contents of (…)* , the tricky part is to avoid matching two unrelated parenthetical groups (i.e. something like (…) … (…)* ). The easiest way to accomplish this is to disallow closing parentheses inside the match:
stringr::str_match_all(t, r'{\(([^)]*)\)\*}')
Do note that this will fail for nested parentheses ( ( … ( … ) … )* ). Regular expressions are fundamentally unsuited to parsing nested content, so if you need to handle such a case, regular expressions are not the appropriate tool; you’ll need to use a context-free parser (which is a lot more complicated).
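As a hedged illustration of how the result of that call can be used: str_match_all() returns a list with one character matrix per input string, where column 1 holds the full match and column 2 holds the first capture group. The names `m` and `contents` below are just for this sketch, and the raw string syntax r'{…}' assumes R 4.0.0 or later.

```r
library(stringr)

t <- 'Hui Wan (Shanghai Maritime University); Mingqiang Xu (Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*; Yingjie Xiao ( Shanghai Maritime University)'

# One matrix per input string; take the matrix for the single string in t
m <- str_match_all(t, r'{\(([^)]*)\)\*}')[[1]]

# Column 1 is the whole match including "(" and ")*";
# column 2 is the capture group, i.e. only the contents
contents <- m[, 2]
contents
#> [1] "Shanghai Chart Center, Donghai Navigation Safety Administration of MOT"
```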
Upvotes: 2
Reputation: 8484
The key here is to use the non-greedy wildcard .*? , otherwise everything between the first ( and the last ) would be caught:
library(stringr)
t <- 'Hui Wan (Shanghai Maritime University); Mingqiang Xu (Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*; Yingjie Xiao ( Shanghai Maritime University)'
str_extract_all(t, "(\\(.*?\\)\\*?)")[[1]] %>% str_subset("\\*$")
#> [1] "(Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*"
Created on 2021-03-03 by the reprex package (v1.0.0)
You can use the rev() function if you want to reverse the order and get the matches from right to left.
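This is far less elegant than I would like, but "(\\(.*?\\)\\*)" on its own does not return the shortest group: the engine still starts at the leftmost ( and the lazy quantifier only stops at the first )* it reaches, so the match spans the first two groups. That is why I made the star optional and then kept only the matches that end in a star. You can add %>% str_remove_all("\\*$") if you want to discard the star at the end of the string.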
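To see the behaviour described here, the following sketch compares the lazy pattern with a variant that forbids parentheses inside the match. The names `lazy` and `strict` are illustrative only, and the second pattern is an alternative rather than what this answer uses.

```r
library(stringr)

t <- 'Hui Wan (Shanghai Maritime University); Mingqiang Xu (Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*; Yingjie Xiao ( Shanghai Maritime University)'

# Lazy wildcard with a mandatory star: the match still begins at the
# first "(" in the string and runs on until the first ")*"
lazy <- str_extract(t, "\\(.*?\\)\\*")
lazy
#> [1] "(Shanghai Maritime University); Mingqiang Xu (Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*"

# Excluding parentheses from the contents keeps the match inside one group
strict <- str_extract(t, "\\([^()]*\\)\\*")
strict
#> [1] "(Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*"
```

A lazy quantifier shortens a match only at its right end; it does not move the starting position, which is why the first result begins at the first group.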
Upvotes: 1