Reputation: 174
My problem looks like this:
data_example <-
c("Creditshelf Aktiengesellschaft / Key word(s): Forecast/Development of Sales\n\ncreditshelf Aktiengesellschaft",
"Swiss Life Holding AG / Key word(s): 9 Month figures\n\nSwiss Life increases fee income by 13%",
"tonies SE / Key word(s): Capital Increase\n\ntonies SE: tonies successfully places 12,000,000 new class A shares",
"init innovation in traffic systems SE / Key word(s): Contract/Incoming Orders\n\ninit innovation in traffic systems SEs")
strings_to_extract <-
c("Key word(s): Word1/Word2",
"Key word(s): Word1/Word2 Word3",
"Key word(s): Word1 Word2 Word3",
"Key word(s): Word1/Word2/Word3",
"Key word(s): Number Word1/Word2",
"Key word(s): Number Word1 Word2",
"Key word(s): Word1 Number Word2")
The words are always separated by a whitespace or a "/". My attempt looks like this:
str_extract(data, "Key word[[:punct:]]{1}s[[:punct:]]{2} [[:alpha:]]{1,}|Key word[[:punct:]]{1}s[[:punct:]]{2} [[:alpha:]]{1,}[[:punct:]]{1,}[[:alpha:]]{1,}Key word[[:punct:]]{1}s[[:punct:]]{2} [[:alpha:]]{1,}[[:punct:]]{1,}[[:alpha:]]{1,}[[:punct:]]{1,}[[:alpha:]]{1,}")
This captures a good part of them, but I think it's too complicated. Could somebody give me advice on how to do it better?
Thanks and kind regards
Upvotes: 0
Views: 51
Reputation: 71
If you don't want to include the phrase "Key word(s): " in the result, you can do the following:
data_example <-
c("Creditshelf Aktiengesellschaft / Key word(s): Forecast/Development of Sales\n\ncreditshelf Aktiengesellschaft",
"Swiss Life Holding AG / Key word(s): 9 Month figures\n\nSwiss Life increases fee income by 13%",
"tonies SE / Key word(s): Capital Increase\n\ntonies SE: tonies successfully places 12,000,000 new class A shares",
"init innovation in traffic systems SE / Key word(s): Contract/Incoming Orders\n\ninit innovation in traffic systems SEs")
stringr::str_extract(string = data_example,
pattern = '(?<=Key word\\(s\\): )[\\s\\S]+')
#> [1] "Forecast/Development of Sales\n\ncreditshelf Aktiengesellschaft"
#> [2] "9 Month figures\n\nSwiss Life increases fee income by 13%"
#> [3] "Capital Increase\n\ntonies SE: tonies successfully places 12,000,000 new class A shares"
#> [4] "Contract/Incoming Orders\n\ninit innovation in traffic systems SEs"
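If only the keywords are wanted, without the text after the line break, one possible variant (a sketch, not part of the answer above) relies on the fact that with stringr's default regex options . does not match a newline, so .+ stops at the first \n. The variable name keywords is only for illustration.

```r
library(stringr)

# Two of the sample strings from the question
data_example <- c(
  "Swiss Life Holding AG / Key word(s): 9 Month figures\n\nSwiss Life increases fee income by 13%",
  "tonies SE / Key word(s): Capital Increase\n\ntonies SE: tonies successfully places 12,000,000 new class A shares"
)

# The lookbehind drops the "Key word(s): " prefix;
# ".+" stops at the first newline because "." does not match "\n" by default
keywords <- str_extract(data_example, "(?<=Key word\\(s\\): ).+")
keywords
```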
Upvotes: 0
Reputation: 3230
Your example data makes a different approach suitable as well, as your keywords always end at \n.
In this case you could just do:
data_example <-
c("Creditshelf Aktiengesellschaft / Key word(s): Forecast/Development of Sales\n\ncreditshelf Aktiengesellschaft",
"Swiss Life Holding AG / Key word(s): 9 Month figures\n\nSwiss Life increases fee income by 13%",
"tonies SE / Key word(s): Capital Increase\n\ntonies SE: tonies successfully places 12,000,000 new class A shares",
"init innovation in traffic systems SE / Key word(s): Contract/Incoming Orders\n\ninit innovation in traffic systems SEs")
stringr::str_extract(data_example, "Key word\\(s\\):.+(?=\\n)")
#> [1] "Key word(s): Forecast/Development of Sales"
#> [2] "Key word(s): 9 Month figures"
#> [3] "Key word(s): Capital Increase"
#> [4] "Key word(s): Contract/Incoming Orders"
Key word\\(s\\): matches Key word(s): and .+(?=\\n) matches all characters (.+) that are followed by \n, which is what the lookahead (?=\\n) asserts. Notice the double escapes (\\), which are needed in R.
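As a small illustration of what the lookahead adds, the sketch below compares the pattern with and without (?=\\n). The two input strings are made up for this example: one has a trailing line break, the other does not. Because the lookahead requires a newline right after the match, the first pattern should return NA for the string without one, while the plain .+ version still matches.

```r
library(stringr)

# Hypothetical inputs: the first has a line break after the keywords, the second does not
x <- c(
  "tonies SE / Key word(s): Capital Increase\n\ntonies SE: more text",
  "tonies SE / Key word(s): Capital Increase"
)

# Requires a newline directly after the matched keywords
with_lookahead <- str_extract(x, "Key word\\(s\\):.+(?=\\n)")

# Relies only on "." not matching a newline by default
without_lookahead <- str_extract(x, "Key word\\(s\\):.+")

with_lookahead
without_lookahead
```

Which of the two is preferable depends on whether a missing line break should count as "no match" in your data.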
Upvotes: 1
Reputation: 627488
You can use
str_extract(data, "Key word\\(s\\):\\s*\\w+(?:\\W+\\w+){1,2}")
See the regex demo.
Details:
Key word\(s\): - the literal text Key word(s):
\s* - zero or more whitespaces
\w+ - one or more word chars
(?:\W+\w+){1,2} - one or two sequences of one or more non-word chars followed by one or more word chars.
Upvotes: 3