Reputation: 1979
I have a character variable and need to extract the value of the title="" attribute, that is, everything inside the quotes right after title=.
Here is the example dataset:
df <- data.frame(
  id = c(1,2,3),
  character = c('mrow><mn>2<mn><mi>h<mi><m title="h+r=2"><mstyle',
                'mrow><mn>2<mn><mi>h<mi><m title="r+2h=h"><mstyle&',
                'mrow><mn>2<mn><mi>h<mi><m title="h∙rleft(frac{2h}{2}right)"><mstyle>'))
> df
id character
1 1 mrow><mn>2<mn><mi>h<mi><m title="h+r=2"><mstyle
2 2 mrow><mn>2<mn><mi>h<mi><m title="r+2h=h"><mstyle&
3 3 mrow><mn>2<mn><mi>h<mi><m title="h·rleft(frac{2h}{2}right)"><mstyle>
My desired output would be:
> df
id character
1 1 h+r=2
2 2 r+2h=h
3 3 h·rleft(frac{2h}{2}right)
Upvotes: 1
Views: 837
Reputation: 1080
You can use regex101 to build and test a fitting regular expression:
https://regex101.com/r/OFJhnQ/1
Then use str_extract from stringr to obtain the value, or the extract function from tidyr:
library(tidyr)  # also provides %>%
# extract() replaces the character column with a new column named title
df %>% tidyr::extract(character, "title", regex="title=\"(.+)\"")
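The str_extract route mentioned above could look roughly like this, assuming the stringr package is installed. The lookbehind pattern is my own choice and not necessarily the one behind the regex101 link; it takes everything after title=" up to the next double quote.

```r
library(stringr)

# Same example strings as in the question
x <- c('mrow><mn>2<mn><mi>h<mi><m title="h+r=2"><mstyle',
       'mrow><mn>2<mn><mi>h<mi><m title="r+2h=h"><mstyle&',
       'mrow><mn>2<mn><mi>h<mi><m title="h∙rleft(frac{2h}{2}right)"><mstyle>')

# (?<=title=") anchors the match just after the opening quote;
# [^"]+ then runs up to, but not including, the closing quote.
titles <- str_extract(x, '(?<=title=")[^"]+')
titles
```

Strings without a title attribute come back as NA, which makes missing values easy to spot afterwards.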
Upvotes: 1
Reputation: 8844
Try this
library(dplyr)
df %>% mutate(character = sub(".+title=\"(.+)\".+", "\\1", character))
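A base R variant of the same idea, as a sketch only: using [^"]* instead of .+ stops the capture at the first closing quote, which matters if a string ever carries a second quoted attribute after title. The second test string below is made up for illustration and is not from the question.

```r
x <- c('mrow><mn>2<mn><mi>h<mi><m title="h+r=2"><mstyle',
       '<m title="r+2h=h" class="eq"><mstyle')  # hypothetical second attribute

# Pattern from the answer: the greedy .+ inside the group can run past
# the title's closing quote when another quote follows later in the string
greedy <- sub(".+title=\"(.+)\".+", "\\1", x)

# [^"]* cannot cross a quote, so the capture ends at the first closing quote
strict <- sub('.*title="([^"]*)".*', "\\1", x)
strict
```

Note that sub returns the input unchanged when the pattern does not match, so rows without a title attribute keep their original string rather than becoming NA.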
Upvotes: 1