Reputation: 87

Extracting multiple substrings that come after certain characters in a string using stringi in R

I have a large data frame in R with a column like the one below, where each row holds one sentence:

data <- data.frame(
   datalist = c("anarchism is a wiki/political_philosophy that advocates wiki/self-governance societies based on voluntary institutions",
   "these are often described as wiki/stateless_society although several authors have defined them more specifically as institutions based on non- wiki/hierarchy or wiki/free_association_(communism_and_anarchism)",
   "anarchism holds the wiki/state_(polity) to be undesirable unnecessary and harmful",
   "while wiki/anti-statism is central anarchism specifically entails opposing authority or hierarchical organisation in the conduct of all human relations"),
   stringsAsFactors=FALSE)

I want to extract all the words that come after "wiki/" and put them in another column.

For the first row the result should be "political_philosophy self-governance", for the second "stateless_society hierarchy free_association_(communism_and_anarchism)", for the third "state_(polity)", and for the fourth "anti-statism".

I definitely want to use stringi because it's a huge dataframe. Thanks in advance for any help.

I've tried

stri_extract_all_fixed(data$datalist, "wiki")[[1]]

but that only extracts the literal string "wiki", not the word that follows it.

Upvotes: 3

Answers (3)

Wiktor Stribiżew

Reputation: 627468

You may use

> trimws(gsub("wiki/(\\S+)|(?:(?!wiki/\\S).)+", " \\1", data$datalist, perl=TRUE))
[1] "political_philosophy  self-governance" 
[2] "stateless_society  hierarchy  free_association_(communism_and_anarchism)"
[3] "state_(polity)"                                                           
[4] "anti-statism"

See the online R code demo.

Details

wiki/(\\S+) - matches wiki/ and captures 1+ non-whitespace chars into Group 1
| - or
(?:(?!wiki/\\S).)+ - a tempered greedy token that matches 1+ chars other than line break chars, consuming a char only if it does not start a wiki/ + non-whitespace char sequence.

If you need to get rid of redundant whitespace inside the result you may use another call to gsub:

> gsub("^\\s+|\\s+$|\\s+(\\s)", "\\1", gsub("wiki/(\\S+)|(?:(?!wiki/\\S).)+", " \\1", data$datalist, perl=TRUE))
[1] "political_philosophy self-governance"                                   
[2] "stateless_society hierarchy free_association_(communism_and_anarchism)"
[3] "state_(polity)"                                                         
[4] "anti-statism"

Upvotes: 1

divibisan

Reputation: 12165

You can do this with a regex. By using stri_match_ instead of stri_extract_ we can use parentheses to create capture groups, which let us extract only part of the regex match. In the result below, you can see that each input string gives a list item containing a character matrix, with the whole match in the first column and each capture group in the following columns:

library(stringi)
match <- stri_match_all_regex(data$datalist, "wiki/([\\w-()]*)")
match

[[1]]
     [,1]                        [,2]                  
[1,] "wiki/political_philosophy" "political_philosophy"
[2,] "wiki/self-governance"      "self-governance"     

[[2]]
     [,1]                                              [,2]                                        
[1,] "wiki/stateless_society"                          "stateless_society"                         
[2,] "wiki/hierarchy"                                  "hierarchy"                                 
[3,] "wiki/free_association_(communism_and_anarchism)" "free_association_(communism_and_anarchism)"

[[3]]
     [,1]                  [,2]            
[1,] "wiki/state_(polity)" "state_(polity)"

[[4]]
     [,1]                [,2]          
[1,] "wiki/anti-statism" "anti-statism"

You can then use apply functions to make the data into any form you want:

match <- stri_match_all_regex(data$datalist, "wiki/([\\w-()]*)")
sapply(match, function(x) paste(x[,2], collapse = " "))

[1] "political_philosophy self-governance"                                  
[2] "stateless_society hierarchy free_association_(communism_and_anarchism)"
[3] "state_(polity)"                                                        
[4] "anti-statism"
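
One caveat, sketched under the assumption that some rows may contain no wiki/ link at all: by default stri_match_all_regex returns a one-row matrix of NA for such a row, so paste() would yield the string "NA". The sample sentences, the simplified pattern "wiki/(\\S+)", the helper name collapse_group and the column name wiki_terms are all made up for illustration.

```r
library(stringi)

# Made-up sample; the second sentence deliberately has no "wiki/" link
data <- data.frame(
  datalist = c("a wiki/state_(polity) and a wiki/anti-statism link",
               "no links in this sentence"),
  stringsAsFactors = FALSE
)

match <- stri_match_all_regex(data$datalist, "wiki/(\\S+)")

# Hypothetical helper: join the capture-group column, skipping NA rows
collapse_group <- function(m) paste(m[!is.na(m[, 2]), 2], collapse = " ")

# vapply guarantees exactly one string per row, so the result fits a column
data$wiki_terms <- vapply(match, collapse_group, character(1))
data$wiki_terms
```

With this, a row without any link should end up as an empty string instead of "NA".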

Upvotes: 3

D Pinto

Reputation: 901

You can use a lookbehind in the regex.

library(dplyr)
library(stringi)

text <- c("anarchism is a wiki/political_philosophy that advocates wiki/self-governance societies based on voluntary institutions",
 "these are often described as wiki/stateless_society although several authors have defined them more specifically as institutions based on non- wiki/hierarchy or wiki/free_association_(communism_and_anarchism)",
 "anarchism holds the wiki/state_(polity) to be undesirable unnecessary and harmful",                                                                                                                               
 "while wiki/anti-statism is central anarchism specifically entails opposing authority or hierarchical organisation in the conduct of all human relations")

df <- data.frame(text, stringsAsFactors = FALSE)

df %>% 
  mutate(words = stri_extract_all(text, regex = "(?<=wiki\\/)\\S+"))
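
Note that stri_extract_all returns a list, so words ends up as a list column rather than the single space-separated string the question asks for. Here is a minimal sketch of one way to collapse it, written in base R rather than dplyr. The sentences are shortened versions of the ones in the question, and the column name words_flat is an assumption for illustration.

```r
library(stringi)

text <- c("anarchism is a wiki/political_philosophy that advocates wiki/self-governance societies",
          "anarchism holds the wiki/state_(polity) to be undesirable")
df <- data.frame(text, stringsAsFactors = FALSE)

# One character vector of matches per row (returned as a list)
words <- stri_extract_all_regex(df$text, "(?<=wiki/)\\S+")

# Join each row's matches with a space; stri_join is stringi's paste0-like function
df$words_flat <- vapply(words, stri_join, character(1), collapse = " ")
df$words_flat
```

Rows with no match come back as NA from stri_extract_all_regex, so they would need separate handling if they can occur in the real data.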

Upvotes: 2
