Reputation: 87
I have a large data frame in R with a column that looks like this, where each sentence is a row:
data <- data.frame(
datalist = c("anarchism is a wiki/political_philosophy that advocates wiki/self-governance societies based on voluntary institutions",
"these are often described as wiki/stateless_society although several authors have defined them more specifically as institutions based on non- wiki/hierarchy or wiki/free_association_(communism_and_anarchism)",
"anarchism holds the wiki/state_(polity) to be undesirable unnecessary and harmful",
"while wiki/anti-statism is central anarchism specifically entails opposing authority or hierarchical organisation in the conduct of all human relations"),
stringsAsFactors=FALSE)
I want to extract all the words that come after "wiki/" and put them in another column.
So the first row should come out as "political_philosophy self-governance". The second row should look like "stateless_society hierarchy free_association_(communism_and_anarchism)". The third row should be "state_(polity)", and the fourth row should be "anti-statism".
I definitely want to use stringi because it's a huge dataframe. Thanks in advance for any help.
I've tried
stri_extract_all_fixed(data$datalist, "wiki")[[1]]
but that just extracts the word "wiki" itself.
Upvotes: 3
Views: 140
Reputation: 627468
You may use
> trimws(gsub("wiki/(\\S+)|(?:(?!wiki/\\S).)+", " \\1", data$datalist, perl=TRUE))
[1] "political_philosophy self-governance"
[2] "stateless_society hierarchy free_association_(communism_and_anarchism)"
[3] "state_(polity)"
[4] "anti-statism"
See the online R code demo.
Details
- wiki/(\\S+) - matches wiki/ and captures 1+ non-whitespace chars into Group 1
- | - or
- (?:(?!wiki/\\S).)+ - a tempered greedy token that matches any char other than a line break char, 1+ occurrences, as long as that char does not start a wiki/ + non-whitespace char sequence.

If you need to get rid of redundant whitespace inside the result, you may use another call to gsub:
> gsub("^\\s+|\\s+$|\\s+(\\s)", "\\1", gsub("wiki/(\\S+)|(?:(?!wiki/\\S).)+", " \\1", data$datalist, perl=TRUE))
[1] "political_philosophy self-governance"
[2] "stateless_society hierarchy free_association_(communism_and_anarchism)"
[3] "state_(polity)"
[4] "anti-statism"
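Since the question asks for the result in another column, here is a minimal sketch of assigning the cleaned output back to the data frame. The column name wiki_terms and the shortened sample rows are assumptions made for illustration, and the whitespace cleanup uses a simpler gsub than the one above:

```r
# Shortened sample data (assumed for illustration)
data <- data.frame(
  datalist = c(
    "anarchism is a wiki/political_philosophy that advocates wiki/self-governance societies",
    "anarchism holds the wiki/state_(polity) to be undesirable"
  ),
  stringsAsFactors = FALSE
)

# Keep only the text captured after "wiki/"; every other stretch becomes a space
raw <- gsub("wiki/(\\S+)|(?:(?!wiki/\\S).)+", " \\1", data$datalist, perl = TRUE)

# Collapse runs of whitespace to one space, then trim both ends
data$wiki_terms <- trimws(gsub("\\s+", " ", raw))
```

A row that contains no wiki/ link ends up as an empty string with this approach.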
Upvotes: 1
Reputation: 12165
You can do this with a regex. By using stri_match_ instead of stri_extract_, we can use parentheses to make capture groups that let us extract only part of the regex match. In the result below, you can see that each row of df (the question's data frame, named data there) gives a list item containing a character matrix with the whole match in the first column and each capture group in the following columns:
match <- stri_match_all_regex(df$datalist, "wiki/([\\w-()]*)")
match
[[1]]
[,1] [,2]
[1,] "wiki/political_philosophy" "political_philosophy"
[2,] "wiki/self-governance" "self-governance"
[[2]]
[,1] [,2]
[1,] "wiki/stateless_society" "stateless_society"
[2,] "wiki/hierarchy" "hierarchy"
[3,] "wiki/free_association_(communism_and_anarchism)" "free_association_(communism_and_anarchism)"
[[3]]
[,1] [,2]
[1,] "wiki/state_(polity)" "state_(polity)"
[[4]]
[,1] [,2]
[1,] "wiki/anti-statism" "anti-statism"
You can then use apply functions to make the data into any form you want:
match <- stri_match_all_regex(df$datalist, "wiki/([\\w-()]*)")
sapply(match, function(x) paste(x[,2], collapse = " "))
[1] "political_philosophy self-governance"
[2] "stateless_society hierarchy free_association_(communism_and_anarchism)"
[3] "state_(polity)"
[4] "anti-statism"
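One thing to watch for in a large data frame is rows with no wiki/ link at all. By default stri_match_all_regex returns a row of NA for them, which paste turns into the string "NA". A hedged sketch of one way around that, using the omit_no_match argument; the sample vector x and the simpler pattern wiki/(\\S+) are assumptions for illustration:

```r
library(stringi)

# Second element deliberately has no "wiki/" link (sample data, assumed)
x <- c("see wiki/state_(polity) and wiki/anti-statism", "no links in this row")

# omit_no_match = TRUE yields a zero-row matrix for rows without a match
m <- stri_match_all_regex(x, "wiki/(\\S+)", omit_no_match = TRUE)

# Column 2 holds the capture group; pasting a zero-length vector gives ""
terms <- vapply(m, function(mat) paste(mat[, 2], collapse = " "), character(1))
```

Using vapply instead of sapply guarantees a character vector of the same length as the input, which is convenient when assigning it to a new column.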
Upvotes: 3
Reputation: 901
You can use a lookbehind in the regex.
library(dplyr)
library(stringi)
text <- c("anarchism is a wiki/political_philosophy that advocates wiki/self-governance societies based on voluntary institutions",
"these are often described as wiki/stateless_society although several authors have defined them more specifically as institutions based on non- wiki/hierarchy or wiki/free_association_(communism_and_anarchism)",
"anarchism holds the wiki/state_(polity) to be undesirable unnecessary and harmful",
"while wiki/anti-statism is central anarchism specifically entails opposing authority or hierarchical organisation in the conduct of all human relations")
df <- data.frame(text, stringsAsFactors = FALSE)
df %>%
  mutate(words = stri_extract_all(text, regex = "(?<=wiki\\/)\\S+"))
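Note that stri_extract_all returns a list, so words above is a list column. If a single space-separated string per row is wanted, as in the question, the matches can be collapsed. This is a sketch under that assumption, written without dplyr and with shortened sample text:

```r
library(stringi)

# Shortened sample data (assumed for illustration)
text <- c("anarchism is a wiki/political_philosophy that advocates wiki/self-governance societies",
          "while wiki/anti-statism is central")
df <- data.frame(text, stringsAsFactors = FALSE)

# The lookbehind keeps "wiki/" out of the match; the result is a list of character vectors
found <- stri_extract_all_regex(df$text, "(?<=wiki/)\\S+")

# Join each row's matches into one space-separated string
df$words <- vapply(found, stri_join, character(1), collapse = " ")
```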
Upvotes: 2