Reputation: 7385
I have free-text data with numerous references to specific questions, and I'd like to organize it as shown below.
I'm able to create columns that note mentions of a certain topic (if the respondent references it by number), but I'd like a way to extract all of the text following the number, until another number is encountered.
Thanks in advance for any help!
library(tidyverse, warn.conflicts = F)
# Data
df <- data.frame(comment = c("topic 1: this is fine. 4 this is fine too. #9 not so good", "1 ok this is fine. 17 i do not like this idea. 25 great idea 43 cool idea"))
# I can identify the mentions if a respondent specifies the number they are responding to
df <- df %>%
mutate(mention = map(str_extract_all(comment, "[0-9]+"), as.numeric)) %>%
unnest_wider(col = mention, names_sep = "_")
# Ideal output
df_ideal <- structure(list(comment = c("topic 1: this is fine. 4 this is fine too. #9 not so good",
"1 ok this is fine. 17 i do not like this idea. 25 great idea 43 cool idea"
), mention_1 = c(1, 1), mention_2 = c(4, 17), mention_3 = c(9,
25), mention_4 = c(NA, 43), comment_1 = c("1: this is fine.",
"1 ok this is fine."), comment_2 = c("4 this is fine too.", "17 i do not like this idea."
), comment_3 = c("9 not so good", "25 great idea"), comment_4 = c(NA,
"42 nice idea")), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"))
Created on 2021-04-18 by the reprex package (v2.0.0)
Upvotes: 1
Views: 92
Reputation: 388817
You can continue without your extract and unnest_wider
approach.
library(tidyverse)
df %>%
mutate(mention = map(str_extract_all(comment, "[0-9]+"), as.numeric),
new_comment = str_extract_all(comment, '\\d+.*?(?=\\d|$)')) %>%
unnest_wider(col = new_comment, names_sep = "_") %>%
unnest_wider(col = mention, names_sep = "_")
# comment
#1 topic 1: this is fine. 4 this is fine too. #9 not so good
#2 1 ok this is fine. 17 i do not like this idea. 25 great idea 43 cool idea
# mention_1 mention_2 mention_3 mention_4 new_comment_1
#1 1 4 9 NA 1: this is fine.
#2 1 17 25 43 1 ok this is fine.
# new_comment_2 new_comment_3 new_comment_4
#1 4 this is fine too. # 9 not so good <NA>
#2 17 i do not like this idea. 25 great idea 43 cool idea
Upvotes: 1
Reputation: 79188
A base R approach is just to read the data in again:
read.table(text = gsub("(\\d+)","&\\1",df$comment), sep = "&", fill = TRUE,
comment.char = "", header = FALSE, strip.white = TRUE, na.strings = "")[,-1]
V2 V3 V4 V5
1 1: this is fine. 4 this is fine too. # 9 not so good <NA>
2 1 ok this is fine. 17 i do not like this idea. 25 great idea 43 cool idea
Upvotes: 1
Reputation: 886938
An option with strsplit
to split at one or more space (\\s+
) that follows a dot (\\.
- .
- metacharacters are escaped), and precedes a digit or #
(regex lookaround), then we loop over the output list
with lapply
, remove any characters that are not digits (\\D+
) from the start (^
) of each of the string with sub
, rbind
the list
elements and assign the 'comment_' columns into the original dataset 'df'
df[paste0('comment_', 1:3)] <- do.call(rbind, lapply(strsplit(df$comment,
"(?<=\\.)\\s+(?=[0-9#])", perl = TRUE), function(x) sub("^\\D+", "", x)))
-output
df
# A tibble: 2 x 7
comment mention_1 mention_2 mention_3 comment_1 comment_2 comment_3
<chr> <dbl> <dbl> <dbl> <chr> <chr> <chr>
1 topic 1: this is fine. 4 this is fine too. #9 not so good 1 4 9 1: this is fine. 4 this is fine too. 9 not so good
2 1 ok this is fine. 17 i do not like this idea. 25 great idea 1 17 25 1 ok this is fine. 17 i do not like this idea. 25 great idea
If the length
differ (as in the updated example), we can pad NA
at the end based on the max
lengths
from the list
to make the list
elements equal in length
before doing the rbind
lst1 <- lapply(strsplit(df$comment,
"(topic \\d+)(*SKIP)(*F)|\\s+(?=[0-9#])", perl = TRUE),
function(x) sub("^\\D+", "", x))
mx <- max(lengths(lst1))
df[paste0('comment_', seq_len(mx))] <- do.call(rbind,
lapply(lst1, `length<-`, mx))
-output
df
# A tibble: 2 x 9
comment mention_1 mention_2 mention_3 mention_4 comment_1 comment_2 comment_3 comment_4
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr>
1 topic 1: this is fine. 4 this is fine too. #9 not so good 1 4 9 NA 1: this is fine. 4 this is fine too. 9 not so go… <NA>
2 1 ok this is fine. 17 i do not like this idea. 25 great idea 43 co… 1 17 25 43 1 ok this is fine. 17 i do not like this id… 25 great id… 43 cool id…
>
Upvotes: 1