Reputation: 7385

Extract number and following text and create multiple new columns in R

I have free-text data with numerous references to specific questions, and I'd like to organize it as shown below.

I'm able to create columns that note mentions of a certain topic (if the respondent references it by number), but I'd like a way to extract all of the text following the number, until another number is encountered.

Thanks in advance for any help!

library(tidyverse, warn.conflicts = F)

# Data
df <- data.frame(comment = c("topic 1: this is fine. 4 this is fine too. #9 not so good", "1 ok this is fine. 17 i do not like this idea. 25 great idea 43 cool idea"))

# I can identify the mentions if a respondent specifies the number they are responding to
df <- df %>% 
  mutate(mention = map(str_extract_all(comment, "[0-9]+"), as.numeric)) %>% 
  unnest_wider(col = mention, names_sep = "_")

# Ideal output
df_ideal <- structure(list(comment = c("topic 1: this is fine. 4 this is fine too. #9 not so good", 
"1 ok this is fine. 17 i do not like this idea. 25 great idea 43 cool idea"
), mention_1 = c(1, 1), mention_2 = c(4, 17), mention_3 = c(9, 
25), mention_4 = c(NA, 43), comment_1 = c("1: this is fine.", 
"1 ok this is fine."), comment_2 = c("4 this is fine too.", "17 i do not like this idea."
), comment_3 = c("9 not so good", "25 great idea"), comment_4 = c(NA, 
"42 nice idea")), row.names = c(NA, -2L), class = c("tbl_df", 
"tbl", "data.frame"))

^{Created on 2021-04-18 by the reprex package (v2.0.0)}

Upvotes: 1

Answers (3)

Ronak Shah

Reputation: 388817

You can continue without your extract and unnest_wider approach.

library(tidyverse)


df %>%
  mutate(mention = map(str_extract_all(comment, "[0-9]+"), as.numeric), 
         new_comment = str_extract_all(comment, '\\d+.*?(?=\\d|$)')) %>%
  unnest_wider(col = new_comment, names_sep = "_") %>%
  unnest_wider(col = mention, names_sep = "_")

#                                                                    comment
#1                 topic 1: this is fine. 4 this is fine too. #9 not so good
#2 1 ok this is fine. 17 i do not like this idea. 25 great idea 43 cool idea

#  mention_1 mention_2 mention_3 mention_4       new_comment_1
#1         1         4         9        NA   1: this is fine. 
#2         1        17        25        43 1 ok this is fine. 

#                 new_comment_2  new_comment_3 new_comment_4
#1        4 this is fine too. #  9 not so good          <NA>
#2 17 i do not like this idea.  25 great idea   43 cool idea

Upvotes: 1

Onyambu

Reputation: 79188

A base R approach is just to read the data in again:

read.table(text = gsub("(\\d+)","&\\1",df$comment), sep = "&", fill = TRUE,
           comment.char = "", header = FALSE, strip.white = TRUE, na.strings = "")[,-1]
                  V2                          V3            V4           V5
1   1: this is fine.       4 this is fine too. # 9 not so good         <NA>
2 1 ok this is fine. 17 i do not like this idea. 25 great idea 43 cool idea

Upvotes: 1

akrun

Reputation: 886938

An option with strsplit to split at one or more space (\\s+) that follows a dot (\\. - . - metacharacters are escaped), and precedes a digit or # (regex lookaround), then we loop over the output list with lapply, remove any characters that are not digits (\\D+) from the start (^) of each of the string with sub, rbind the list elements and assign the 'comment_' columns into the original dataset 'df'

df[paste0('comment_', 1:3)] <- do.call(rbind, lapply(strsplit(df$comment, 
      "(?<=\\.)\\s+(?=[0-9#])", perl = TRUE), function(x) sub("^\\D+", "", x)))

-output

df
# A tibble: 2 x 7
  comment                                                      mention_1 mention_2 mention_3 comment_1          comment_2                   comment_3    
  <chr>                                                            <dbl>     <dbl>     <dbl> <chr>              <chr>                       <chr>        
1 topic 1: this is fine. 4 this is fine too. #9 not so good            1         4         9 1: this is fine.   4 this is fine too.         9 not so good
2 1 ok this is fine. 17 i do not like this idea. 25 great idea         1        17        25 1 ok this is fine. 17 i do not like this idea. 25 great idea

Update

If the length differ (as in the updated example), we can pad NA at the end based on the max lengths from the list to make the list elements equal in length before doing the rbind

lst1 <- lapply(strsplit(df$comment, 
   "(topic \\d+)(*SKIP)(*F)|\\s+(?=[0-9#])", perl = TRUE), 
         function(x) sub("^\\D+", "", x))
mx <- max(lengths(lst1))
df[paste0('comment_', seq_len(mx))] <- do.call(rbind,
           lapply(lst1, `length<-`, mx))

-output

df
# A tibble: 2 x 9
  comment                                                             mention_1 mention_2 mention_3 mention_4 comment_1          comment_2                 comment_3    comment_4  
  <chr>                                                                   <dbl>     <dbl>     <dbl>     <dbl> <chr>              <chr>                     <chr>        <chr>      
1 topic 1: this is fine. 4 this is fine too. #9 not so good                   1         4         9        NA 1: this is fine.   4 this is fine too.       9 not so go… <NA>       
2 1 ok this is fine. 17 i do not like this idea. 25 great idea 43 co…         1        17        25        43 1 ok this is fine. 17 i do not like this id… 25 great id… 43 cool id…
>

Upvotes: 1

Extract number and following text and create multiple new columns in R

Answers (3)

Update

Related Questions