Mako212

Reputation: 7312

R: how do I concatenate a string broken into multiple lines?

I have a data frame that looks like this:

df1 <- data.frame(Question=c("This is the start", "of a question", "This is a second", "question"), 
  Answer = c("Yes", "", "No", ""))

           Question Answer
1 This is the start    Yes
2     of a question       
3  This is a second     No
4          question       

This is dummy data, but the real data is being pulled from a PDF via tabulizer. Whenever a Question contains a line break in the source document, it gets split across multiple rows. How do I concatenate the pieces back together, given that the continuation rows have a blank Answer?

The desired result is simply:

                     Question     Answer
1 This is the start of a question    Yes
2       This is a second question     No

The logic is simply: if Answer[x] is blank, append Question[x] to Question[x-1] and remove row x.

Upvotes: 2

Views: 1369

Answers (3)

Marley

Reputation: 147

Another (very similar) approach using dplyr:

library(dplyr)

# Note: this pairs each blank-Answer row with the row directly above it,
# so it assumes every question spans exactly two rows.
df1 %>% mutate(id = cumsum(!Answer %in% c("Yes", "No")),
               Q2 = ifelse(Answer == "", paste(lag(Question), Question), ""),
               A2 = ifelse(Answer == "", as.character(lag(Answer)), "")) %>%
        filter(Q2 != "") %>%
        select(id, Question = Q2, Answer = A2)

Upvotes: 0

Abdou

Reputation: 13274

The following should do, if I follow your logic:

# test data
dff <- data.frame(Question=c("This is the start",
                             "of a question",
                             "This is a second",
                             "question",
                             "This is a third",
                             "question",
                             "and more space",
                             "yet even more space",
                             "This is actually another question"),
                  Answer = c("Yes",
                             "",
                             "No",
                             "",
                             "Yes",
                             "",
                             "",
                             "",
                             "No"),
                  stringsAsFactors = FALSE)


# solution
do.call(rbind, lapply(split(dff, cumsum(nchar(dff$Answer)>0)), function(x) {
  data.frame(Question=paste0(x$Question, collapse=" "), Answer=head(x$Answer,1))
}))


#                                                        Question Answer
# 1                             This is the start of a question    Yes
# 2                                   This is a second question     No
# 3 This is a third question and more space yet even more space    Yes
# 4                           This is actually another question     No

The idea is to apply cumsum to the expression nchar(dff$Answer) > 0. This yields a grouping vector that increments at every row with a non-empty Answer, so each question and its continuation lines share one group. Splitting the data frame on that vector with split gives one small data frame per question; within each, the Question values are pasted together and the first Answer is kept. Finally, rbind stacks the resulting one-row data frames.
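To make the grouping step concrete, here is a small sketch that builds only the grouping vector. It uses a standalone vector named answer (a hypothetical name, with the same values as dff$Answer above) so it can be run on its own:

```r
# Same values as dff$Answer; blank entries mark continuation lines
answer <- c("Yes", "", "No", "", "Yes", "", "", "", "No")

# TRUE where a new question starts, FALSE on continuation lines
starts <- nchar(answer) > 0

# The running count of starts labels each row with the index of its question
grp <- cumsum(starts)
grp
# [1] 1 1 2 2 3 3 3 3 4

# split() then produces one chunk of row indices per question
split(seq_along(answer), grp)
```

Rows that share a value in grp belong to the same question, which is what split relies on.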

I hope this helps.

Upvotes: 1

markdly

Reputation: 4534

This could no doubt be improved, but if you're happy to use the tidyverse, perhaps an approach like this could work?

library(dplyr)
library(tidyr)
library(stringr)

df1 %>% 
  mutate(id = if_else(Answer != "", row_number(), NA_integer_)) %>%
  fill(id) %>% group_by(id) %>%
  summarise(Question = str_c(Question, collapse = " "), Answer = first(Answer))

#> # A tibble: 2 x 3
#>      id                        Question Answer
#>   <int>                           <chr> <fctr>
#> 1     1 This is the start of a question    Yes
#> 2     3       This is a second question     No
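One possible tidy-up, offered as a rough sketch rather than a definitive improvement: group directly on a running count of non-blank answers and drop the helper column at the end. It assumes Answer is a character column (hence stringsAsFactors = FALSE below), and the names res and id are only illustrative:

```r
library(dplyr)

df1 <- data.frame(
  Question = c("This is the start", "of a question", "This is a second", "question"),
  Answer = c("Yes", "", "No", ""),
  stringsAsFactors = FALSE
)

res <- df1 %>%
  # id increments at every row that carries an answer
  group_by(id = cumsum(Answer != "")) %>%
  # paste the question fragments together and keep the first answer
  summarise(Question = paste(Question, collapse = " "), Answer = first(Answer)) %>%
  # the helper column is no longer needed
  select(-id)

res
```

This should also handle questions that span more than two lines, since every blank-Answer row stays in the group of the last answered row above it.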

Upvotes: 4
