Reputation: 7312
I have a data frame that looks like this:
df1 <- data.frame(Question=c("This is the start", "of a question", "This is a second", "question"),
Answer = c("Yes", "", "No", ""))
           Question Answer
1 This is the start    Yes
2     of a question
3  This is a second     No
4          question
This is dummy data; the real data is being pulled from a PDF via tabulizer. Any time there is a line break in Question in the source document, that question gets split across multiple rows. How do I concatenate the rows back together, based on the condition that Answer is blank?
The desired result is simply:
                         Question Answer
1 This is the start of a question    Yes
2       This is a second question     No
The logic is simply: if Answer[x] is blank, concatenate Question[x-1] and Question[x], then remove row x.
Upvotes: 2
Views: 1369
Reputation: 147
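Another (very similar) approach using dplyr. Note that it assumes every question is split across exactly two rows:
library(dplyr)
df1 %>%
  mutate(id = cumsum(!Answer %in% c("Yes", "No")),
         # on continuation rows, join the previous row's text to this one
         Q2 = ifelse(Answer == "", paste(lag(Question), Question), ""),
         # and take the answer from the previous row
         A2 = ifelse(Answer == "", as.character(lag(Answer)), "")) %>%
  filter(Q2 != "") %>%  # keep only the merged rows
  select(id, Question = Q2, Answer = A2)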
Upvotes: 0
Reputation: 13274
The following should do, if I follow your logic:
# test data
dff <- data.frame(Question=c("This is the start",
"of a question",
"This is a second",
"question",
"This is a third",
"question",
"and more space",
"yet even more space",
"This is actually another question"),
Answer = c("Yes",
"",
"No",
"",
"Yes",
"",
"",
"",
"No"),
stringsAsFactors = FALSE)
# solution
do.call(rbind, lapply(split(dff, cumsum(nchar(dff$Answer) > 0)), function(x) {
  data.frame(Question = paste0(x$Question, collapse = " "), Answer = head(x$Answer, 1))
}))
# Question Answer
# 1 This is the start of a question Yes
# 2 This is a second question No
# 3 This is a third question and more space yet even more space Yes
# 4 This is actually another question No
The idea is to use cumsum on the expression nchar(dff$Answer) > 0. This creates a grouping vector to use with the split function. After splitting on that grouping vector, you build one small data frame per group by concatenating the values of the Question column and taking the first value of the Answer column. Finally, you rbind the resulting data frames.
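To make the grouping step concrete, here is a minimal base R sketch on a shortened version of the test data. The names starts, grp and merged are only illustrative, and vapply over split is swapped in for the do.call(rbind, ...) call above; it is one possible way to write the same idea, not the only one.

```r
# Shortened test data: one question split over three rows, one unsplit question
dff <- data.frame(
  Question = c("This is a third", "question", "and more space",
               "This is actually another question"),
  Answer   = c("Yes", "", "", "No"),
  stringsAsFactors = FALSE
)

# TRUE on rows where a new question starts (Answer is non-empty)
starts <- nchar(dff$Answer) > 0

# Running count of starts: each row gets the number of the question it belongs to
grp <- cumsum(starts)

# Paste the Question pieces within each group and keep the Answer of the starting row
merged <- data.frame(
  Question = unname(vapply(split(dff$Question, grp), paste, character(1), collapse = " ")),
  Answer   = dff$Answer[starts],
  stringsAsFactors = FALSE
)

print(grp)
print(merged)
```

Here grp comes out as 1 1 1 2, so the first three rows collapse into one question and the last row stays on its own.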
I hope this helps.
Upvotes: 1
Reputation: 4534
This could no doubt be improved, but if you're happy to use the tidyverse, perhaps an approach like this could work?
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
  mutate(id = if_else(Answer != "", row_number(), NA_integer_)) %>%
  fill(id) %>%
  group_by(id) %>%
  summarise(Question = str_c(Question, collapse = " "), Answer = first(Answer))
#> # A tibble: 2 x 3
#> id Question Answer
#> <int> <chr> <fctr>
#> 1 1 This is the start of a question Yes
#> 2 3 This is a second question No
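The pipeline above hinges on the id column: rows with an answer get their own row number, and fill() carries that number down onto the continuation rows. A minimal sketch of just that intermediate step follows, assuming df1 is built with stringsAsFactors = FALSE (an assumption, the original uses the default); the name ids is only illustrative.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(
  Question = c("This is the start", "of a question", "This is a second", "question"),
  Answer   = c("Yes", "", "No", ""),
  stringsAsFactors = FALSE  # assumption: plain character columns
)

ids <- df1 %>%
  # rows that start a question keep their row number, continuation rows get NA
  mutate(id = if_else(Answer != "", row_number(), NA_integer_)) %>%
  # carry the last non-missing id downwards onto the NA rows
  fill(id) %>%
  pull(id)

print(ids)
```

This gives 1 1 3 3, which is why the ids in the final tibble are 1 and 3 rather than 1 and 2.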
Upvotes: 4