Reputation: 1722
I have quite messy data imported from json file, it looks like this:
raw_df <- data.frame(text = c(paste0('text', 1:3), '---------- OUTCOME LINE ----------', paste0('text', 4:6), '---------- OUTCOME LINE ----------'),
demand = c('cat1', rep('', 2), 'info', 'cat2', rep('', 2), 'info2')
)
raw_df
text demand
1 text1 cat1
2 text2
3 text3
4 ---------- OUTCOME LINE ---------- info
5 text4 cat2
6 text5
7 text6
8 ---------- OUTCOME LINE ---------- info2
(BTW, ---------- OUTCOME LINE ----------
is the actual string that I have in text
column)
I want to tidy it up so that it has the following format:
final_df
text demand outcome
1 text1. text2. text3. cat1 info1
2 text4. text5. text6. cat2 info2
What would be the quickest and most efficient way to do this? thanks for the tips.
Upvotes: 4
Views: 110
Reputation: 14958
A dplyr
& tidyr
solution:
raw_df %>%
mutate(outcome = demand,
demand = replace(demand, demand == '', NA),
outcome = replace(outcome, outcome == '', NA),
outcome = gsub("^cat\\d+", NA, outcome)) %>%
fill(demand) %>%
fill(outcome, .direction = "up") %>%
filter(!grepl("-----", text)) %>%
group_by(demand, outcome) %>%
summarize(text = gsub(",", "\\.", toString(text))) %>%
select(text, everything())
fix up the text to display as desired, swap out blanks for NA
s, and prep the outcome column.
fill
the demand
column in the default down direction, and the outcome column in an up direction.
filter
out the ----- OUTCOME LINE ------
based upon its hyphens.
generate the group_concat
for the text
column, and then swap the default ,
out with .
.
select
the columns into the desired sequence.
# A tibble: 2 x 3 # Groups: demand [2] text demand outcome <chr> <fctr> <chr> 1 text1. text2. text3 cat1 info 2 text4. text5. text6 cat2 info2
Upvotes: 2
Reputation: 887153
Here we create a logical index using 'grepl' based on the presence of -
in the 'text' column, subset the 'raw_df' to remove those rows, create a grouping column by getting cumulative sum of 'indx', aggregate
to paste
the 'text' column grouped by 'demand' after replacing the ''
with NA
and using na.locf
to fill the non-NA previous values. Then, create the 'outcome' from the 'demand' by subsetting with 'indx'
indx <- grepl("-", raw_df$text)
transform(aggregate(text~demand, transform(raw_df[!indx,],
demand = zoo::na.locf(replace(demand, demand=="", NA))), toString),
outcome = raw_df$demand[indx])
# demand text outcome
#1 cat1 text1, text2, text3 info
#2 cat2 text4, text5, text6 info2
Or this can be done with data.table
library(data.table)
setDT(raw_df)[demand == "", demand := NA][!indx, .(text= paste(text, collapse='. ')),
.(demand = zoo::na.locf(demand))][, outcome := raw_df$demand[indx]][]
Upvotes: 1