Kasia Kulma
Kasia Kulma

Reputation: 1722

R: transforming specific rows into columns

I have quite messy data imported from json file, it looks like this:

raw_df <- data.frame(text = c(paste0('text', 1:3), '---------- OUTCOME LINE ----------', paste0('text', 4:6), '---------- OUTCOME LINE ----------'),
                              demand = c('cat1', rep('', 2), 'info', 'cat2', rep('', 2), 'info2')
                     )



raw_df
                                text demand
1                              text1   cat1
2                              text2       
3                              text3       
4 ---------- OUTCOME LINE ----------   info
5                              text4   cat2
6                              text5       
7                              text6       
8 ---------- OUTCOME LINE ----------  info2

(BTW, ---------- OUTCOME LINE ---------- is the actual string that I have in text column)

I want to tidy it up so that it has the following format:

final_df
                  text demand outcome
1 text1. text2. text3.   cat1   info1
2 text4. text5. text6.   cat2   info2

What would be the quickest and most efficient way to do this? thanks for the tips.

Upvotes: 4

Views: 110

Answers (2)

leerssej
leerssej

Reputation: 14958

A dplyr & tidyr solution:

raw_df %>% 
    mutate(outcome = demand,
           demand = replace(demand, demand == '', NA),
           outcome = replace(outcome, outcome == '', NA),
           outcome = gsub("^cat\\d+", NA, outcome)) %>% 
    fill(demand) %>% 
    fill(outcome, .direction = "up") %>% 
    filter(!grepl("-----", text)) %>%
    group_by(demand, outcome) %>% 
    summarize(text = gsub(",", "\\.", toString(text))) %>% 
    select(text, everything())
  • fix up the text to display as desired, swap out blanks for NAs, and prep the outcome column.

  • fill the demand column in the default down direction, and the outcome column in an up direction.

  • filter out the ----- OUTCOME LINE ------ based upon its hyphens.

  • generate the group_concat for the text column, and then swap the default , out with ..

  • select the columns into the desired sequence.

# A tibble: 2 x 3
# Groups:   demand [2]
                 text demand outcome
                <chr> <fctr>   <chr>
1 text1. text2. text3   cat1    info
2 text4. text5. text6   cat2   info2

Upvotes: 2

akrun
akrun

Reputation: 887153

Here we create a logical index using 'grepl' based on the presence of - in the 'text' column, subset the 'raw_df' to remove those rows, create a grouping column by getting cumulative sum of 'indx', aggregate to paste the 'text' column grouped by 'demand' after replacing the '' with NA and using na.locf to fill the non-NA previous values. Then, create the 'outcome' from the 'demand' by subsetting with 'indx'

indx <- grepl("-", raw_df$text)
transform(aggregate(text~demand, transform(raw_df[!indx,], 
  demand = zoo::na.locf(replace(demand, demand=="", NA))), toString),
    outcome = raw_df$demand[indx])
#  demand                text outcome
#1   cat1 text1, text2, text3    info
#2   cat2 text4, text5, text6   info2

Or this can be done with data.table

library(data.table)
setDT(raw_df)[demand == "", demand := NA][!indx, .(text= paste(text, collapse='. ')),
          .(demand = zoo::na.locf(demand))][, outcome := raw_df$demand[indx]][]

Upvotes: 1

Related Questions