codedancer
codedancer

Reputation: 1634

Merge rows by pattern in R

I am trying to merge rows by pattern.

The dataframe has only one column (string) and normally, it should follow a pattern of date, company_name and salary. However, some cases just don't have the salary.

Is there is a way I can merge the rows by the pattern of the date? By doing so, I can later split them into columns. The reason why I didn't want to do pivot_wider earlier was that it's likely to get mismatched between the company name and salary - unbalanced rows. So I think it's better to merge the rows by the date pattern as the date is never missing and following a pattern.

dataset:

# A tibble: 10 x 1
   detail                                        
   <chr>                                         
 1 26 January 2021                               
 2 NatWest Group - Bristol, BS2 0PT              
 3 26 January 2021                               
 4 NatWest Group - Manchester, M3 3AQ            
 5 15 February 2021                              
 6 Brook Street - Liverpool, Merseyside, L21AB   
 7 £13.84 per hour                               
 8 16 February 2021                              
 9 Anglo Technical Recruitment - London, WC2N 5DU
10 £400.00 per day   

dput for the dataset:

structure(list(detail = c("26 January 2021", "NatWest Group - Bristol, BS2 0PT", 
"26 January 2021", "NatWest Group - Manchester, M3 3AQ", "15 February 2021", 
"Brook Street - Liverpool, Merseyside, L21AB", "£13.84 per hour", 
"16 February 2021", "Anglo Technical Recruitment - London, WC2N 5DU", 
"£400.00 per day")), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame"))

Expected outcome:

 detail                                                                         
 <chr>                                                                          
1 26 January 2021 NatWest Group - Bristol, BS2 0PT                               
2 26 January 2021 NatWest Group - Manchester, M3 3AQ                             
3 15 February 2021 Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour   
4 16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day

dput for expected outcome:

df <- structure(list(detail = c("26 January 2021 NatWest Group - Bristol, BS2 0PT", 
                          "26 January 2021 NatWest Group - Manchester, M3 3AQ", "15 February 2021 Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour", 
                          "16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day")), row.names = c(NA, -4L), class = c("tbl_df", 
                                                                                  "tbl", "data.frame"))

Upvotes: 0

Views: 311

Answers (4)

AnilGoyal
AnilGoyal

Reputation: 26238

One more solution assuming only that first row contains a date. It'll work irrespective of the number of rows in between two dates..

library(tidyverse)
df %>% group_by(d = cumsum(str_detect(detail, "^(^\\d\\d? \\w+ \\d{4})$"))) %>%
       mutate(c = paste0("Col", as.character(row_number()))) %>%
  pivot_wider(id_cols = d, values_from = detail, names_from = c)

# A tibble: 4 x 4
# Groups:   d [4]
      d Col1             Col2                                           Col3           
  <int> <chr>            <chr>                                          <chr>          
1     1 26 January 2021  NatWest Group - Bristol, BS2 0PT               NA             
2     2 26 January 2021  NatWest Group - Manchester, M3 3AQ             NA             
3     3 15 February 2021 Brook Street - Liverpool, Merseyside, L21AB    £13.84 per hour
4     4 16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day

Upvotes: 1

Uwe
Uwe

Reputation: 42592

Here is a data.table approach which uses dcast() and rowid() to reshape to wide format. It returns a data.table with four columns: a record number, date, company_name, and salary.

library(data.table)
setDT(df1)[, rn := cumsum(!is.na(lubridate::dmy(detail)))]
dcast(df1, rn ~ rowid(rn, prefix = "Col"), value.var = "detail")
   rn             Col1                                           Col2            Col3
1:  1  26 January 2021               NatWest Group - Bristol, BS2 0PT            <NA>
2:  2  26 January 2021             NatWest Group - Manchester, M3 3AQ            <NA>
3:  3 15 February 2021    Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour
4:  4 16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day

For detecting rows which start a new record, i.e., rows with a date, this approach borrows from Anil's answer as well as from G.Grothendieck's.

dcast() allows to pack all into a "one-liner" (if the library() calls are not counted):

library(data.table)
library(lubridate)
dcast(setDT(df1), cumsum(!is.na(dmy(detail))) ~ rowid(cumsum(!is.na(dmy(detail))), prefix = "Col"), 
      value.var = "detail")

Upvotes: 0

Wimpel
Wimpel

Reputation: 27792

Here is a pure data.table approach

library( data.table )
#make it a data.table
setDT( df )
#first, summarise by block separated by days, collapse the text, using @@ as separator
ans <- df[, .( paste0( detail, collapse = "@@") ), 
          by = .(d = cumsum( ( grepl( "[0-9]{2} [a-zA-Z]+ [0-9]{4}", detail) ) ) ) ]
#split text again to cols, based on te @@ introduced in the collapse/ Number of cols is dynamic!
ans[, paste0( "Col", 1:length( tstrsplit(ans$V1, "@@" ))) := tstrsplit( V1, "@@" )][, V1 := NULL ][]

#    d             Col1                                           Col2            Col3
# 1: 1  26 January 2021               NatWest Group - Bristol, BS2 0PT            <NA>
# 2: 2  26 January 2021             NatWest Group - Manchester, M3 3AQ            <NA>
# 3: 3 15 February 2021    Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour
# 4: 4 16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day

Upvotes: 0

G. Grothendieck
G. Grothendieck

Reputation: 270318

Preface each line with a tag and then use read.dcf to create a 3 column character matrix mat. At the end we convert that to a character vector with one element per logical record but you may just want to use mat since that seems like a more useful format.

We assume that the dates have the %d %B %Y format (see ?strptime for the percent codes), that salary lines start with £ and other lines are Address lines.

library(dplyr)

mat <- dat %>%
  mutate(detail = case_when(
    !is.na(as.Date(detail, "%d %B %Y")) ~ paste("\nDate:", detail), 
    grepl("^£", detail) ~ paste("Salary:", detail),
    TRUE ~ paste("Address:", detail))) %>%
  { read.dcf(textConnection(.$detail)) }

mat %>%
  apply(1, toString) %>%
  sub(", NA$", "", .)

Update

Simplied assumptions and code.

Upvotes: 3

Related Questions