Reputation: 1634
I am trying to merge rows by pattern.
The dataframe has only one column (string) and normally, it should follow a pattern of date, company_name and salary. However, some cases just don't have the salary.
Is there is a way I can merge the rows by the pattern of the date? By doing so, I can later split them into columns. The reason why I didn't want to do pivot_wider earlier was that it's likely to get mismatched between the company name and salary - unbalanced rows. So I think it's better to merge the rows by the date pattern as the date is never missing and following a pattern.
dataset:
# A tibble: 10 x 1
detail
<chr>
1 26 January 2021
2 NatWest Group - Bristol, BS2 0PT
3 26 January 2021
4 NatWest Group - Manchester, M3 3AQ
5 15 February 2021
6 Brook Street - Liverpool, Merseyside, L21AB
7 £13.84 per hour
8 16 February 2021
9 Anglo Technical Recruitment - London, WC2N 5DU
10 £400.00 per day
dput for the dataset:
structure(list(detail = c("26 January 2021", "NatWest Group - Bristol, BS2 0PT",
"26 January 2021", "NatWest Group - Manchester, M3 3AQ", "15 February 2021",
"Brook Street - Liverpool, Merseyside, L21AB", "£13.84 per hour",
"16 February 2021", "Anglo Technical Recruitment - London, WC2N 5DU",
"£400.00 per day")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
Expected outcome:
detail
<chr>
1 26 January 2021 NatWest Group - Bristol, BS2 0PT
2 26 January 2021 NatWest Group - Manchester, M3 3AQ
3 15 February 2021 Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour
4 16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day
dput for expected outcome:
df <- structure(list(detail = c("26 January 2021 NatWest Group - Bristol, BS2 0PT",
"26 January 2021 NatWest Group - Manchester, M3 3AQ", "15 February 2021 Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour",
"16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day")), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
Upvotes: 0
Views: 311
Reputation: 26238
One more solution assuming only that first row contains a date. It'll work irrespective of the number of rows in between two dates..
library(tidyverse)
df %>% group_by(d = cumsum(str_detect(detail, "^(^\\d\\d? \\w+ \\d{4})$"))) %>%
mutate(c = paste0("Col", as.character(row_number()))) %>%
pivot_wider(id_cols = d, values_from = detail, names_from = c)
# A tibble: 4 x 4
# Groups: d [4]
d Col1 Col2 Col3
<int> <chr> <chr> <chr>
1 1 26 January 2021 NatWest Group - Bristol, BS2 0PT NA
2 2 26 January 2021 NatWest Group - Manchester, M3 3AQ NA
3 3 15 February 2021 Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour
4 4 16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day
Upvotes: 1
Reputation: 42592
Here is a data.table approach which uses dcast()
and rowid()
to reshape to wide format. It returns a data.table with four columns: a record number, date,
company_name, and salary.
library(data.table)
setDT(df1)[, rn := cumsum(!is.na(lubridate::dmy(detail)))]
dcast(df1, rn ~ rowid(rn, prefix = "Col"), value.var = "detail")
rn Col1 Col2 Col3 1: 1 26 January 2021 NatWest Group - Bristol, BS2 0PT <NA> 2: 2 26 January 2021 NatWest Group - Manchester, M3 3AQ <NA> 3: 3 15 February 2021 Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour 4: 4 16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day
For detecting rows which start a new record, i.e., rows with a date, this approach borrows from Anil's answer as well as from G.Grothendieck's.
dcast()
allows to pack all into a "one-liner" (if the library()
calls are not counted):
library(data.table)
library(lubridate)
dcast(setDT(df1), cumsum(!is.na(dmy(detail))) ~ rowid(cumsum(!is.na(dmy(detail))), prefix = "Col"),
value.var = "detail")
Upvotes: 0
Reputation: 27792
Here is a pure data.table
approach
library( data.table )
#make it a data.table
setDT( df )
#first, summarise by block separated by days, collapse the text, using @@ as separator
ans <- df[, .( paste0( detail, collapse = "@@") ),
by = .(d = cumsum( ( grepl( "[0-9]{2} [a-zA-Z]+ [0-9]{4}", detail) ) ) ) ]
#split text again to cols, based on te @@ introduced in the collapse/ Number of cols is dynamic!
ans[, paste0( "Col", 1:length( tstrsplit(ans$V1, "@@" ))) := tstrsplit( V1, "@@" )][, V1 := NULL ][]
# d Col1 Col2 Col3
# 1: 1 26 January 2021 NatWest Group - Bristol, BS2 0PT <NA>
# 2: 2 26 January 2021 NatWest Group - Manchester, M3 3AQ <NA>
# 3: 3 15 February 2021 Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour
# 4: 4 16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day
Upvotes: 0
Reputation: 270318
Preface each line with a tag and then use read.dcf
to create a 3 column character matrix mat
. At the end we convert that to a character vector with one element per logical record but you may just want to use mat
since that seems like a more useful format.
We assume that the dates have the %d %B %Y format (see ?strptime
for the percent codes), that salary lines start with £ and other lines are Address lines.
library(dplyr)
mat <- dat %>%
mutate(detail = case_when(
!is.na(as.Date(detail, "%d %B %Y")) ~ paste("\nDate:", detail),
grepl("^£", detail) ~ paste("Salary:", detail),
TRUE ~ paste("Address:", detail))) %>%
{ read.dcf(textConnection(.$detail)) }
mat %>%
apply(1, toString) %>%
sub(", NA$", "", .)
Simplied assumptions and code.
Upvotes: 3