R - Regular Expressions (Regex) with a list of Data Frames (only first match)

Question

So, I'm the happy owner of a 17246 list of data frames and need to extract 3 data from each of them:

To whom the job was given.
The standard code that describes what kind of job it is (Ex. "00" inside this "12-00.07").
The date on which it was assigned.

Each data frame contains data about just one worker. But the data is inputted differently: It always starts by the regular expression “Worker:” + “Name or number identification”.

So, I can find the data with a regular expression that targets “Worker:”

I can also target the first regular expression that represents a date: “dd/dd/dd”

The desired output is a df with 3 columns (“Worker”, “Code”, “Date”) and then unite all dfs into one. In order to achieve this end, I find myself with three problems:

a) The information is presented in no order (cannot subset specific rows).

b) The intended worker and code are a substring inside other characters.

c) More then one date is presented on each df and I only desire the first match. All other dates are misleading.

The input is this:

v1 <- c("Worker: Joseph", "06/01/21", "12-00.07", "06/19/21", "useless", "06-11.85")
v2 <- c("useless","99-08-70", "Worker: 3rd", "05/01/21", "useless", "25-57.99", "07/01/21")


df1 <- data.frame(text = v1)
df2 <- data.frame(text = v2)

PDF_list <- list(df1, df2)

The desired outcome is this:

library(dplyr)

n1 <- c("Joseph", "Joseph")
c1 <- c("00", "11")
d1 <- c("06/01/21", "06/01/21")

n2 <- c("3rd", "3rd")
c2 <- c("08", "57")
d2 <- c("05/01/21", "05/01/21")

df1 <- data.frame(name = n1, code = c1, date = d1)
df2 <- data.frame(name = n2, code = c2, date = d2)

PDF_list <- list(df1, df2)

one_df <- bind_rows(PDF_list)

So far, I've managed to write this poor excuse of a code. It doesn’t select the substrings and it cheats to get the desired date:

library(tidyverse)
library(tidyr)
library(stringr)

v1 <- c("Worker: Joseph", "06/01/21", "12-00.07", "06/19/21", "useless", "06-11.85")
v2 <- c("useless","99-08-70", "Worker: 3rd", "05/01/21", "useless", "25-57.99", "07/01/21")


df1 <- data.frame(text = v1)
df2 <- data.frame(text = v2)

PDF_list <- list(df1, df2)

for(num in 1:length(PDF_list)){
  worker <- filter(PDF_list[[num]], grepl("Worker:\s*?(\w.+)", text))
  
  code <-  filter(PDF_list[[num]], grepl("-(\d{2}).+", text))
  
  date <- filter(PDF_list[[num]], grepl("^\d{2}/\d{2}.+", text))
  
  if(nrow(date) > 1){
    date <- date[1,1]
  }
  
  t_list <- cbind(worker, code, date)
  names(t_list) <- c("name", "code", "date")
  PDF_list[[num]] <- t_list
}

rm(worker, code, date, t_list)

one_df <- bind_rows(PDF_list)

View(one_df)

Any help? Thanks!

akrun · Accepted Answer

A method using tidyverse

Loop over the list - map, arrange the rows of the data so that row with the 'Worker:' becomes the top row
Bind the list elements as a single dataset with _dfr suffix in map, while creating a grouping index by specifying the .id
Group by 'grp' column
Use summarise to create summarised output with the first 'date' from the pattern two digits followed by /, two digits / and two digits from the start (^) till the end ($) of the string elements in 'text' column
The first element will become 'name' after removing the substring 'Worker:' and any spaces - str_remove
Similarly, we extract the 'code' rows based on capturing the digits from those having only digits with some characters - or .

library(dplyr)
library(stringr)
library(purrr)
PDF_list %>% 
      map_dfr(~ .x %>% 
          arrange(!str_detect(text, 'Worker:')), .id = 'grp') %>% 
      group_by(grp) %>%
      summarise(date = first(text[str_detect(text, "^\d{2}/\d{2}/\d{2}$")]), 
       name = str_remove(first(text), "Worker:\s*"), 
       code = str_replace(text[str_detect(text, '^\d+-(\d+)[.-]\d+$')],
         "^\d+-(\d+)[.-]\d+$", "\1"), .groups = 'drop') %>% 
     select(name, code, date)

-output

# A tibble: 4 x 3
  name   code  date    
        
1 Joseph 00    06/01/21
2 Joseph 11    06/01/21
3 3rd    08    05/01/21
4 3rd    57    05/01/21

R - Regular Expressions (Regex) with a list of Data Frames (only first match)

Answers (1)

Related Questions