Reputation: 49
So, I'm the happy owner of a 17246 list of data frames and need to extract 3 data from each of them:
Each data frame contains data about just one worker. But the data is inputted differently: It always starts by the regular expression “Worker:” + “Name or number identification”.
So, I can find the data with a regular expression that targets “Worker:”
I can also target the first regular expression that represents a date: “dd/dd/dd”
The desired output is a df with 3 columns (“Worker”, “Code”, “Date”) and then unite all dfs into one. In order to achieve this end, I find myself with three problems:
a) The information is presented in no order (cannot subset specific rows).
b) The intended worker and code are a substring inside other characters.
c) More then one date is presented on each df and I only desire the first match. All other dates are misleading.
The input is this:
v1 <- c("Worker: Joseph", "06/01/21", "12-00.07", "06/19/21", "useless", "06-11.85")
v2 <- c("useless","99-08-70", "Worker: 3rd", "05/01/21", "useless", "25-57.99", "07/01/21")
df1 <- data.frame(text = v1)
df2 <- data.frame(text = v2)
PDF_list <- list(df1, df2)
The desired outcome is this:
library(dplyr)
n1 <- c("Joseph", "Joseph")
c1 <- c("00", "11")
d1 <- c("06/01/21", "06/01/21")
n2 <- c("3rd", "3rd")
c2 <- c("08", "57")
d2 <- c("05/01/21", "05/01/21")
df1 <- data.frame(name = n1, code = c1, date = d1)
df2 <- data.frame(name = n2, code = c2, date = d2)
PDF_list <- list(df1, df2)
one_df <- bind_rows(PDF_list)
So far, I've managed to write this poor excuse of a code. It doesn’t select the substrings and it cheats to get the desired date:
library(tidyverse)
library(tidyr)
library(stringr)
v1 <- c("Worker: Joseph", "06/01/21", "12-00.07", "06/19/21", "useless", "06-11.85")
v2 <- c("useless","99-08-70", "Worker: 3rd", "05/01/21", "useless", "25-57.99", "07/01/21")
df1 <- data.frame(text = v1)
df2 <- data.frame(text = v2)
PDF_list <- list(df1, df2)
for(num in 1:length(PDF_list)){
worker <- filter(PDF_list[[num]], grepl("Worker:\\s*?(\\w.+)", text))
code <- filter(PDF_list[[num]], grepl("-(\\d{2}).+", text))
date <- filter(PDF_list[[num]], grepl("^\\d{2}/\\d{2}.+", text))
if(nrow(date) > 1){
date <- date[1,1]
}
t_list <- cbind(worker, code, date)
names(t_list) <- c("name", "code", "date")
PDF_list[[num]] <- t_list
}
rm(worker, code, date, t_list)
one_df <- bind_rows(PDF_list)
View(one_df)
Any help? Thanks!
Upvotes: 0
Views: 575
Reputation: 887048
A method using tidyverse
list
- map
, arrange
the rows of the data so that row with the 'Worker:' becomes the top rowlist
elements as a single dataset with _dfr
suffix in map
, while creating a grouping index by specifying the .id
summarise
to create summarised output with the first
'date' from the pattern two digits followed by /
, two digits /
and two digits from the start (^
) till the end ($
) of the string elements in 'text' columnstr_remove
-
or .
library(dplyr)
library(stringr)
library(purrr)
PDF_list %>%
map_dfr(~ .x %>%
arrange(!str_detect(text, 'Worker:')), .id = 'grp') %>%
group_by(grp) %>%
summarise(date = first(text[str_detect(text, "^\\d{2}/\\d{2}/\\d{2}$")]),
name = str_remove(first(text), "Worker:\\s*"),
code = str_replace(text[str_detect(text, '^\\d+-(\\d+)[.-]\\d+$')],
"^\\d+-(\\d+)[.-]\\d+$", "\\1"), .groups = 'drop') %>%
select(name, code, date)
-output
# A tibble: 4 x 3
name code date
<chr> <chr> <chr>
1 Joseph 00 06/01/21
2 Joseph 11 06/01/21
3 3rd 08 05/01/21
4 3rd 57 05/01/21
Upvotes: 1