rak953
rak953

Reputation: 29

Parsing a string in R and comparing the values with another column

This may seem trivial but I am really stuck at this problem of comparing a value with this complex string

My data frame looks like this:

Id History Report Month
1001 Jun:2020,030/XXX-May:2020,035/XXX-Apr:2020,040/XXX-Mar:2020,060/XXX July 2021
1003 Jun:2017,823/XXX-May:2017,000/XXX-Apr:2017,000/XXX-Mar:2017,000/XXX July 2021
1005 Apr:2019,000/XXX-Mar:2019,800/XXX-Feb:2019,000/XXX-Jan:2019,000/XXX July 2021
1006 Jun:2020,000/XXX-May:2020,030/XXX-Apr:2020,060/XXX-Mar:2020,090/XXX July 2021

Key, value pair from the column history that will be used in comparison are as following:

Id : 1001 - Jun 2020,030   May 2020, 035   Apr 2020, 040...... 
Id : 1003 - Jun 2017,823   May 2017, 000   Apr 2017, 000...... 

Problem statement is: I want to compare these key, value pair with the report month (i.e. always current month) and make a conditional column based on it. Logic is: 24 months (could be 12 or 36) preceding July 2021 i.e July 2021-Jun 2019, how many key,value pairs have value >= 30 or >= 60 etc for months that lie within this time period. So if a string starts from <Jun 2019, like for 1003, the answer should be 0.

Output

Id Report Month +30_last_24 +30_last_36
1001 July 2021 4 4
1003 July 2021 0 0
1005 July 2021 0 1
1006 July 2021 3 3

I started with R very recently and have no solution to even begin with, so any help would be deeply appreciated.

MODIFIED ORIGINAL DATASET

df <- read.table(header = T, text = "Id History ReportMonth
1001    Jun:2020,030/XXX|May:2020,035/XXX|Apr:2020,040/XXX|Mar:2020,060/XXX 'July 2021'
1003    Jun:2017,DDD/XXX|May:2017,030/XXX|Apr:2017,DDD/STD|Mar:2017,000/XXX 'July 2021'
1005    Apr:2019,000/XXX|Mar:2019,800/DDD|Feb:2019,000/XXX|Jan:2019,000/XXX 'July 2021'
1006    Jun:2020,000/XXX|May:2020,030/XXX|Apr:2020,060/XXX|Mar:2020,090/XXX 'July 2021'")

Upvotes: 1

Views: 137

Answers (2)

AnilGoyal
AnilGoyal

Reputation: 26238

Revised Strategy in view of modifications-

  • separate rows using | but only after escaping it with\\
  • separate into cols using ,
  • extracts digits from values using gsub
  • rest is pretty obvious.

Feel free to ask clarifications, if any.

df <- read.table(header = T, text = "Id History ReportMonth
1001    Jun:2020,030/XXX|May:2020,035/XXX|Apr:2020,040/XXX|Mar:2020,060/XXX 'July 2021'
1003    Jun:2017,DDD/XXX|May:2017,030/XXX|Apr:2017,DDD/STD|Mar:2017,000/XXX 'July 2021'
1005    Apr:2019,000/XXX|Mar:2019,800/DDD|Feb:2019,000/XXX|Jan:2019,000/XXX 'July 2021'
1006    Jun:2020,000/XXX|May:2020,030/XXX|Apr:2020,060/XXX|Mar:2020,090/XXX 'July 2021'")

library(tidyverse)
library(lubridate, warn.conflicts = F)

df %>%
  separate_rows(History, sep = '\\|') %>%
  separate(History, into = c('Hist_mon', 'Hist_val'), sep = ',') %>%
  mutate(Hist_mon = dmy(paste0('1:', Hist_mon)),
         Hist_val = as.numeric(gsub('(\\D*)', '', Hist_val)),
         ReportMonth = dmy(paste0('1 ', ReportMonth))) %>%
  group_by(Id, ReportMonth) %>%
  summarise(last_24_30 = sum(Hist_val >= 30 & Hist_mon >= ReportMonth %m-% months(24)),
            last_36_30 = sum(Hist_val >= 30 & Hist_mon >= ReportMonth %m-% months(36)), .groups = 'drop')
#> # A tibble: 4 x 4
#>      Id ReportMonth last_24_30 last_36_30
#>   <int> <date>           <int>      <int>
#> 1  1001 2021-07-01           4          4
#> 2  1003 2021-07-01           0          0
#> 3  1005 2021-07-01           0          1
#> 4  1006 2021-07-01           3          3

Created on 2021-07-16 by the reprex package (v2.0.0)

Upvotes: 2

Onyambu
Onyambu

Reputation: 79358

library(tidyverse)
library(lubridate)
df %>%
  separate_rows(History, sep = '[|]')%>%
  filter(str_detect(History, "\\w"), str_detect(History, "\\d+/"))%>%
  separate(History, c("Date", "Value", "d"), sep = '[,/]', convert = TRUE) %>%
  mutate(across(c(Date,ReportMonth), ~myd(paste(.x, "01")))) %>%
  group_by(Id) %>%
  summarise(r = list(map(c(m24 = 24, m36 = 36), ~sum(
    Date + months(.x) > ReportMonth & Value >= 30)))) %>%
  unnest_wider(r) %>%
  right_join(df, 'Id')

  
# A tibble: 4 x 5
     Id   m24   m36 History_Report                                                      Month    
  <int> <int> <int> <chr>                                                               <chr>    
1  1001     4     4 Jun:2020,030/XXX-May:2020,035/XXX-Apr:2020,040/XXX-Mar:2020,060/XXX July 2021
2  1003     0     0 Jun:2017,823/XXX-May:2017,000/XXX-Apr:2017,000/XXX-Mar:2017,000/XXX July 2021
3  1005     0     1 Apr:2019,000/XXX-Mar:2019,800/XXX-Feb:2019,000/XXX-Jan:2019,000/XXX July 2021
4  1006     3     3 Jun:2020,000/XXX-May:2020,030/XXX-Apr:2020,060/XXX-Mar:2020,090/XXX July 2021

Upvotes: 1

Related Questions