Vector JX

Reputation: 179

Run Regex in R dataframe based on structure of string

I have the data frame below. My regexes work fine for the first two strings but not for the third, whose format differs from the other two. I want code that first checks the format of each string and then runs the appropriate regex, so that I get the six fields shown in the output below.

    library(stringr)
    input = structure(list(
      `Sr. No.`=c("1", "2","3"), 
      String=c(
        "ABCD, your Account XX1987 has been credited with EUR 22,500.00 on 30-Oct-17. Info: CAM*CASH DEPOSIT*ELISH SEC. The Available Balance is EUR 22,951.57.", 
        "WXYZ, Your Ac XXXXXXXX1987 is debited with USD 5,000.00 on 14 May. Info. MMT*125485645*99999999. Your Net Available Balance is USD 20,531.38.",
        "INR 187,314.00 credited to your A/c No XXXXXXX1234 on 31/10/17 through NEFT with UTR )")), 
      .Names=c("Sr. No.", "String"), row.names=1:3, class="data.frame")

    rule_13 = str_match(input$String, "(credit|debit)ed[^0-9]*((?:EUR|USD|INR|Rs) [0-9,.]+)")
    rule_2 = str_match(input$String, "(?:Account|your Ac|your a/c|your acc|XX)[^0-9]*([0-9]+)")
    rule_4 = str_match(input$String, " on ([0-9]+[ -](?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)|[0-9]+)(?:[ -][0-9]+)?)")
    rule_5 = str_match(input$String, "\\bInfo\\b[^\\w\\d]+(.+)(?=\\. )")
    rule_6 = str_match(input$String, "(?:Available Balance|Net Balance|Balance)[^0-9]*([0-9,.]+[0-9])")

    data.frame(
        Sr.No=input$`Sr. No.`,
        Type=rule_13[,2],
        Acc=rule_2[,2],
        Fig=rule_13[,3],
        Data=rule_4[,2],
        Desc=rule_5[,2],
        Balance=rule_6[,2])

Output:

Sr.No   Type  Acc       Fig      Data                       Desc   Balance
    1 credit 1987 22,500.00 30-Oct-17 CAM*CASH DEPOSIT*ELISH SEC 22,951.57
    2  debit 1987  5,000.00    14 May     MMT*125485645*99999999 20,531.38
    3 credit 1234           31/10/2017

Upvotes: 1

Views: 102

Answers (1)

Wiktor Stribiżew

Reputation: 626861

You may use two regexps to keep things simple and readable: run your rule_13 regex, then an additional regex that matches the other format. Check whether the first regex matched (its groups are NA when it did not) and, if it did not, use the value obtained with the second regex.

    library(stringr)
    input <- "INR 187,314.00 credited to your A/c No XXXXXXX1234 on 31/10/17 through NEFT with UTR )"
    rule_13 = str_match(input, "(credit|debit)ed[^0-9]*((?:EUR|USD|INR|Rs) [0-9,.]+)")
    ##> rule_13
    ##     [,1] [,2] [,3]
    ##[1,] NA   NA   NA  
    rule_13_1 = str_match(input, "(?:EUR|USD|INR|Rs)\\s*(\\d[0-9,.]*)\\b")
    ##> rule_13_1
    ##     [,1]             [,2]        
    ##[1,] "INR 187,314.00" "187,314.00"
    # Column 3 of rule_13 holds the amount (column 2 is the credit/debit type);
    # fall back to the second regex when the first one did not match.
    fig1 <- ifelse(!is.na(rule_13[,3]), rule_13[,3], rule_13_1[,2])
    fig1
    ## => [1] "187,314.00"

So you just have to replace Fig=rule_13[,3] with Fig=fig1 in the data.frame() call.
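To show how the fallback behaves on all three strings from the question, here is a minimal sketch. It makes a few assumptions that are not in the original code: the rule_13 pattern is slightly varied so that the currency code stays outside the amount group, the type is matched with its own small regex so it is still found in the third format, and the date separator class also accepts "/". Treat it as an illustration rather than a drop-in solution.

```r
library(stringr)

strings <- c(
  "ABCD, your Account XX1987 has been credited with EUR 22,500.00 on 30-Oct-17. Info: CAM*CASH DEPOSIT*ELISH SEC. The Available Balance is EUR 22,951.57.",
  "WXYZ, Your Ac XXXXXXXX1987 is debited with USD 5,000.00 on 14 May. Info. MMT*125485645*99999999. Your Net Available Balance is USD 20,531.38.",
  "INR 187,314.00 credited to your A/c No XXXXXXX1234 on 31/10/17 through NEFT with UTR )"
)

# Amount that follows "credited"/"debited" (first two formats).
# Variation on rule_13 (assumption): the currency code is outside the amount group.
rule_13 <- str_match(strings, "(credit|debit)ed[^0-9]*(?:EUR|USD|INR|Rs) ([0-9,.]+)")

# Fallback: first "<currency> <amount>" anywhere in the string (third format).
rule_13_1 <- str_match(strings, "(?:EUR|USD|INR|Rs)\\s*(\\d[0-9,.]*)\\b")

# Type matched on its own, so it does not depend on where the amount sits (assumption).
type <- str_match(strings, "(credit|debit)ed")[, 2]

# Date: separator class extended with "/" so that 31/10/17 is accepted (assumption).
rule_4 <- str_match(strings, " on ([0-9]+[ /-](?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)|[0-9]+)(?:[ /-][0-9]+)?)")

# Prefer the first regex; use the fallback only where it returned NA.
fig <- ifelse(!is.na(rule_13[, 3]), rule_13[, 3], rule_13_1[, 2])

result <- data.frame(Type = type, Fig = fig, Date = rule_4[, 2],
                     stringsAsFactors = FALSE)
print(result)
```

The order of the two regexes matters: the first one is tied to the credited/debited verb, so it cannot pick up the balance amount by mistake, while the fallback simply takes the first currency amount it sees.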

The second pattern matches:

  • (?:EUR|USD|INR|Rs) - one of the substrings EUR, USD, INR or Rs
  • \\s* - 0+ whitespace characters
  • (\\d[0-9,.]*) - Group 1 (found in [,2]; [,1] is the whole match): a digit followed by 0+ digits, commas or dots
  • \\b - a word boundary, so the amount must end right after a digit rather than after a trailing comma or dot.
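The effect of the trailing \\b is easiest to see on an amount that ends a sentence. The sketch below compares the pattern with and without the boundary; the variable names and the second sample string are mine, chosen only for illustration.

```r
library(stringr)

pattern     <- "(?:EUR|USD|INR|Rs)\\s*(\\d[0-9,.]*)\\b"
no_boundary <- "(?:EUR|USD|INR|Rs)\\s*(\\d[0-9,.]*)"

x <- "The Available Balance is EUR 22,951.57."

# With \\b the regex engine backtracks off the sentence-final period.
with_b <- str_match(x, pattern)[, 2]
# Without it, the greedy [0-9,.]* also takes the final ".".
without_b <- str_match(x, no_boundary)[, 2]

print(with_b)
print(without_b)

# \\s* also allows the amount to follow the currency code directly.
no_space <- str_match("Paid Rs500 today", pattern)[, 2]
print(no_space)
```

Here with_b is "22,951.57", while without_b keeps the trailing period as "22,951.57.".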

Upvotes: 1
