Reputation: 179
I have the dataframe below, where the regexes work fine for the first two strings but not for the third (whose format differs from the other two). I want code that first checks the format of the string and then runs the appropriate regex, so that the six fields shown in the output below are extracted from every string.
library(stringr)
input = structure(list(
`Sr. No.`=c("1", "2","3"),
String=c(
"ABCD, your Account XX1987 has been credited with EUR 22,500.00 on 30-Oct-17. Info: CAM*CASH DEPOSIT*ELISH SEC. The Available Balance is EUR 22,951.57.",
"WXYZ, Your Ac XXXXXXXX1987 is debited with USD 5,000.00 on 14 May. Info. MMT*125485645*99999999. Your Net Available Balance is USD 20,531.38.",
"INR 187,314.00 credited to your A/c No XXXXXXX1234 on 31/10/17 through NEFT with UTR )")),
.Names=c("Sr. No.", "String"), row.names=1:3, class="data.frame")
rule_13 = str_match(input$String, "(credit|debit)ed[^0-9]*((?:EUR|USD|INR|Rs) [0-9,.]+)")
rule_2 = str_match(input$String, "(?:Account|your Ac|your a/c|your acc|XX)[^0-9]*([0-9]+)")
rule_4 = str_match(input$String, " on ([0-9]+[ -](?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)|[0-9]+)(?:[ -][0-9]+)?)")
rule_5 = str_match(input$String, "\\bInfo\\b[^\\w\\d]+(.+)(?=\\. )")
rule_6 = str_match(input$String, "(?:Available Balance|Net Balance|Balance)[^0-9]*([0-9,.]+[0-9])")
data.frame(
Sr.No=input$`Sr. No.`,
Type=rule_13[,2],
Acc=rule_2[,2],
Fig=rule_13[,3],
Data=rule_4[,2],
Desc=rule_5[,2],
Balance=rule_6[,2])
Output:
Sr.No Type Acc Fig Data Desc Balance
1 credit 1987 22,500.00 30-Oct-17 CAM*CASH DEPOSIT*ELISH SEC 22,951.57
2 debit 1987 5,000.00 14 May MMT*125485645*99999999 20,531.38
3 credit 1234 31/10/2017
Upvotes: 1
Views: 102
Reputation: 626861
You may use two regexps to make things simpler and more readable: run your rule_13 regex, then an additional regex that matches the other format. Check whether the groups of the first regex matched and, if they did not, use the value obtained with the second regex.
input <- "INR 187,314.00 credited to your A/c No XXXXXXX1234 on 31/10/17 through NEFT with UTR )"
rule_13 = str_match(input, "(credit|debit)ed[^0-9]*((?:EUR|USD|INR|Rs) [0-9,.]+)")
##> rule_13
## [,1] [,2] [,3]
##[1,] NA NA NA
rule_13_1 = str_match(input, "(?:EUR|USD|INR|Rs)\\s*(\\d[0-9,.]*)\\b")
##> rule_13_1
## [,1] [,2]
##[1,] "INR 187,314.00" "187,314.00"
# Column 3 of rule_13 holds the amount; column 2 is the credit/debit type
fig1 <- ifelse(!is.na(rule_13[,3]),rule_13[,3],rule_13_1[,2])
fig1
## => [1] "187,314.00"
So, you will just have to replace Fig=rule_13[,3], with Fig=fig1, in the data.frame() call.
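As a rough sketch of how this fallback could be applied to all three strings from the question at once, the snippet below runs both patterns over a vector and picks the first one that matched. The names `strings`, `fig_primary`, `fig` and `type` are made up for illustration. Stripping the currency code from the first pattern's capture and taking Type from a plain `(credit|debit)ed` match are assumptions added here so that all rows come out in the same shape; they are not part of the answer above.

```r
library(stringr)

# The three sample messages from the question
strings <- c(
  "ABCD, your Account XX1987 has been credited with EUR 22,500.00 on 30-Oct-17. Info: CAM*CASH DEPOSIT*ELISH SEC. The Available Balance is EUR 22,951.57.",
  "WXYZ, Your Ac XXXXXXXX1987 is debited with USD 5,000.00 on 14 May. Info. MMT*125485645*99999999. Your Net Available Balance is USD 20,531.38.",
  "INR 187,314.00 credited to your A/c No XXXXXXX1234 on 31/10/17 through NEFT with UTR )"
)

# Primary pattern: the amount follows "credited"/"debited"
rule_13 <- str_match(strings, "(credit|debit)ed[^0-9]*((?:EUR|USD|INR|Rs) [0-9,.]+)")

# Fallback pattern: the first currency amount anywhere in the string
rule_13_1 <- str_match(strings, "(?:EUR|USD|INR|Rs)\\s*(\\d[0-9,.]*)\\b")

# Assumption: drop the leading currency code so both sources have the same format
fig_primary <- str_remove(rule_13[, 3], "^[A-Za-z]+\\s*")

# Use the primary capture where it matched, otherwise the fallback
fig <- ifelse(!is.na(fig_primary), fig_primary, rule_13_1[, 2])

# Assumption: the type can be read independently of where the amount sits
type <- str_match(strings, "(credit|debit)ed")[, 2]

data.frame(Type = type, Fig = fig)
# fig is "22,500.00", "5,000.00", "187,314.00"
```

The same `ifelse(!is.na(a), a, b)` idiom extends to the other fields if they also need a second pattern.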
The second pattern matches:
(?:EUR|USD|INR|Rs) - either EUR, USD, INR or Rs substring
\\s* - 0+ whitespaces
(\\d[0-9,.]*) - Group 1 (will be in [,2]; the [,1] is the whole match): a digit followed with 0+ digits, commas (,) or dots (.)
\\b - a word boundary.
Upvotes: 1
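To illustrate what each piece of the second pattern contributes, here is a small sketch on made-up strings. The `samples` vector is hypothetical and not taken from the question; each entry is chosen to exercise one part of the pattern.

```r
library(stringr)

# Hypothetical sample strings, one per feature of the pattern
samples <- c(
  "Rs1,200 paid",               # no space after the currency: \\s* matches zero whitespace
  "USD   5,000.00 on 14 May",   # several spaces: \\s* consumes all of them
  "Balance is EUR 22,951.57."   # trailing full stop: \\b makes the group end on a digit
)

m <- str_match(samples, "(?:EUR|USD|INR|Rs)\\s*(\\d[0-9,.]*)\\b")

m[, 1]  # whole match, currency code included
m[, 2]  # Group 1: "1,200", "5,000.00", "22,951.57"
```

In the third string the character class alone would also swallow the final full stop; the word boundary forces the match to back off by one character so the captured amount ends on a digit.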