Reputation:
I have a data frame from which i want to extract specific value from a string with a condition.
DF1
structure(list(Sr. No. = c("1", "2"), String= c("ABCD, your Account XX1987 has been credited with EUR 22,500.00 on 30-
Oct-17. Info: CAM*CASH DEPOSIT*ELISH SEC. The Available Balance is EUR
22,951.57.",
"WXYZ, Your Ac XXXXXXXX1987 is debited with USD 5,000.00 on 14
May. Info. MMT*125485645*99999999. Your Net Available Balance is
USD 20,531.38.)")), .Names = c("Sr. No.", "String"), row.names = 1:2, class = "data.frame")
DF
Sr. No. String
1. ABCD, your Account XX1987 has been credited with EUR 22,500.00 on 30-
Oct-17. Info: CAM*CASH DEPOSIT*ELISH SEC. The Available Balance is EUR
22,951.57.
2. WXYZ, Your Ac XXXXXXXX1987 is debited with USD 5,000.00 on 14
May. Info. MMT*125485645*99999999. Your Net Available Balance is
USD 20,531.38.
From that Dataframe i want below mentioned dataframe with specific condition.
Conditions:
1. Take first coming word credited/debited/credit/debit as "Credit" or "Debit" in type.
2. Take last four digit after your Account/your Ac/your a/c or your acc (or the string lookes like XXXX1234) in Acc.
3. Take first value coming after credited/debited/credit/debit word in the sring as Fig.
4. Take date after word "on" or which lookes like date from string in Date column.
5. Take description in desc after word Info:
6. Take balance after word Available Balance/Net Balance/Balance or Last Numeric figure in the string.
DF2
Sr.No. Type Acc Fig Date Desc Balance
1 Credit 1987 22,500 30-10-2017 Info: CAM*CASH 22,951
DEPOSIT*ELISH SEC.
2 Debit 1987 5,000 14-May Info. 20,531.38
MMT*125485645*99999999.
Upvotes: 1
Views: 252
Reputation: 79228
Following the pattern, you can combine all the regular expressionas in a single line and extract the information:
pat=c(Account="(?<=X)\\d+",
Type="(credit|debit)",
Fig="(\\w{1,3}\\s\\d+.*\\.\\d+\\s)",
Date="(\\d+\\s\\w+\\.)|(?<=on\\s)(\\d+\\W\\w+\\W\\d+)",
Decs="(Info.*\\.\\s)",
Balance="(?<=Balance\\s\\is\\s).*\\.")
data.frame(mapply(str_extract,DF[2],pat))
String NA. NA..1 NA..2 NA..3 NA..4
1 1987 credit EUR 22,500.00 30-Oct-17 Info: CAM*CASH DEPOSIT*ELISH SEC. EUR 22,951.57.
2 1987 debit USD 5,000.00 14 May. Info. MMT*125485645*99999999. USD 20,531.38.
3 1234 credit INR 187,314.00 31/10/17
Upvotes: 0
Reputation: 9413
I tried to write as general expressions as I could come up with but if the data is not always structured the same way there might be a need to tune Regex
library(stringr)
input = structure(list(
`Sr. No.`=c("1", "2"),
String=c(
"ABCD, your Account XX1987 has been credited with EUR 22,500.00 on 30-Oct-17. Info: CAM*CASH DEPOSIT*ELISH SEC. The Available Balance is EUR 22,951.57.",
"WXYZ, Your Ac XXXXXXXX1987 is debited with USD 5,000.00 on 14 May. Info. MMT*125485645*99999999. Your Net Available Balance is USD 20,531.38.)")),
.Names=c("Sr. No.", "String"), row.names=1:2, class="data.frame")
rule_13 = str_match(input$String, "(credit|debit)ed[^0-9]*((?:EUR|USD|INR|Rs) [0-9,.]+)")
rule_2 = str_match(input$String, "(?:Account|your Ac|your a/c|your acc|XX)[^0-9]*([0-9]+)")
rule_4 = str_match(input$String, " on ([0-9]+[ -](?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)|[0-9]+)(?:[ -][0-9]+)?)")
rule_5 = str_match(input$String, "\\bInfo\\b[^\\w\\d]+(.+)(?=\\. )")
rule_6 = str_match(input$String, "(?:Available Balance|Net Balance|Balance)[^0-9]*([0-9,.]+[0-9])")
data.frame(
Sr.No=input$`Sr. No.`,
Type=rule_13[,2],
Acc=rule_2[,2],
Fig=rule_13[,3],
Data=rule_4[,2],
Desc=rule_5[,2],
Balance=rule_6[,2])
Output
Sr.No Type Acc Fig Data Desc Balance
1 credit 1987 22,500.00 30-Oct-17 CAM*CASH DEPOSIT*ELISH SEC 22,951.57
2 debit 1987 5,000.00 14 May MMT*125485645*99999999 20,531.38
Upvotes: 2