user9211845
user9211845

Reputation:

Extract Specific Value from String R

I have a data frame from which i want to extract specific value from a string with a condition.

    DF1

structure(list(Sr. No. = c("1", "2"), String= c("ABCD, your Account XX1987 has been credited with EUR 22,500.00 on 30-
            Oct-17. Info: CAM*CASH DEPOSIT*ELISH SEC. The Available Balance is EUR 
            22,951.57.", 
    "WXYZ, Your Ac XXXXXXXX1987 is debited with USD 5,000.00 on 14 
            May. Info. MMT*125485645*99999999. Your Net Available Balance is 
            USD 20,531.38.)")), .Names = c("Sr. No.", "String"), row.names = 1:2, class = "data.frame")

DF

Sr. No. String

1.      ABCD, your Account XX1987 has been credited with EUR 22,500.00 on 30-
        Oct-17. Info: CAM*CASH DEPOSIT*ELISH SEC. The Available Balance is EUR 
        22,951.57.
2.      WXYZ, Your Ac XXXXXXXX1987 is debited with USD 5,000.00 on 14 
        May. Info. MMT*125485645*99999999. Your Net Available Balance is 
        USD 20,531.38.

From that Dataframe i want below mentioned dataframe with specific condition.

Conditions:
1. Take first coming word credited/debited/credit/debit as "Credit" or "Debit" in type.
2. Take last four digit after your Account/your Ac/your a/c or your acc (or the string lookes like XXXX1234) in Acc.
3. Take first value coming after credited/debited/credit/debit word in the sring as Fig.
4. Take date after word "on" or which lookes like date from string in Date column.
5. Take description in desc after word Info:
6. Take balance after word Available Balance/Net Balance/Balance or Last Numeric figure in the string.

DF2
    Sr.No.      Type      Acc      Fig     Date       Desc             Balance
    1           Credit    1987     22,500  30-10-2017 Info: CAM*CASH   22,951
                                                      DEPOSIT*ELISH SEC.
    2           Debit     1987     5,000   14-May     Info.            20,531.38
                                                      MMT*125485645*99999999.

Upvotes: 1

Views: 252

Answers (2)

Onyambu
Onyambu

Reputation: 79228

Following the pattern, you can combine all the regular expressionas in a single line and extract the information:

 pat=c(Account="(?<=X)\\d+",
 Type="(credit|debit)",
 Fig="(\\w{1,3}\\s\\d+.*\\.\\d+\\s)",
 Date="(\\d+\\s\\w+\\.)|(?<=on\\s)(\\d+\\W\\w+\\W\\d+)",
 Decs="(Info.*\\.\\s)",
 Balance="(?<=Balance\\s\\is\\s).*\\.")
 data.frame(mapply(str_extract,DF[2],pat))
  String    NA.           NA..1     NA..2                              NA..3          NA..4
1   1987 credit  EUR 22,500.00  30-Oct-17 Info: CAM*CASH DEPOSIT*ELISH SEC.  EUR 22,951.57.
2   1987  debit   USD 5,000.00    14 May.     Info. MMT*125485645*99999999.  USD 20,531.38.
3   1234 credit INR 187,314.00   31/10/17                    

Upvotes: 0

Sergej Andrejev
Sergej Andrejev

Reputation: 9413

I tried to write as general expressions as I could come up with but if the data is not always structured the same way there might be a need to tune Regex

library(stringr)
input = structure(list(
  `Sr. No.`=c("1", "2"), 
  String=c(
    "ABCD, your Account XX1987 has been credited with EUR 22,500.00 on 30-Oct-17. Info: CAM*CASH DEPOSIT*ELISH SEC. The Available Balance is EUR 22,951.57.", 
    "WXYZ, Your Ac XXXXXXXX1987 is debited with USD 5,000.00 on 14 May. Info. MMT*125485645*99999999. Your Net Available Balance is USD 20,531.38.)")), 
  .Names=c("Sr. No.", "String"), row.names=1:2, class="data.frame")

rule_13 = str_match(input$String, "(credit|debit)ed[^0-9]*((?:EUR|USD|INR|Rs) [0-9,.]+)")
rule_2 = str_match(input$String, "(?:Account|your Ac|your a/c|your acc|XX)[^0-9]*([0-9]+)")
rule_4 = str_match(input$String, " on ([0-9]+[ -](?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)|[0-9]+)(?:[ -][0-9]+)?)")
rule_5 = str_match(input$String, "\\bInfo\\b[^\\w\\d]+(.+)(?=\\. )")
rule_6 = str_match(input$String, "(?:Available Balance|Net Balance|Balance)[^0-9]*([0-9,.]+[0-9])")

data.frame(
    Sr.No=input$`Sr. No.`,
    Type=rule_13[,2],
    Acc=rule_2[,2],
    Fig=rule_13[,3],
    Data=rule_4[,2],
    Desc=rule_5[,2],
    Balance=rule_6[,2])

Output

Sr.No   Type  Acc       Fig      Data                       Desc   Balance
    1 credit 1987 22,500.00 30-Oct-17 CAM*CASH DEPOSIT*ELISH SEC 22,951.57
    2  debit 1987  5,000.00    14 May     MMT*125485645*99999999 20,531.38

Upvotes: 2

Related Questions