eshaa

Reputation: 406

Manipulate String to XML using Java

I have extracted data from a PDF as a string, shown below. (Note the uneven spacing and the newline characters.)

 Virtual Salary                                 25,100.00   EIS EE Contr.                                       7.90
 Virtual Car Allowance                           1,600.00   EPF Employee Contr.                             2,937.00
 Payment Received(Oversea)                       4,265.01   SOCSO Employee Contr.                              19.75

How can I convert this string to XML like the example below?

public void testMethod()
    {
        String extractedTestFromPDF=
                 " Virtual Salary                                 25,100.00   EIS EE Contr.                                       7.90\n"+
                 "\t Virtual Car Allowance                           1,600.00   EPF Employee Contr.                             2,937.00\n"+
                 " Payment Received(Oversea)                       4,265.01   SOCSO Employee Contr.                              19.75\n";

    }

Desired XML:

<xml>
<Data>
    <Allowance>Virtual Salary</Allowance>
    <Allowance_Amount>25,100.00</Allowance_Amount>
</Data>
<Data>
    <Allowance>EIS EE Contr.</Allowance>
    <Allowance_Amount>7.90</Allowance_Amount>
</Data>
<Data>
    <Allowance>Virtual Car Allowance</Allowance>
    <Allowance_Amount>1,600.00</Allowance_Amount>
</Data>
...
</xml>

Upvotes: 1

Views: 63

Answers (1)

Joop Eggen

Reputation: 109547

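import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

String fixedSizetoXML(String extractedTestFromPDF) {
    String[] lines = extractedTestFromPDF.split("\\R");
    // Group 1: the label, matched lazily so that labels of any length work.
    // Group 2: the amount, preceded by at least two whitespace characters.
    Pattern pattern = Pattern.compile("^\\s*(\\S.*?)\\s\\s+([-\\d,\\.]+)\\s+.*$");
    //                                      (------)       (-----------)
    return "<?xml version=\"1.0\"?>\n<Xml>\n"
        + Stream.of(lines)
              .map(pattern::matcher)
              .filter(Matcher::find)
              .map(m -> String.format("<Data>\n"
                            + "    <Allowance>%s</Allowance>\n"
                            + "    <Allowance_Amount>%s</Allowance_Amount>\n"
                            + "</Data>\n",
                            m.group(1).trim(), m.group(2)))
              .collect(Collectors.joining(""))
        + "</Xml>\n";
}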

I took the liberty of adding an XML declaration (<?xml ...?>) and, for clarity, renaming the root element from xml to Xml.

These are records with fixed-length fields. Counting positions is not entirely safe: the text contains a tab character (\t), and special characters complicate the count, since é may be a single char or an e followed by a zero-width combining accent. I therefore used a regex pattern instead, requiring at least two whitespace characters before the amount.
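
To illustrate why counting positions is fragile, here is a minimal sketch (the class name AccentWidthDemo and the method composedLength are made up for this example). It uses java.text.Normalizer to show that the same visible text can have different lengths depending on whether the accent is precomposed or stored as a separate combining character.

```java
import java.text.Normalizer;

// Hypothetical helper, only for illustration.
final class AccentWidthDemo {

    // Length in UTF-16 chars after composing base letters and combining
    // marks into single code points (Unicode normalization form NFC).
    static int composedLength(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC).length();
    }
}
```

For "caf\u00E9" (precomposed é) length() is 4, while for "cafe\u0301" (e plus combining acute accent) length() is 5; composedLength gives 4 for both. A fixed column offset would therefore land one position off in the second form.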


Java 7

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String fixedSizetoXML(String extractedTestFromPDF) {
    // \R is only available from Java 8 on, so split on \r?\n here.
    String[] lines = extractedTestFromPDF.split("\\r?\\n");
    // Group 1: the label, matched lazily so that labels of any length work.
    // Group 2: the amount, preceded by at least two whitespace characters.
    Pattern pattern = Pattern.compile("^\\s*(\\S.*?)\\s\\s+([-\\d,\\.]+)\\s+.*$");
    //                                      (------)       (-----------)
    StringBuilder sb = new StringBuilder(lines.length * 64);
    sb.append("<?xml version=\"1.0\"?>\n<Xml>\n");
    for (String line : lines) {
        Matcher m = pattern.matcher(line);
        if (m.find()) {
            String data = String.format("<Data>\n"
                            + "    <Allowance>%s</Allowance>\n"
                            + "    <Allowance_Amount>%s</Allowance_Amount>\n"
                            + "</Data>\n",
                            m.group(1).trim(), m.group(2));
            sb.append(data);
        }
    }
    sb.append("</Xml>\n");
    return sb.toString();
}
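
Both versions above read only the first label/amount pair on each line, while the desired XML in the question also lists the right-hand column (EIS EE Contr., and so on). The following is a sketch of one possible variation, not a definitive implementation: it calls find() repeatedly so that every pair on a line is emitted. The class name PayslipXmlSketch, the PAIR pattern and the escape helper are my own assumptions. It also assumes that labels never contain two consecutive whitespace characters and that amounts look like 1,600.00.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: emits one <Data> element per label/amount pair.
final class PayslipXmlSketch {

    // A label (starts with a non-space, matched lazily), at least two
    // whitespace chars, then an amount followed by whitespace or line end.
    private static final Pattern PAIR = Pattern.compile(
            "(\\S.*?)\\s{2,}(-?\\d[\\d,]*(?:\\.\\d+)?)(?=\\s|$)");

    // Escapes the characters that are not allowed in XML element text.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    static String toXml(String text) {
        StringBuilder sb = new StringBuilder("<?xml version=\"1.0\"?>\n<Xml>\n");
        for (String line : text.split("\\r?\\n")) {
            Matcher m = PAIR.matcher(line);
            // Each find() continues where the previous match ended,
            // so the left and the right column are both picked up.
            while (m.find()) {
                sb.append("<Data>\n")
                  .append("    <Allowance>")
                  .append(escape(m.group(1).trim()))
                  .append("</Allowance>\n")
                  .append("    <Allowance_Amount>")
                  .append(m.group(2))
                  .append("</Allowance_Amount>\n")
                  .append("</Data>\n");
            }
        }
        return sb.append("</Xml>\n").toString();
    }
}
```

Calling PayslipXmlSketch.toXml(extractedTestFromPDF) on the three sample lines should yield six Data elements, in left-to-right, top-to-bottom order. For anything beyond a quick conversion, building the document with an XML API rather than string concatenation would be the safer choice.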

Upvotes: 1
