laloune

Reputation: 673

Extract xml from csv

I am trying to use awk to extract a block of XML from a shell variable that holds CSV data.

The CSV comes from a web service that returns the following:

2;1;"<?xml version=""1.0"" encoding=""UTF-8""?>
<project name=""ETLTasks"" version=""6.0"" modified=""1479827853273"" modifiedBy=""admin"" format=""strict"" olapId=""p0"">
  <headers>
    <header name=""comment"" modified=""1394702840960"" modifiedBy="""">
      <comment><![CDATA[Automated tasks for OLAP Server:
- CubeCopy
- CubeRulesCalc]]></comment>
    </header>
  </headers>
</project>
";

I am trying to use awk to extract the XML. Each doubled double quote should be replaced by a single double quote, so that format=""strict"" becomes format="strict".

For now I have the following, but it does not replace the doubled double quotes as intended:

etlDefinitionClean=`echo -n "$etlDefinition" | cut -d";" -f3`
etlDefClean="${etlDefinitionClean%\"}"
etlDefClean="${etlDefClean#\"}"
awk -F "\"*;\"*" '{ gsub(/\"\"/, "\"", $2) } {print $2}' "$etlDefClean"  > "$fileOut"

What I want to achieve in the end is the following:

<project name="ETLTasks" version="6.0" modified="1479827853273" modifiedBy="admin" format="strict" olapId="p0">
  <headers>
    <header name="comment" modified="1394702840960" modifiedBy="">
      <comment><![CDATA[Automated tasks for OLAP Server:
- CubeCopy
- CubeRulesCalc]]></comment>
    </header>
  </headers>
</project>

and write that to a file.

Upvotes: 0

Views: 66

Answers (1)

jas

Reputation: 10865

The command

awk -F '^(2;1;")|(";)' -v RS="" -v dq='""' -v q='"' '{gsub(dq,q,$2); print $2}' csvx.data

gives you the desired result:

<?xml version="1.0" encoding="UTF-8"?>
<project name="ETLTasks" version="6.0" modified="1479827853273" modifiedBy="admin" format="strict" olapId="p0">
  <headers>
    <header name="comment" modified="1394702840960" modifiedBy="">
      <comment><![CDATA[Automated tasks for OLAP Server:
- CubeCopy
- CubeRulesCalc]]></comment>
    </header>
  </headers>
</project>
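
The question holds the CSV in a shell variable rather than a file, so the same program can read from standard input instead. A minimal sketch, assuming the variable names etlDefinition and fileOut from the question; the sample data is a shortened, hypothetical stand-in for the real response, and mktemp is used only to keep the example self-contained:

```shell
# Hypothetical, shortened stand-in for the web service response.
etlDefinition='2;1;"<?xml version=""1.0"" encoding=""UTF-8""?>
<project name=""ETLTasks"" format=""strict"">
</project>
";'

# Hypothetical output path; the question only names the variable fileOut.
fileOut=$(mktemp)

# Same awk program as above, reading from standard input instead of a file.
printf '%s\n' "$etlDefinition" |
  awk -F '^(2;1;")|(";)' -v RS="" -v dq='""' -v q='"' \
      '{ gsub(dq, q, $2); print $2 }' > "$fileOut"

cat "$fileOut"
```

Because $2 still ends with the newline that precedes the closing quote, print adds one more newline and the file ends with an empty line. The field separator also hard-codes the 2;1; prefix, so it would need adjusting for rows whose first two columns differ.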

Using -v to create the quotes is just a convenience to avoid lots of escaping. An equivalent command would be:

awk -F '^(2;1;")|(";)' -v RS="" '{gsub("\"\"", "\"", $2); print $2}' csvx.data

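-v RS="" sets the record separator to the empty string, a special value that puts awk in paragraph mode: each run of consecutive non-empty lines, delimited by one or more blank lines, is treated as a single record. That is what lets the whole multi-line XML land in one field.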
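
To see the effect of RS="" in isolation, here is a small sketch that counts records with and without it. The input is made up for illustration and is not part of the original data:

```shell
# Two blocks of lines separated by one blank line (illustrative input).
input=$(printf 'a\nb\n\nc\nd\n')

# Default RS: every line, including the blank one, is a record.
lines=$(printf '%s\n' "$input" | awk 'END { print NR }')

# RS="": each blank-line-separated block is one record.
paragraphs=$(printf '%s\n' "$input" | awk -v RS="" 'END { print NR }')

echo "records by line: $lines, records by paragraph: $paragraphs"
```

The first count is 5 and the second is 2, which is why the XML above, having no blank lines inside it, arrives as one record.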

Upvotes: 2
