Reputation: 17250
I'm trying to convert the following XML to CSV using XPath 3.0 (xidel --xpath
):
<?xml version="1.0" encoding="utf-8" ?>
<csv>
<record>
<field1>A</field1>
<field2>B</field2>
<field3>C</field3>
</record>
<record>
<field2> </field2>
<field3></field3>
</record>
<record>
<field1>,,</field1>
<field2>""</field2>
<field3>..</field3>
<field3>.
.</field3>
</record>
</csv>
My expected output would be:
field1,field2,field3
A,B,C
, ,""
",,","""""",".
."
I've got a few problems (the first one isn't specific to xidel
):
I get the fields names with distinct-values(/csv/record/*/name())
; how can I use that sequence for extracting the data in the records?
I would like to differentiate between a missing and an empty field but the text()
selector of xidel
doesn't seem to care about that; is it a XPath feature or a xidel
bug?
I can't make return
work; does xidel
use a different syntax?
I solved #1 myself and #3 was resolved by @ConalTuohy in his comment.
Here's what I got now:
#!/bin/bash
IFS='' read -r -d '' xpath <<'EOF'
let $csv-escape-string := function($str as xs:string) as xs:string {
if ( matches( $str, ',|"|\n' ) )
then
concat('"',replace($str,'"','""'),'"')
else
$str
},
$fields-names := distinct-values(/csv/record/*/name()),
$csv := (
string-join( $fields-names, ',' ),
/csv/record/string-join(
(
for $fn in $fields-names
return $csv-escape-string(string( *[name()=$fn][last()]/text() ))
), ','
)
)
return $csv
EOF
xidel --xpath "$xpath" file.xml
But the output isn't what I would like it to be:
field1,field2,field3
A,B,C
,,
",,","""""",".
."
Could someone try it with an other XPath-3 processor for making sure that it is xidel
that is normalizing text()
?
Upvotes: 1
Views: 513
Reputation: 17250
I've made "better" alternatives afterwards.
The main improvement is that now the selector /csv/record
only needs to be specified once.
let
$records := /csv/record,
$fields := distinct-values( $records/*/local-name() )
return (
string-join( $fields, "," ),
for $r in $records return
string-join(
for $f in $fields return
"""" || replace($r/*[local-name()=$f][last()],"""","""""") || """",
","
)
)
field1,field2,field3
"A","B","C"
""," ",""
",,","""""",".
."
null
and empty
values, and quoting fields only when required:let $csv-escape-string := function($str as xs:string?) as xs:string {
if ( not(exists($str)) ) then
""
else if ( $str = "" ) then
""""""
else if ( matches( $str, ",|""""|\n" ) ) then
"""" || replace($str,"""","""""") || """"
else
$str
},
$records := /csv/record,
$fields := distinct-values( $records/*/local-name() )
return (
string-join( $fields, "," ),
for $r in $records return
string-join(
for $f in $fields return
$csv-escape-string($r/*[local-name()=$f][last()]),
","
)
)
field1,field2,field3
A,B,C
, ,""
",,","""""",".
."
Upvotes: 1
Reputation: 3443
But the output isn't what I would like it to be
Most likely your xidel
binary is too old.
xidel --xpath "$xpath" file.xml
The reason you put the query in $xpath
is because of the single- and double-quotes, I guess? It sure is a nifty way to circumvent this problem, but not needed if you use double-quotes for strings. --xpath "function('string')"
is for cmd, while --xpath 'function("string")'
is for bash. To escape a double-quote within a string simply enter a second one:
$ xidel -s input.xml -e '
let $field-names:=distinct-values(csv/record/*/name()) return (
join($field-names,","),
csv/record/join(
for $fn in $field-names return
string(*[name()=$fn][last()]) ! (
if (matches(.,",|""|\n"))
then concat("""",replace(.,"""",""""""),"""")
else .
),","
)
)
'
Alternatively you can use the XQuery notation:
if (matches(.,",|"|\n"))
then concat(""",replace(.,""",""""),""")
else .
You can also use xidel
's own "extended string"-syntax (and use another let-variable instead of !
):
$ xidel -s input.xml -e '
let $field-names:=distinct-values(csv/record/*/name()) return (
join($field-names,","),
csv/record/join(
for $fn in $field-names
let $val:=string(*[name()=$fn][last()])
return
if (matches($val,",|""|\n"))
then x"""{replace($val,"""","""""")}"""
else $val,
","
)
)
'
Upvotes: 2
Reputation: 3258
When I ran your code on XML Workbench I got the following result:
A,B,C
, ,
",,","""""",".
."
NB I changed the last line of the query to return string-join($csv, codepoints-to-string(10))
to return the entire CSV as a single string, so XML Workbench didn't quote each string in the result sequence individually, as it otherwise would.
Upvotes: 1