Reputation: 413
I have a text file that contains data in the following format. This is a sample of the data it contains. The file is correct and in the correct format:
<node id="1647008557" lat="36.6536840" lon="-121.7938995" version="1" timestamp="2012-02-25T14:03:54Z" changeset="10787766" uid="294728" user="skew-t">
<tag k="highway" v="turning_circle"/>
</node>
<way id="10459706" version="2" timestamp="2010-03-27T18:21:32Z" changeset="4247030" uid="20587" user="balrog-kun">
<nd ref="89705976"/>
<nd ref="89798118"/>
<nd ref="89798120"/>
<nd ref="89798122"/>
<nd ref="89798124"/>
<nd ref="89798126"/>
<nd ref="89798128"/>
<nd ref="89798130"/>
<tag k="highway" v="residential"/>
<tag k="name" v="Engineer Road"/>
<tag k="tiger:cfcc" v="A41"/>
<tag k="tiger:county" v="Livingston, CA"/>
<tag k="tiger:name_base" v="Engineer"/>
<tag k="tiger:name_type" v="Rd"/>
<tag k="tiger:reviewed" v="no"/>
<tag k="tiger:separated" v="no"/>
<tag k="tiger:source" v="tiger_import_dch_v0.6_20070809"/>
<tag k="tiger:tlid" v="196844016"/>
</way>
<way id="10461171" version="3" timestamp="2014-01-07T00:17:59Z" changeset="19855176" uid="1871178" user="RBoggs">
<nd ref="89804458"/>
<nd ref="89804460"/>
<nd ref="89804463"/>
<nd ref="89804464"/>
<nd ref="89804466"/>
<nd ref="89804468"/>
<tag k="access" v="no"/>
<tag k="highway" v="residential"/>
<tag k="motor_vehicle" v="no"/>
<tag k="name" v="5th Cutoff Street"/>
<tag k="tiger:cfcc" v="A41"/>
<tag k="tiger:county" v="Marysville, CA"/>
<tag k="tiger:name_base" v="5th Cutoff"/>
<tag k="tiger:name_type" v="St"/>
<tag k="tiger:reviewed" v="no"/>
</way>
<way id="151860745" version="1" timestamp="2012-02-25T14:03:59Z" changeset="10787766" uid="294728" user="skew-t">
<nd ref="1647008614"/>
<nd ref="1647008545"/>
<nd ref="1647008605"/>
<nd ref="1647008555"/>
<nd ref="1647008557"/>
<tag k="highway" v="service"/>
</way>
And I am trying to print out the name within the way id section along with the way id itself, the sequence number the nd ref is at, and the nd ref id.
The correct output should look like this:
$ awk -f table.awk file.txt | head
road,way_id,seq_num,node_ref_id
Engineer Road,10459706,1,89705976
Engineer Road,10459706,2,89798118
Engineer Road,10459706,3,89798120
Engineer Road,10459706,4,89798122
Engineer Road,10459706,5,89798124
Engineer Road,10459706,6,89798126
Engineer Road,10459706,7,89798128
Engineer Road,10459706,8,89798130
5th Cutoff Street,10461171,1,89804458
5th Cutoff Street,10461171,2,89804460
5th Cutoff Street,10461171,3,89804463
5th Cutoff Street,10461171,4,89804464
5th Cutoff Street,10461171,5,89804466
5th Cutoff Street,10461171,6,89804468
How would I print that output by ignoring the lines that do not contain <tag k="name" within the <way> tag?
Upvotes: 1
Views: 167
Reputation: 3443
"Gilles Quenot" already told you to use a proper XML/HTML parser and he mentions Xidel is one of them.
I've saved your XML file as 'so_49592301.xml'.
The header row, as a string, is easy:
$ ./xidel -s so_49592301.xml -e '"road,way_id,seq_num,node_ref_id"'
Next you select the <way> element node, but only those that hold a <tag> child node with the attribute k="name":
$ ./xidel -s so_49592301.xml -e '"road,way_id,seq_num,node_ref_id"' -e '//way[tag[@k="name"]]'
Next you select the <nd> child node and do a string join on the index and the ref attribute, with a comma as separator:
$ ./xidel -s so_49592301.xml -e '"road,way_id,seq_num,node_ref_id"' -e '//way[tag[@k="name"]]/nd/join((position(),@ref),",")'
road,way_id,seq_num,node_ref_id
1,89705976
2,89798118
3,89798120
4,89798122
5,89798124
6,89798126
7,89798128
8,89798130
9,89804458
10,89804460
11,89804463
12,89804464
13,89804466
14,89804468
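Notice the index doesn't start over with the next <way> element node? That is because position() is counted over all selected <nd> nodes together. This can easily be fixed by putting nd/... in parentheses, so that it is evaluated once per <way>: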
$ ./xidel -s so_49592301.xml -e '"road,way_id,seq_num,node_ref_id"' -e '//way[tag[@k="name"]]/(nd/join((position(),@ref),","))'
road,way_id,seq_num,node_ref_id
1,89705976
2,89798118
3,89798120
4,89798122
5,89798124
6,89798126
7,89798128
8,89798130
1,89804458
2,89804460
3,89804463
4,89804464
5,89804466
6,89804468
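The difference between the two numberings can also be sketched outside of XPath. The following is only an illustration in Python's standard xml.etree.ElementTree, not part of the Xidel solution; the tiny SAMPLE document, its <osm> wrapper element, and the function names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature input; the <osm> root is an assumption so that
# the fragment is a well-formed XML document.
SAMPLE = """<osm>
  <way id="1"><nd ref="10"/><nd ref="11"/><tag k="name" v="A Road"/></way>
  <way id="2"><nd ref="20"/><nd ref="21"/><tag k="name" v="B Street"/></way>
  <way id="3"><nd ref="30"/><tag k="highway" v="service"/></way>
</osm>"""


def named_ways(root):
    """Return the <way> elements that have a <tag k="name"> child."""
    return [w for w in root.iter("way") if w.find("tag[@k='name']") is not None]


def global_numbering(root):
    """Number all <nd> nodes of the named ways in one running sequence."""
    nds = [nd for w in named_ways(root) for nd in w.findall("nd")]
    return [(i, nd.get("ref")) for i, nd in enumerate(nds, start=1)]


def per_way_numbering(root):
    """Restart the numbering at 1 for every named <way>."""
    return [
        (i, nd.get("ref"))
        for w in named_ways(root)
        for i, nd in enumerate(w.findall("nd"), start=1)
    ]


root = ET.fromstring(SAMPLE)
print(global_numbering(root))
print(per_way_numbering(root))
```

The first function mirrors the query without parentheses, the second the query with them: the counter is created inside the loop over the ways rather than around it.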
Next you include the v attribute from the <tag k="name"> child node and the id attribute from the <way> element node. You're inside the <nd> child node however, so to include stuff 1 level higher you have to prepend ../:
$ ./xidel -s "so_49592301.xml" -e '"road,way_id,seq_num,node_ref_id"' -e '//way[tag[@k="name"]]/(nd/join((../tag[@k="name"]/@v,../@id,position(),@ref),","))'
road,way_id,seq_num,node_ref_id
Engineer Road,10459706,1,89705976
Engineer Road,10459706,2,89798118
Engineer Road,10459706,3,89798120
Engineer Road,10459706,4,89798122
Engineer Road,10459706,5,89798124
Engineer Road,10459706,6,89798126
Engineer Road,10459706,7,89798128
Engineer Road,10459706,8,89798130
5th Cutoff Street,10461171,1,89804458
5th Cutoff Street,10461171,2,89804460
5th Cutoff Street,10461171,3,89804463
5th Cutoff Street,10461171,4,89804464
5th Cutoff Street,10461171,5,89804466
5th Cutoff Street,10461171,6,89804468
And to make it more readable:
$ ./xidel -s "so_49592301.xml" \
>   -e '"road,way_id,seq_num,node_ref_id"' \
>   -e '//way[tag[@k="name"]]/(
>     nd/join(
>       (
>         ../tag[@k="name"]/@v,
>         ../@id,
>         position(),
>         @ref
>       ),
>       ","
>     )
>   )'
road,way_id,seq_num,node_ref_id
Engineer Road,10459706,1,89705976
Engineer Road,10459706,2,89798118
Engineer Road,10459706,3,89798120
Engineer Road,10459706,4,89798122
Engineer Road,10459706,5,89798124
Engineer Road,10459706,6,89798126
Engineer Road,10459706,7,89798128
Engineer Road,10459706,8,89798130
5th Cutoff Street,10461171,1,89804458
5th Cutoff Street,10461171,2,89804460
5th Cutoff Street,10461171,3,89804463
5th Cutoff Street,10461171,4,89804464
5th Cutoff Street,10461171,5,89804466
5th Cutoff Street,10461171,6,89804468
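For comparison, here is a rough Python equivalent of the whole query, using only the standard library. It is a sketch under a few assumptions: the input is wrapped in a single root element (called <osm> here), and the names way_rows and to_csv are hypothetical. It writes through the csv module so that a road name containing a comma gets quoted, which a plain string join does not do.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical input; the second way has a comma in its name on purpose.
SAMPLE = """<osm>
  <way id="10459706">
    <nd ref="89705976"/>
    <nd ref="89798118"/>
    <tag k="highway" v="residential"/>
    <tag k="name" v="Engineer Road"/>
  </way>
  <way id="2">
    <nd ref="20"/>
    <tag k="name" v="Main Street, North"/>
  </way>
  <way id="151860745">
    <nd ref="1647008614"/>
    <tag k="highway" v="service"/>
  </way>
</osm>"""


def way_rows(xml_text):
    """Yield [road, way_id, seq_num, node_ref_id] for every named way."""
    root = ET.fromstring(xml_text)
    for way in root.iter("way"):
        name_tag = way.find("tag[@k='name']")
        if name_tag is None:
            continue  # skip ways without a <tag k="name">
        for seq, nd in enumerate(way.findall("nd"), start=1):
            yield [name_tag.get("v"), way.get("id"), seq, nd.get("ref")]


def to_csv(xml_text):
    """Return the header plus all rows as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["road", "way_id", "seq_num", "node_ref_id"])
    writer.writerows(way_rows(xml_text))
    return buf.getvalue()


print(to_csv(SAMPLE), end="")
```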
Upvotes: 0
Reputation: 185530
Don't parse XML/HTML with awk, use a proper XML/HTML parser and a powerful xpath query.
According to compiler theory, XML/HTML can't be parsed with regular expressions, which are based on finite-state machines. Because XML/HTML is hierarchically constructed, you need a pushdown automaton and an LALR grammar, using a tool like YACC.
You can use one of the following:
xmllint, often installed by default with libxml2, xpath1 (check my wrapper to have newline-delimited output)
xmlstarlet, can edit, select, transform... Not installed by default, xpath1
xpath, installed via perl's module XML::XPath, xpath1
xidel, xpath3
saxon-lint, my own project, wrapper over @Michael Kay's Saxon-HE Java library, xpath3
python's lxml (from lxml import etree)
perl's XML::LibXML, XML::XPath, XML::Twig::XPath, HTML::TreeBuilder::XPath
ruby nokogiri, check this example
php DOMXpath, check this example
Check: Using regular expressions with HTML tags
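As a small illustration of why a text pattern is fragile here, the sketch below compares a regular expression with a parser on two spellings of the same data. It uses Python's standard xml.etree.ElementTree instead of lxml to stay dependency-free; the pattern, the sample strings, and the function names are assumptions made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# A pattern written against one particular textual layout of the tag.
NAME_RE = re.compile(r'<tag k="name" v="([^"]*)"/>')

# Two documents that carry exactly the same information for an XML parser:
# only quoting style, attribute order and whitespace differ.
SAME_DATA = [
    '<way id="1"><tag k="name" v="Engineer Road"/></way>',
    "<way id='1'><tag v='Engineer Road' k='name' /></way>",
]


def name_by_regex(xml_text):
    """Return the road name found by the text pattern, or None."""
    match = NAME_RE.search(xml_text)
    return match.group(1) if match else None


def name_by_parser(xml_text):
    """Return the road name found by parsing the XML, or None."""
    tag = ET.fromstring(xml_text).find("tag[@k='name']")
    return tag.get("v") if tag is not None else None


for doc in SAME_DATA:
    print(name_by_regex(doc), "|", name_by_parser(doc))
```

The pattern only matches the first spelling, while the parser returns the name for both, because it works on the element structure rather than on the characters.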
Use this, based on xmlstarlet (written before the OP replaced the XML with a broken one):
<way id="10459706" version="2" timestamp="2010-03-27T18:21:32Z" changeset="4247030" uid="20587" user="balrog-kun">
<nd ref="89705976"/>
<nd ref="89798118"/>
<nd ref="89798120"/>
<nd ref="89798122"/>
<nd ref="89798124"/>
<nd ref="89798126"/>
<nd ref="89798128"/>
<nd ref="89798130"/>
<tag k="highway" v="residential"/>
<tag k="name" v="Engineer Road"/>
<tag k="tiger:cfcc" v="A41"/>
<tag k="tiger:county" v="Livingston, CA"/>
<tag k="tiger:name_base" v="Engineer"/>
<tag k="tiger:name_type" v="Rd"/>
<tag k="tiger:reviewed" v="no"/>
<tag k="tiger:separated" v="no"/>
<tag k="tiger:source" v="tiger_import_dch_v0.6_20070809"/>
<tag k="tiger:tlid" v="196844016"/>
</way>
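#!/bin/bash
# Read the road name and the way id (assumes the file holds a single <way>).
IFS='|' read -r title id < <(
    xmlstarlet sel -t -v '//tag[@k="name"]/@v' -o "|" -v '//way/@id' file
)
# Number each <nd> reference; the extra test keeps a final line that has no trailing newline.
xmlstarlet sel -t -v '//nd/@ref' file | while IFS= read -r line || [[ -n $line ]]; do
    echo "$title,$id,$((++c)),$line"
done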
Engineer Road,10459706,1,89705976
Engineer Road,10459706,2,89798118
Engineer Road,10459706,3,89798120
Engineer Road,10459706,4,89798122
Engineer Road,10459706,5,89798124
Engineer Road,10459706,6,89798126
Engineer Road,10459706,7,89798128
Upvotes: 2