Reputation: 413

AWK - How would I modify my AWK script to ignore the lines within a file that do not contain a matching pattern?

I have a text file with data in the following format. This is only a sample of it; the file itself is correct and in the correct format:

  <node id="1647008557" lat="36.6536840" lon="-121.7938995" version="1" timestamp="2012-02-25T14:03:54Z" changeset="10787766" uid="294728" user="skew-t">
    <tag k="highway" v="turning_circle"/>
  </node>
  <way id="10459706" version="2" timestamp="2010-03-27T18:21:32Z" changeset="4247030" uid="20587" user="balrog-kun">
    <nd ref="89705976"/>
    <nd ref="89798118"/>
    <nd ref="89798120"/>
    <nd ref="89798122"/>
    <nd ref="89798124"/>
    <nd ref="89798126"/>
    <nd ref="89798128"/>
    <nd ref="89798130"/>
    <tag k="highway" v="residential"/>
    <tag k="name" v="Engineer Road"/>
    <tag k="tiger:cfcc" v="A41"/>
    <tag k="tiger:county" v="Livingston, CA"/>
    <tag k="tiger:name_base" v="Engineer"/>
    <tag k="tiger:name_type" v="Rd"/>
    <tag k="tiger:reviewed" v="no"/>
    <tag k="tiger:separated" v="no"/>
    <tag k="tiger:source" v="tiger_import_dch_v0.6_20070809"/>
    <tag k="tiger:tlid" v="196844016"/>
  </way>
  <way id="10461171" version="3" timestamp="2014-01-07T00:17:59Z" changeset="19855176" uid="1871178" user="RBoggs">
    <nd ref="89804458"/>
    <nd ref="89804460"/>
    <nd ref="89804463"/>
    <nd ref="89804464"/>
    <nd ref="89804466"/>
    <nd ref="89804468"/>
    <tag k="access" v="no"/>
    <tag k="highway" v="residential"/>
    <tag k="motor_vehicle" v="no"/>
    <tag k="name" v="5th Cutoff Street"/>
    <tag k="tiger:cfcc" v="A41"/>
    <tag k="tiger:county" v="Marysville, CA"/>
    <tag k="tiger:name_base" v="5th Cutoff"/>
    <tag k="tiger:name_type" v="St"/>
    <tag k="tiger:reviewed" v="no"/>
  </way>
  <way id="151860745" version="1" timestamp="2012-02-25T14:03:59Z" changeset="10787766" uid="294728" user="skew-t">
    <nd ref="1647008614"/>
    <nd ref="1647008545"/>
    <nd ref="1647008605"/>
    <nd ref="1647008555"/>
    <nd ref="1647008557"/>
    <tag k="highway" v="service"/>
  </way>

I am trying to print the name found in each way section along with the way id, the sequence number of each nd ref, and the nd ref id.

The correct output looks like this:

$ awk -f table.awk file.txt | head
road,way_id,seq_num,node_ref_id
Engineer Road,10459706,1,89705976
Engineer Road,10459706,2,89798118
Engineer Road,10459706,3,89798120
Engineer Road,10459706,4,89798122
Engineer Road,10459706,5,89798124
Engineer Road,10459706,6,89798126
Engineer Road,10459706,7,89798128
Engineer Road,10459706,8,89798130
5th Cutoff Street,10461171,1,89804458
5th Cutoff Street,10461171,2,89804460
5th Cutoff Street,10461171,3,89804463
5th Cutoff Street,10461171,4,89804464
5th Cutoff Street,10461171,5,89804466
5th Cutoff Street,10461171,6,89804468

How would I print that output while ignoring every <way> section that does not contain a <tag k="name" .../> line?

Upvotes: 1

Answers (2)

Reino

Reputation: 3443

"Gilles Quenot" already told you to use a proper XML/HTML parser and he mentions Xidel is one of them.
I've saved your XML file as 'so_49592301.xml'.

The legend, as a string, is easy:

$ ./xidel -s so_49592301.xml -e '"road,way_id,seq_num,node_ref_id"'

Next you select the <way> element node, but only those that hold a <tag> child node with the attribute k="name":

$ ./xidel -s so_49592301.xml -e '"road,way_id,seq_num,node_ref_id"' -e '//way[tag[@k="name"]]'

Next you select the <nd> child node and do a string join on the index and the ref attribute, with a comma as separator:

$ ./xidel -s so_49592301.xml -e '"road,way_id,seq_num,node_ref_id"' -e '//way[tag[@k="name"]]/nd/join((position(),@ref),",")'
road,way_id,seq_num,node_ref_id
1,89705976
2,89798118
3,89798120
4,89798122
5,89798124
6,89798126
7,89798128
8,89798130
9,89804458
10,89804460
11,89804463
12,89804464
13,89804466
14,89804468

Notice that the index doesn't start over with the next <way> element node? This is easily fixed by putting nd/... in parentheses:

$ ./xidel -s so_49592301.xml -e '"road,way_id,seq_num,node_ref_id"' -e '//way[tag[@k="name"]]/(nd/join((position(),@ref),","))'
road,way_id,seq_num,node_ref_id
1,89705976
2,89798118
3,89798120
4,89798122
5,89798124
6,89798126
7,89798128
8,89798130
1,89804458
2,89804460
3,89804463
4,89804464
5,89804466
6,89804468
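
The difference between the two numberings can also be sketched outside of Xidel. The following is only an illustration in Python's standard library, not what Xidel does internally; the trimmed two-way sample and the <osm> wrapper element are assumptions made to keep it self-contained. One counter over all <nd> nodes keeps growing, while one counter per <way> restarts, which is roughly what the parentheses achieve in the XPath above:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, trimmed from the question's data and wrapped in an
# assumed <osm> root so that it parses as a single document.
SAMPLE = """<osm>
  <way id="10459706">
    <nd ref="89705976"/>
    <nd ref="89798118"/>
    <tag k="name" v="Engineer Road"/>
  </way>
  <way id="10461171">
    <nd ref="89804458"/>
    <nd ref="89804460"/>
    <tag k="name" v="5th Cutoff Street"/>
  </way>
</osm>"""

root = ET.fromstring(SAMPLE)

# One counter over every <nd> in the document: the index keeps growing.
global_index = [
    (i, nd.get("ref")) for i, nd in enumerate(root.iter("nd"), start=1)
]

# One counter per <way>: the index restarts for each parent element.
per_way_index = [
    (i, nd.get("ref"))
    for way in root.iter("way")
    for i, nd in enumerate(way.findall("nd"), start=1)
]

print(global_index)
print(per_way_index)
```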

Next you include the v attribute from the <tag k="name"> child node and the id attribute from the <way> element node. You're inside the <nd> child node at that point, however, so to reach anything one level higher you have to prepend ../:

$ ./xidel -s "so_49592301.xml" -e '"road,way_id,seq_num,node_ref_id"' -e '//way[tag[@k="name"]]/(nd/join((../tag[@k="name"]/@v,../@id,position(),@ref),","))'
road,way_id,seq_num,node_ref_id
Engineer Road,10459706,1,89705976
Engineer Road,10459706,2,89798118
Engineer Road,10459706,3,89798120
Engineer Road,10459706,4,89798122
Engineer Road,10459706,5,89798124
Engineer Road,10459706,6,89798126
Engineer Road,10459706,7,89798128
Engineer Road,10459706,8,89798130
5th Cutoff Street,10461171,1,89804458
5th Cutoff Street,10461171,2,89804460
5th Cutoff Street,10461171,3,89804463
5th Cutoff Street,10461171,4,89804464
5th Cutoff Street,10461171,5,89804466
5th Cutoff Street,10461171,6,89804468

And to make it more readable:

$ ./xidel -s "so_49592301.xml" \
> -e '"road,way_id,seq_num,node_ref_id"' \
> -e '//way[tag[@k="name"]]/(
>       nd/join(
>         (
>           ../tag[@k="name"]/@v,
>           ../@id,
>           position(),
>           @ref
>         ),
>         ","
>       )
>     )'
road,way_id,seq_num,node_ref_id
Engineer Road,10459706,1,89705976
Engineer Road,10459706,2,89798118
Engineer Road,10459706,3,89798120
Engineer Road,10459706,4,89798122
Engineer Road,10459706,5,89798124
Engineer Road,10459706,6,89798126
Engineer Road,10459706,7,89798128
Engineer Road,10459706,8,89798130
5th Cutoff Street,10461171,1,89804458
5th Cutoff Street,10461171,2,89804460
5th Cutoff Street,10461171,3,89804463
5th Cutoff Street,10461171,4,89804464
5th Cutoff Street,10461171,5,89804466
5th Cutoff Street,10461171,6,89804468
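
If Xidel is not at hand, the same selection can be approximated with Python's standard library. This is a minimal sketch under a few assumptions: the sample is a shortened version of the question's data, it is wrapped in an assumed <osm> root element, and way_rows is a hypothetical helper name. The csv module is used instead of a plain join so that a road name containing a comma would be quoted properly:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample: one node, one named way, one unnamed way,
# wrapped in an assumed <osm> root element.
SAMPLE = """<osm>
  <node id="1647008557" lat="36.6536840" lon="-121.7938995">
    <tag k="highway" v="turning_circle"/>
  </node>
  <way id="10459706">
    <nd ref="89705976"/>
    <nd ref="89798118"/>
    <tag k="highway" v="residential"/>
    <tag k="name" v="Engineer Road"/>
  </way>
  <way id="151860745">
    <nd ref="1647008614"/>
    <tag k="highway" v="service"/>
  </way>
</osm>"""


def way_rows(root):
    """Yield (road, way_id, seq_num, node_ref_id) for every named way."""
    for way in root.iter("way"):
        name_tag = way.find("tag[@k='name']")
        if name_tag is None:
            continue  # skip ways that have no <tag k="name">
        # The sequence number restarts at 1 for each way.
        for seq, nd in enumerate(way.findall("nd"), start=1):
            yield name_tag.get("v"), way.get("id"), seq, nd.get("ref")


root = ET.fromstring(SAMPLE)
rows = list(way_rows(root))

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["road", "way_id", "seq_num", "node_ref_id"])
writer.writerows(rows)
print(out.getvalue(), end="")
```

The <node> element and the unnamed service way produce no rows, because only <way> elements with a name tag are visited.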

Upvotes: 0

Gilles Quénot

Reputation: 185530

Don't parse XML/HTML with awk; use a proper XML/HTML parser and a powerful xpath query.

theory :

According to compiler theory, XML/HTML can't be parsed with regular expressions, which are based on finite state machines. Because XML/HTML is built hierarchically, you need a pushdown automaton and an LALR grammar handled by a tool like YACC.
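
To make the point about the pushdown automaton concrete, here is a small Python sketch. It is only an illustration, not a real parser: the TAG pattern and the is_balanced helper are hypothetical names, and the pattern assumes there is no ">" inside attribute values and no comments or CDATA sections. The regex can recognise individual tags, but it takes a stack to check that they nest correctly:

```python
import re

# Matches an opening, closing, or self-closing tag and captures its name.
# Simplification: assumes no '>' inside attribute values, no comments, no CDATA.
TAG = re.compile(r"<(/?)([A-Za-z_][\w:.-]*)[^>]*?(/?)>")


def is_balanced(text):
    """Return True if every opening tag is closed in the right order."""
    stack = []
    for closing, name, self_closing in TAG.findall(text):
        if self_closing:
            continue  # <nd .../> opens and closes at once
        if closing:
            # A closing tag must match the most recently opened tag.
            if not stack or stack.pop() != name:
                return False
        else:
            stack.append(name)
    return not stack  # anything left on the stack was never closed


print(is_balanced('<way id="1"><nd ref="2"/><tag k="name" v="x"/></way>'))
print(is_balanced("<way><nd></way></nd>"))
```

The stack is the part a finite state machine cannot provide, since the nesting depth is unbounded.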

realLife©®™ everyday tool in a shell :

You can use one of the following :

xmllint often installed by default with libxml2, xpath1 (check my wrapper to have newline-delimited output)

xmlstarlet can edit, select, transform... Not installed by default, xpath1

xpath installed via perl's module XML::XPath, xpath1

xidel xpath3

saxon-lint my own project, wrapper over @Michael Kay's Saxon-HE Java library, xpath3

or you can use high level languages and proper libs, I think of :

python's lxml (from lxml import etree)

perl's XML::LibXML, XML::XPath, XML::Twig::XPath, HTML::TreeBuilder::XPath

ruby nokogiri, check this example

php DOMXpath, check this example

Check: Using regular expressions with HTML tags

Example using xpath :

Use this, based on xmlstarlet :

File :

(before OP changed XML for a broken one)

  <way id="10459706" version="2" timestamp="2010-03-27T18:21:32Z" changeset="4247030" uid="20587" user="balrog-kun">
    <nd ref="89705976"/>
    <nd ref="89798118"/>
    <nd ref="89798120"/>
    <nd ref="89798122"/>
    <nd ref="89798124"/>
    <nd ref="89798126"/>
    <nd ref="89798128"/>
    <nd ref="89798130"/>
    <tag k="highway" v="residential"/>
    <tag k="name" v="Engineer Road"/>
    <tag k="tiger:cfcc" v="A41"/>
    <tag k="tiger:county" v="Livingston, CA"/>
    <tag k="tiger:name_base" v="Engineer"/>
    <tag k="tiger:name_type" v="Rd"/>
    <tag k="tiger:reviewed" v="no"/>
    <tag k="tiger:separated" v="no"/>
    <tag k="tiger:source" v="tiger_import_dch_v0.6_20070809"/>
    <tag k="tiger:tlid" v="196844016"/>
  </way>

Code :

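#!/bin/bash

# Read the road name and the way id in one go, separated by "|".
IFS='|' read -r title id < <(
    xmlstarlet sel -t -v '//tag[@k="name"]/@v' -o "|" -v '//way/@id' file
)
# Number every nd ref. The `|| [[ -n $line ]]` part keeps the last ref
# even when the output does not end with a newline.
c=0
xmlstarlet sel -t -v '//nd/@ref' file | while IFS= read -r line || [[ -n $line ]]; do
    echo "$title,$id,$((++c)),$line"
done

Output :

Engineer Road,10459706,1,89705976
Engineer Road,10459706,2,89798118
Engineer Road,10459706,3,89798120
Engineer Road,10459706,4,89798122
Engineer Road,10459706,5,89798124
Engineer Road,10459706,6,89798126
Engineer Road,10459706,7,89798128
Engineer Road,10459706,8,89798130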

Upvotes: 2