Reputation: 1

Retrieving the text between a specific XML tag using AWK

Here is my problem: I have an XML file that is too large (60 MB) to work with as a whole, so I'd like to extract only the content of one specific tag.

My XML looks like this:

<PrimaryCategory>
    <PrimaryCategoryID>3</PrimaryCategoryID>
    <PrimaryCategoryName>Billets de concert</PrimaryCategoryName>
    <PrimaryCategoryURL>http://www.viagogo.fr/Billets-de-concert</PrimaryCategoryURL>
    <CategoryList>
      <CategoryID>13632</CategoryID>
      <CategoryName>Ron Sexsmith</CategoryName>
      <CategoryURL>http://www.viagogo.fr/Billets-de-concert/Pop-Rock/Ron-Sexsmith-Billets</CategoryURL>
      <CategoryImageURL>http://cdn1.viagogo.net/img/cat/1207/2/1.jpg</CategoryImageURL>
      <CategoryDescription />
    </CategoryList>
    <CategoryList>
      <CategoryID>27605</CategoryID>
      <CategoryName>Theme Park</CategoryName>
      <CategoryURL>http://www.blalbalbla.com</CategoryURL>
      <CategoryImageURL>http://www.blalbalbla.com</CategoryImageURL>
      <CategoryDescription />
    </CategoryList>
    <CategoryList>
      <CategoryID>21935</CategoryID>
      <CategoryName>Idina Menzel</CategoryName>
      <CategoryURL>http://www.blalbalbla.com</CategoryURL>
      <CategoryImageURL>http://www.blalbalbla.com</CategoryImageURL>
      <CategoryDescription />
      <EventList>
        <EventID>740520</EventID>
        <EventName>Idina Menzel</EventName>
        <EventDate>2015-06-26T20:00:00</EventDate>
        <EventURL>http://www.blalbalbla.com</EventURL>
        <VenueID>175</VenueID>
        <VenueName>Bournemouth International Centre (BIC)</VenueName>
        <VenueAddress>Exeter Road</VenueAddress>
        <VenueCity>Bournemouth</VenueCity>
        <VenueState />
        <VenueCountryCode>GB</VenueCountryCode>
        <VenuePostCode>BH2 5BH</VenuePostCode>
        <MinCurrentPrice>90.4500</MinCurrentPrice>
        <MaxCurrentPrice>213.0700</MaxCurrentPrice>
        <AvailableTickets>14</AvailableTickets>
        <OnSaleDate>2014-12-03T18:24:00</OnSaleDate>
      </EventList>
<PrimaryCategory>
    <PrimaryCategoryID>2</PrimaryCategoryID>
    <PrimaryCategoryName>concert</PrimaryCategoryName>
    <PrimaryCategoryURL>http://www.blalbalbla.com</PrimaryCategoryURL>
    <CategoryList>
      <CategoryID>13632</CategoryID>
      <CategoryName>Ron Sexsmith</CategoryName>
      <CategoryURL>http://www.blalbalbla.com</CategoryURL>
      <CategoryImageURL>http://www.blalbalbla.com</CategoryImageURL>
      <CategoryDescription />
    </CategoryList>
    <CategoryList>
      <CategoryID>25605</CategoryID>
      <CategoryName>blablabal</CategoryName>
      <CategoryURL>http://www.blalbalbla.coms</CategoryURL>
      <CategoryImageURL>http://www.blalbalbla.com</CategoryImageURL>
      <CategoryDescription />
    </CategoryList>
    <CategoryList>
      <CategoryID>21935</CategoryID>
      <CategoryName>Idina Menzel</CategoryName>
      <CategoryURL>hhttp://www.blalbalbla.com</CategoryURL>
      <CategoryImageURL>http://www.blalbalbla.com</CategoryImageURL>
      <CategoryDescription />
      <EventList>
        <EventID>749820</EventID>
        <EventName>Idina Menzel</EventName>
        <EventDate>2015-06-26T20:00:00</EventDate>
        <EventURL>http://www.blalbalbla.com0</EventURL>
        <VenueID>175</VenueID>
        <VenueName>Bournemouth International Centre (BIC)</VenueName>
        <VenueAddress>Exeter Road</VenueAddress>
        <VenueCity>Bournemouth</VenueCity>
        <VenueState />
        <VenueCountryCode>GB</VenueCountryCode>
        <VenuePostCode>BH2 5BH</VenuePostCode>
        <MinCurrentPrice>90.4500</MinCurrentPrice>
        <MaxCurrentPrice>213.0700</MaxCurrentPrice>
        <AvailableTickets>14</AvailableTickets>
        <OnSaleDate>2014-12-03T18:24:00</OnSaleDate>
      </EventList>
    </CategoryList>
</PrimaryCategory>

I would like to retrieve everything inside the PrimaryCategory element whose PrimaryCategoryID is 3.

Upvotes: 0

Answers (3)

Jotne

Reputation: 41460

This should work with GNU awk (required because the record separator, RS, is more than one character long):

awk -v RS="<PrimaryCategory>" '{split($1,a,"<|>")} a[3]==3 {print RS $0}' file
<PrimaryCategory>
    <PrimaryCategoryID>3</PrimaryCategoryID>
    <PrimaryCategoryName>Billets de concert</PrimaryCategoryName>
    <PrimaryCategoryURL>http://www.viagogo.fr/Billets-de-concert</PrimaryCategoryURL>
    <CategoryList>
      <CategoryID>13632</CategoryID>
      <CategoryName>Ron Sexsmith</CategoryName>
      <CategoryURL>http://www.viagogo.fr/Billets-de-concert/Pop-Rock/Ron-Sexsmith-Billets</CategoryURL>
      <CategoryImageURL>http://cdn1.viagogo.net/img/cat/1207/2/1.jpg</CategoryImageURL>
      <CategoryDescription />
    </CategoryList>
    <CategoryList>
      <CategoryID>27605</CategoryID>
      <CategoryName>Theme Park</CategoryName>
      <CategoryURL>http://www.blalbalbla.com</CategoryURL>
      <CategoryImageURL>http://www.blalbalbla.com</CategoryImageURL>
      <CategoryDescription />
    </CategoryList>
    <CategoryList>
      <CategoryID>21935</CategoryID>
      <CategoryName>Idina Menzel</CategoryName>
      <CategoryURL>http://www.blalbalbla.com</CategoryURL>
      <CategoryImageURL>http://www.blalbalbla.com</CategoryImageURL>
      <CategoryDescription />
      <EventList>
        <EventID>740520</EventID>
        <EventName>Idina Menzel</EventName>
        <EventDate>2015-06-26T20:00:00</EventDate>
        <EventURL>http://www.blalbalbla.com</EventURL>
        <VenueID>175</VenueID>
        <VenueName>Bournemouth International Centre (BIC)</VenueName>
        <VenueAddress>Exeter Road</VenueAddress>
        <VenueCity>Bournemouth</VenueCity>
        <VenueState />
        <VenueCountryCode>GB</VenueCountryCode>
        <VenuePostCode>BH2 5BH</VenuePostCode>
        <MinCurrentPrice>90.4500</MinCurrentPrice>
        <MaxCurrentPrice>213.0700</MaxCurrentPrice>
        <AvailableTickets>14</AvailableTickets>
        <OnSaleDate>2014-12-03T18:24:00</OnSaleDate>
      </EventList>

It splits the file into records, using <PrimaryCategory> as the record separator.
Then, if the first field of a record contains the number 3, it prints the separator followed by the rest of that record.
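If GNU awk is not available, the same idea can be sketched with plain POSIX awk by buffering one block at a time instead of relying on a multi-character record separator. This is only a rough sketch: the function name extract_category is made up for illustration, and it assumes each tag sits on its own line, as in the sample from the question. It also treats a new <PrimaryCategory> line as the end of the previous block, since the sample is missing a closing tag.

```shell
# Hypothetical helper: print the <PrimaryCategory> block whose ID equals $1.
# Assumption: every tag is on its own line, as in the sample from the question.
extract_category() {
    awk -v id="$1" '
        # Print the buffered block if its ID matched, then reset the state.
        function emit() { if (keep) printf "%s", buf; buf = ""; keep = 0; inblock = 0 }

        /<PrimaryCategory>/ { emit(); inblock = 1 }   # a new block starts here
        inblock { buf = buf $0 "\n" }                 # collect the lines of the current block
        inblock && /<PrimaryCategoryID>/ {
            v = $0
            gsub(/[^0-9]/, "", v)                     # keep only the digits of the ID line
            keep = (v + 0 == id + 0)                  # numeric comparison
        }
        /<\/PrimaryCategory>/ { emit() }              # the block is properly closed
        END { emit() }                                # the block was cut off at end of file
    ' "$2"
}
```

It would be called as `extract_category 3 file`. Only one block is held in memory at a time, so the size of the whole file should not matter much.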

Upvotes: 0

Wintermute

Reputation: 44063

Do not use line-based tools to handle XML; they will not work reliably. Nobody expects XML-handling code to break when whitespace is shifted around, line breaks are inserted, or tags appear in a different order.

Instead, use a tool that parses XML properly and select with XPath. For example, with xmllint:

xmllint --xpath '//PrimaryCategory[PrimaryCategoryID=3]' filename.xml

or with xmlstarlet:

xmlstarlet sel -t -c '//PrimaryCategory[PrimaryCategoryID=3]' filename.xml

Note that this expects your input to be valid XML, which the snippet in your question is not (there are missing closing tags). I am working under the assumption that this is a copy/paste mistake.
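If only a few values are needed rather than the whole subtree, the same XPath predicate can be combined with an xmlstarlet template. The sketch below lists the CategoryName of every CategoryList under the selected PrimaryCategory. The function name list_category_names is made up for illustration, and it assumes that xmlstarlet is installed under that name, that the file is well-formed, and that the ID passed in is a plain number, since it is pasted into the XPath expression as-is.

```shell
# Hypothetical helper: list the CategoryName values under one PrimaryCategory.
# Assumptions: xmlstarlet is installed, the file is well-formed XML,
# and $1 is a plain number (it is inserted into the XPath unescaped).
list_category_names() {
    xmlstarlet sel -t \
        -m "//PrimaryCategory[PrimaryCategoryID=$1]/CategoryList" \
        -v 'CategoryName' -n \
        "$2"
}
```

It would be called as `list_category_names 3 filename.xml`; here -m iterates over the matching nodes, -v prints a value relative to each match, and -n adds a newline after it.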

Upvotes: 2

Arnab Nandy

Reputation: 6702

Try this. It retrieves the value of each PrimaryCategoryID element from your XML file:

grep -oP '(?<=>).*?(?=</PrimaryCategoryID>)' data.xml

For the sample in the question, this prints 3 and 2, one value per line.
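The -P option is specific to GNU grep, so here is a rough portable alternative using POSIX sed. It is only a sketch: the function name list_category_ids is made up for illustration, and it assumes each PrimaryCategoryID element sits on a single line, as in the sample.

```shell
# Hypothetical helper: print the text of every PrimaryCategoryID element.
# Assumption: opening and closing tags are on the same line.
list_category_ids() {
    # ":" is used as the delimiter so the "/" of the closing tag needs no escaping.
    sed -n 's:.*<PrimaryCategoryID>\([^<]*\)</PrimaryCategoryID>.*:\1:p' "$1"
}
```

It would be called as `list_category_ids data.xml`. Like the grep version, it only lists the IDs and does not return the contents of the matching block.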

Upvotes: 0
