Reputation: 1
Here is my problem: I have an XML file that is too large (60 MB) to handle as a whole, so I'd like to extract only the content of one specific element.
My XML looks like this:
<PrimaryCategory>
<PrimaryCategoryID>3</PrimaryCategoryID>
<PrimaryCategoryName>Billets de concert</PrimaryCategoryName>
<PrimaryCategoryURL>http://www.viagogo.fr/Billets-de-concert</PrimaryCategoryURL>
<CategoryList>
<CategoryID>13632</CategoryID>
<CategoryName>Ron Sexsmith</CategoryName>
<CategoryURL>http://www.viagogo.fr/Billets-de-concert/Pop-Rock/Ron-Sexsmith-Billets</CategoryURL>
<CategoryImageURL>http://cdn1.viagogo.net/img/cat/1207/2/1.jpg</CategoryImageURL>
<CategoryDescription />
</CategoryList>
<CategoryList>
<CategoryID>27605</CategoryID>
<CategoryName>Theme Park</CategoryName>
<CategoryURL>http://www.blalbalbla.com</CategoryURL>
<CategoryImageURL>http://www.blalbalbla.com</CategoryImageURL>
<CategoryDescription />
</CategoryList>
<CategoryList>
<CategoryID>21935</CategoryID>
<CategoryName>Idina Menzel</CategoryName>
<CategoryURL>http://www.blalbalbla.com</CategoryURL>
<CategoryImageURL>http://www.blalbalbla.com</CategoryImageURL>
<CategoryDescription />
<EventList>
<EventID>740520</EventID>
<EventName>Idina Menzel</EventName>
<EventDate>2015-06-26T20:00:00</EventDate>
<EventURL>http://www.blalbalbla.com</EventURL>
<VenueID>175</VenueID>
<VenueName>Bournemouth International Centre (BIC)</VenueName>
<VenueAddress>Exeter Road</VenueAddress>
<VenueCity>Bournemouth</VenueCity>
<VenueState />
<VenueCountryCode>GB</VenueCountryCode>
<VenuePostCode>BH2 5BH</VenuePostCode>
<MinCurrentPrice>90.4500</MinCurrentPrice>
<MaxCurrentPrice>213.0700</MaxCurrentPrice>
<AvailableTickets>14</AvailableTickets>
<OnSaleDate>2014-12-03T18:24:00</OnSaleDate>
</EventList>
<PrimaryCategory>
<PrimaryCategoryID>2</PrimaryCategoryID>
<PrimaryCategoryName>concert</PrimaryCategoryName>
<PrimaryCategoryURL>http://www.blalbalbla.com</PrimaryCategoryURL>
<CategoryList>
<CategoryID>13632</CategoryID>
<CategoryName>Ron Sexsmith</CategoryName>
<CategoryURL>http://www.blalbalbla.com</CategoryURL>
<CategoryImageURL>http://www.blalbalbla.com</CategoryImageURL>
<CategoryDescription />
</CategoryList>
<CategoryList>
<CategoryID>25605</CategoryID>
<CategoryName>blablabal</CategoryName>
<CategoryURL>http://www.blalbalbla.coms</CategoryURL>
<CategoryImageURL>http://www.blalbalbla.com</CategoryImageURL>
<CategoryDescription />
</CategoryList>
<CategoryList>
<CategoryID>21935</CategoryID>
<CategoryName>Idina Menzel</CategoryName>
<CategoryURL>hhttp://www.blalbalbla.com</CategoryURL>
<CategoryImageURL>http://www.blalbalbla.com</CategoryImageURL>
<CategoryDescription />
<EventList>
<EventID>749820</EventID>
<EventName>Idina Menzel</EventName>
<EventDate>2015-06-26T20:00:00</EventDate>
<EventURL>http://www.blalbalbla.com0</EventURL>
<VenueID>175</VenueID>
<VenueName>Bournemouth International Centre (BIC)</VenueName>
<VenueAddress>Exeter Road</VenueAddress>
<VenueCity>Bournemouth</VenueCity>
<VenueState />
<VenueCountryCode>GB</VenueCountryCode>
<VenuePostCode>BH2 5BH</VenuePostCode>
<MinCurrentPrice>90.4500</MinCurrentPrice>
<MaxCurrentPrice>213.0700</MaxCurrentPrice>
<AvailableTickets>14</AvailableTickets>
<OnSaleDate>2014-12-03T18:24:00</OnSaleDate>
</EventList>
</CategoryList>
</PrimaryCategory>
So I want to retrieve the elements inside the PrimaryCategory whose PrimaryCategoryID is 3.
Upvotes: 0
Views: 90
Reputation: 41460
This should work with GNU awk (required because the record separator, RS, is more than one character long):
awk -v RS="<PrimaryCategory>" '{split($1,a,"<|>")} a[3]==3 {print RS $0}' file
<PrimaryCategory>
<PrimaryCategoryID>3</PrimaryCategoryID>
<PrimaryCategoryName>Billets de concert</PrimaryCategoryName>
<PrimaryCategoryURL>http://www.viagogo.fr/Billets-de-concert</PrimaryCategoryURL>
<CategoryList>
<CategoryID>13632</CategoryID>
<CategoryName>Ron Sexsmith</CategoryName>
<CategoryURL>http://www.viagogo.fr/Billets-de-concert/Pop-Rock/Ron-Sexsmith-Billets</CategoryURL>
<CategoryImageURL>http://cdn1.viagogo.net/img/cat/1207/2/1.jpg</CategoryImageURL>
<CategoryDescription />
</CategoryList>
<CategoryList>
<CategoryID>27605</CategoryID>
<CategoryName>Theme Park</CategoryName>
<CategoryURL>http://www.blalbalbla.com</CategoryURL>
<CategoryImageURL>http://www.blalbalbla.com</CategoryImageURL>
<CategoryDescription />
</CategoryList>
<CategoryList>
<CategoryID>21935</CategoryID>
<CategoryName>Idina Menzel</CategoryName>
<CategoryURL>http://www.blalbalbla.com</CategoryURL>
<CategoryImageURL>http://www.blalbalbla.com</CategoryImageURL>
<CategoryDescription />
<EventList>
<EventID>740520</EventID>
<EventName>Idina Menzel</EventName>
<EventDate>2015-06-26T20:00:00</EventDate>
<EventURL>http://www.blalbalbla.com</EventURL>
<VenueID>175</VenueID>
<VenueName>Bournemouth International Centre (BIC)</VenueName>
<VenueAddress>Exeter Road</VenueAddress>
<VenueCity>Bournemouth</VenueCity>
<VenueState />
<VenueCountryCode>GB</VenueCountryCode>
<VenuePostCode>BH2 5BH</VenuePostCode>
<MinCurrentPrice>90.4500</MinCurrentPrice>
<MaxCurrentPrice>213.0700</MaxCurrentPrice>
<AvailableTickets>14</AvailableTickets>
<OnSaleDate>2014-12-03T18:24:00</OnSaleDate>
</EventList>
It splits the file into records using <PrimaryCategory> as the record separator. Then, if the value in the first field (the PrimaryCategoryID element) is 3, it prints the separator followed by the rest of the record.
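To make the record logic concrete, here is a rough Python sketch of the same idea. The function name select_records is made up for illustration, and it compares the ID as a string rather than numerically as awk does. Like the awk one-liner, it is purely textual: it assumes the ID element is the first token after each <PrimaryCategory> and it does not parse the XML.

```python
import re

SEPARATOR = "<PrimaryCategory>"


def select_records(text, wanted_id):
    """Yield each chunk following SEPARATOR whose first field holds wanted_id."""
    for record in text.split(SEPARATOR):
        fields = record.split()  # whitespace-separated, like awk's $1, $2, ...
        if not fields:
            continue  # the empty record before the first separator
        # '<PrimaryCategoryID>3</PrimaryCategoryID>' splits into
        # ['', 'PrimaryCategoryID', '3', '/PrimaryCategoryID', '']
        parts = re.split(r"[<>]", fields[0])
        if len(parts) > 2 and parts[2] == str(wanted_id):
            yield SEPARATOR + record
```

Calling list(select_records(open("file").read(), 3)) would read the whole file into memory, which should still be manageable at 60 MB.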
Upvotes: 0
Reputation: 44063
Do not use line-based tools to handle XML; they will not work reliably. Nobody expects XML-handling code to break when whitespace is shifted around, line breaks are inserted, or tags appear in a different order.
Instead, use a tool that parses XML properly and select with XPath. For example, with xmllint:
xmllint --xpath '//PrimaryCategory[PrimaryCategoryID=3]' filename.xml
or with xmlstarlet:
xmlstarlet sel -t -c '//PrimaryCategory[PrimaryCategoryID=3]' filename.xml
Note that this expects your input to be well-formed XML, which the snippet in your question is not (there are missing closing tags). I am assuming that this is a copy/paste mistake.
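If a script is more convenient than a command-line tool, the same selection can be sketched in Python with the standard library's streaming parser, which avoids building the whole 60 MB tree at once. This is only a sketch under assumptions: the function name find_primary_category is hypothetical, the PrimaryCategory elements are assumed to sit under a single root element (called Root in the demo, since the question does not show the real one), and the input must be well-formed.

```python
import io
import xml.etree.ElementTree as ET


def find_primary_category(source, wanted_id):
    """Return the first PrimaryCategory whose PrimaryCategoryID equals
    wanted_id, serialized as a string, or None if there is no match.

    source may be a file name or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "PrimaryCategory":
            continue
        if elem.findtext("PrimaryCategoryID", "").strip() == str(wanted_id):
            return ET.tostring(elem, encoding="unicode")
        elem.clear()  # discard the content of categories we do not need
    return None


if __name__ == "__main__":
    # Tiny in-memory stand-in for the real file; "Root" is an assumed name.
    demo = (
        b"<Root>"
        b"<PrimaryCategory><PrimaryCategoryID>2</PrimaryCategoryID></PrimaryCategory>"
        b"<PrimaryCategory><PrimaryCategoryID>3</PrimaryCategoryID></PrimaryCategory>"
        b"</Root>"
    )
    print(find_primary_category(io.BytesIO(demo), 3))
```

For the real data, find_primary_category("filename.xml", 3) would be the equivalent of the XPath expression above, except that it stops at the first match.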
Upvotes: 2
Reputation: 6702
Try this. It retrieves the value of each PrimaryCategoryID element from your XML file, assuming each element sits on its own line:
grep -oP '(?<=>).*?(?=</PrimaryCategoryID>)' data.xml
For the sample in the question, which contains two PrimaryCategoryID elements, the output will be:
3
2
Upvotes: 0