Reputation: 221
Here's my problem: I'm trying to parse an XML feed and extract two fields (title and link). That part works fine. How can I remove all the HTML tags and save the result in CSV format, e.g.:
title,link
title,link
title,link
#!/bin/sh
url="http://www.buzzfeed.com/usnews.xml"
curl --silent "$url" | grep -E '(title>|link>)' >> output
Upvotes: 1
Views: 609
Reputation: 246837
Use an XML parser to parse XML. I assume you want the title and link for the feed items, not for the feed itself.
curl --silent "$url" |
xmlstarlet sel -t -m '/rss/channel/item' -v 'title' -n -v 'link' -n |
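awk '{
title = $0
gsub(/"/, "&&", title)   # double any double quotes (CSV quote escaping)
getline                  # read the next line (the link) into $0
gsub(/"/, "&&")          # escape quotes in the link as well
printf "\"%s\",\"%s\"\n", title, $0
}'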
The xmlstarlet command parses the feed and, for each /rss/channel/item, outputs the title value and the link value on separate lines. Then awk picks up the stream and massages it into CSV.
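To see the CSV step on its own, here is a minimal sketch that feeds made-up title/link pairs through the same kind of awk filter, without touching the network. The function name pairs_to_csv and the sample lines are my own assumptions for illustration; they are not part of the feed or of xmlstarlet.

```shell
#!/bin/sh
# Hypothetical helper: turn alternating title/link lines into quoted CSV rows.
pairs_to_csv() {
    awk '{
        title = $0
        gsub(/"/, "\"\"", title)        # CSV escapes a quote by doubling it
        if ((getline link) <= 0) exit   # stop if a title has no matching link
        gsub(/"/, "\"\"", link)
        printf "\"%s\",\"%s\"\n", title, link
    }'
}

# Sample input standing in for the xmlstarlet output (made-up titles and links)
printf '%s\n' \
    'He said "hi"' 'http://example.com/a' \
    'Plain title' 'http://example.com/b' |
    pairs_to_csv
```

With that sample input the first row comes out as "He said ""hi""","http://example.com/a". Note that this still assumes each title and each link fits on a single line.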
Just for fun, a sed version of that awk (GNU sed, since it relies on \| for alternation):
sed -n 's/"/&&/g;s/^\|$/"/g;h;n;s/"/&&/g;s/^\|$/"/g;x;G;s/\n/,/;p'
or
sed -n ' # do not automatically print
# current line is the title
s/"/&&/g # double up any double quotes (CSV quote escaping)
s/^\|$/"/g # add leading and trailing double quotes
h # store current pattern space (title) into hold space
n # read the next line (the link) from input
s/"/&&/g # double up any double quotes (CSV quote escaping)
s/^\|$/"/g # add leading and trailing double quotes
x # exchange pattern space (link) and hold space (title)
G # append a newline and then the hold space (link) to the title
s/\n/,/ # replace the newline with a comma
p # and print it
'
Upvotes: 2