Reputation: 840
I'm trying to use grep so that it prints only the matched text.
I'm reading an XML file and I want to extract the URLs inside the location tags:
<?xml>
<!-- ..... -->
<location>http://myurl.com/myuri/document</location>
I want to get only "http://myurl.com/myuri/document". I tried this:
curl http://mywebsite.com/file.xml | grep "\<location\>"
And I received the full tag:
<location>http://myurl.com/myuri/document</location>
<location>http://myurl.com/myuri/document2</location>
<location>http://myurl.com/myuri/document3</location>
Now I want only the URL, so I tried this:
curl http://mywebsite.com/file.xml | grep "\<location\>" | grep -oh ">.*<"
That almost works, haha.
I get the URL, but still wrapped in the > and < characters:
>http://myurl.com/myuri/document<
How can I get ONLY the match? For example (this example doesn't work):
curl http://mywebsite.com/file.xml | grep "\<location\>" | grep -oh ">(.*)<"
http://myurl.com/myuri/document
I want to pass the result to wget after this, something like | wget $1
Upvotes: 2
Views: 861
Reputation: 1073
The simplest solution I can think of is sed:
... | sed -e 's/^>//' -e 's/<$//'
This strips the angle brackets left on either side of the URL.
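To see the cleanup end to end, here is a minimal sketch that runs the asker's grep -o step followed by the sed command above. The xml variable is hypothetical sample data standing in for the curl output, and the first grep uses a literal '<location>' pattern rather than the asker's quoted form.

```shell
# Hypothetical sample standing in for the output of curl.
xml='<?xml version="1.0"?>
<location>http://myurl.com/myuri/document</location>
<location>http://myurl.com/myuri/document2</location>'

# Keep the location lines, cut out ">...<", then strip the leading ">"
# and the trailing "<" with the two sed expressions.
urls=$(printf '%s\n' "$xml" |
  grep '<location>' |
  grep -o '>.*<' |
  sed -e 's/^>//' -e 's/<$//')

printf '%s\n' "$urls"
```

Because .* is greedy, this relies on each line holding a single location element with no nested tags.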
Upvotes: 0
Reputation: 1939
I wasn't able to get anubhava's version working, so after some experimenting I came up with the following. Note that I've included my GNU grep version, as I'm not sure whether the problem is down to that.
I was a bit concerned about handling embedded XML tags in the text being searched (probably not an issue with your example usage of location, but I'm looking at this as a more general problem). I also found I had to remove the <location>..</location> wrappers from the resulting text, hence the two sed commands.
duck@lt-ctaylor-2:~/ateb/myx$ grep --version
grep (GNU grep) 2.24
duck@lt-ctaylor-2:~/ateb/myx$ cat tmp.tmp
<location><test>123</test></location>
duck@lt-ctaylor-2:~/ateb/myx$ cat tmp.tmp | grep -o '<location>.*</location>' | sed 's;<location>;;' | sed 's;</location>;;'
<test>123</test>
Upvotes: 0
Reputation: 785146
You can use the -P option of GNU grep for a PCRE regex:
curl http://mywebsite.com/file.xml | grep -oP '<location>\K[^<]+'
Or using awk:
curl http://mywebsite.com/file.xml | awk -F '</?location>' '/<location>/{print $2}'
http://myurl.com/myuri/document
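The awk version works because the field separator is itself a regex: splitting each line on <location> or </location> leaves the URL in the second field. A small sketch on hypothetical sample data (the xml variable is an assumption, used here in place of the curl output):

```shell
# Hypothetical sample standing in for the downloaded XML.
xml='<!-- sample -->
<location>http://myurl.com/myuri/document</location>
<location>http://myurl.com/myuri/document2</location>'

# -F sets the field separator to the opening or closing location tag,
# so $1 is the (empty) text before the tag and $2 is the URL.
awk_urls=$(printf '%s\n' "$xml" |
  awk -F '</?location>' '/<location>/{print $2}')

printf '%s\n' "$awk_urls"
```

The /<location>/ pattern skips lines without the tag, so the comment line produces no output.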
Upvotes: 1
Reputation: 22428
grep with Perl regex:
grep -oP '(?<=<location>)[^<]+(?=</location>)'
Or
grep -o '[^<>]\+</location>' |grep -o '^[^<>]\+'
Or with sed:
sed -n 's#<location>\([^<]\+\)</location>#\1#p'
And if you want to download all these URLs, then:
curl http://mywebsite.com/file.xml |
grep -o '[^<>]\+</location>' |grep -o '^[^<>]\+' |
wget -ci -
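If you would rather check what will be downloaded first, one option is a dry run with xargs. This is only a sketch: the xml variable is hypothetical sample data, the sed command is the one shown above (it assumes a sed that supports \+, such as GNU sed), and echo prints each wget command instead of running it.

```shell
# Hypothetical sample standing in for the output of curl.
xml='<location>http://myurl.com/myuri/document</location>
<location>http://myurl.com/myuri/document2</location>'

# Extract the URLs, then build one wget command per URL.
# The echo makes this a dry run; nothing is downloaded.
cmds=$(printf '%s\n' "$xml" |
  sed -n 's#<location>\([^<]\+\)</location>#\1#p' |
  xargs -n 1 echo wget -c)

printf '%s\n' "$cmds"
```

Dropping the echo would run wget -c once per URL, which is close to the "| wget $1" idea in the question.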
Upvotes: 1