Renato Cassino

Reputation: 840

How to get only the match in grep

I'm trying to use grep to print only the matched text.

I'm reading an XML file and I want to extract the URLs inside the location tags:

<?xml>
<!-- ..... -->
<location>http://myurl.com/myuri/document</location>

I want to get only "http://myurl.com/myuri/document". I tried this:

curl http://mywebsite.com/file.xml | grep "\<location\>"

And I received the full tag:

<location>http://myurl.com/myuri/document</location>
<location>http://myurl.com/myuri/document2</location>
<location>http://myurl.com/myuri/document3</location>

Now I want to get only the URL, so I tried this:

curl http://mywebsite.com/file.xml | grep "\<location\>" | grep -oh ">.*<"

And I almost got there, haha.

I received the URL with the > and < characters still attached:

>http://myurl.com/myuri/document<

How can I get ONLY the match? For example (this example doesn't work):

curl http://mywebsite.com/file.xml | grep "\<location\>" | grep -oh ">(.*)<"
http://myurl.com/myuri/document

I want to pass the result to wget after this, something like | wget $1

Upvotes: 2

Views: 861

Answers (4)

Jeffrey Cash

Reputation: 1073

The simplest solution I can think of is sed:

... | sed -e 's/^>//' -e 's/<$//'

This will strip the angle brackets left on either end of the URL.
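As a rough sketch of how this fits into the pipeline from the question, the snippet below runs the same grep and sed steps on inline sample data instead of curl. The sample variable and the urls variable are assumptions made for illustration only.

```shell
# Sample input standing in for the curl output (hypothetical data).
sample='<location>http://myurl.com/myuri/document</location>
<location>http://myurl.com/myuri/document2</location>'

# grep -o keeps only the ">...<" part of each line;
# sed then trims the leading ">" and the trailing "<".
urls=$(printf '%s\n' "$sample" | grep -o '>.*<' | sed -e 's/^>//' -e 's/<$//')

printf '%s\n' "$urls"
```

With this sample, each output line should be a bare URL with no brackets.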

Upvotes: 0

Craig Taylor

Reputation: 1939

I wasn't able to get anubhava's version working, so after some experimenting I came up with the following. I've included my GNU grep version below in case the problem is specific to it.

I was a bit concerned about handling embedded XML tags in the text being searched for (probably not an issue with your example usage of location, but I'm looking at this as a more general problem). I also found I had to remove the <location>..</location> wrappers from the resulting text, hence the two sed commands.

duck@lt-ctaylor-2:~/ateb/myx$ grep --version
grep (GNU grep) 2.24

duck@lt-ctaylor-2:~/ateb/myx$ cat tmp.tmp
<location><test>123</test></location>

duck@lt-ctaylor-2:~/ateb/myx$ cat tmp.tmp | grep -o '<location>.*</location>' | sed 's;<location>;;' | sed 's;</location>;;'
<test>123</test>
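The two sed calls can probably be folded into a single sed process with two -e expressions. This is only a sketch of that variation, using an inline sample in place of tmp.tmp; the sample and inner variables are hypothetical names.

```shell
# Inline stand-in for the contents of tmp.tmp (hypothetical data).
sample='<location><test>123</test></location>'

# grep -o isolates the whole <location>...</location> element;
# one sed process then removes the opening and the closing wrapper.
inner=$(printf '%s\n' "$sample" |
  grep -o '<location>.*</location>' |
  sed -e 's;<location>;;' -e 's;</location>;;')

printf '%s\n' "$inner"   # prints: <test>123</test>
```' ]
</test>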

Upvotes: 0

anubhava

Reputation: 785146

You can use the -P option of GNU grep for a PCRE regex; \K discards everything matched so far, so only the text after <location> is printed:

curl http://mywebsite.com/file.xml | grep -oP '<location>\K[^<]+'

Or using awk:

curl http://mywebsite.com/file.xml | awk -F '</?location>' '/<location>/{print $2}'

http://myurl.com/myuri/document
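To see why the awk version prints $2, it may help to run it on inline data: the field separator matches both <location> and </location>, so on a matching line the first field is empty and the second field is the URL. The sample and urls variables below are assumptions for illustration, not part of the original command.

```shell
# Sample input standing in for the curl output (hypothetical data).
sample='<?xml version="1.0"?>
<location>http://myurl.com/myuri/document</location>
<location>http://myurl.com/myuri/document2</location>'

# -F sets the field separator to the opening or closing location tag.
# Lines without <location> are skipped by the /<location>/ pattern.
urls=$(printf '%s\n' "$sample" | awk -F '</?location>' '/<location>/{print $2}')

printf '%s\n' "$urls"
```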

Upvotes: 1

Jahid

Reputation: 22428

grep with Perl regex:

grep -oP '(?<=<location>)[^<]+(?=</location>)'

Or

grep -o '[^<>]\+</location>' |grep -o '^[^<>]\+'

Or with sed:

sed -n 's#<location>\([^<]\+\)</location>#\1#p'

And if you want to download all these URLs, then:

curl http://mywebsite.com/file.xml | 
grep -o '[^<>]\+</location>' |grep -o '^[^<>]\+' | 
wget -ci -
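Here wget -i - reads the URL list from standard input and -c resumes partial downloads. Before downloading anything, one option is a dry run that only prints the commands. The sketch below uses inline sample data instead of curl and xargs with echo instead of a real wget call; the sample and preview variables are hypothetical.

```shell
# Sample input standing in for the curl output (hypothetical data).
sample='<location>http://myurl.com/myuri/document</location>
<location>http://myurl.com/myuri/document2</location>'

# Same two-step grep extraction as above, then print one
# "wget -c URL" line per URL instead of actually running wget.
preview=$(printf '%s\n' "$sample" |
  grep -o '[^<>]\+</location>' | grep -o '^[^<>]\+' |
  xargs -n 1 echo wget -c)

printf '%s\n' "$preview"
```

If the printed commands look right, dropping the echo (or going back to wget -ci -) would perform the actual downloads.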

Upvotes: 1
