B.Mr.W.

Reputation: 19628

Using sed or awk to get the substring that matches a regex

Hi, how can I use sed or awk to extract the substring that matches a regular expression?

I have seen several examples that modify or replace a substring, but I just want to get the matching part.

My data looks like this:

<loc>http://www.A.com/sitemap1.gz</loc>
<loc>http://www.A.com/sitemap2.gz</loc>
<loc>http://www.A.com/sitemap3.gz</loc>
<loc>http://www.A.com/sitemap4.gz</loc>
<loc>http://www.A.com/sitemap5.gz</loc>
<loc>http://www.A.com/sitemap6.gz</loc>
<loc>http://www.A.com/sitemap7.gz</loc>
<loc>http://www.A.com/sitemap8.gz</loc>

The output should look like this:

http://www.A.com/sitemap1.gz
http://www.A.com/sitemap2.gz
http://www.A.com/sitemap3.gz
....

I tried

cat data | sed 's/'http.*gz'//' 

but this command removes exactly the part that I want to keep. Thanks.

Upvotes: 0

Views: 2414

Answers (2)

Chris Seymour

Reputation: 85795

A simple grep will do with the -o option:

$ grep -o 'http[^<]*' file
http://www.A.com/sitemap1.gz
http://www.A.com/sitemap2.gz
http://www.A.com/sitemap3.gz
http://www.A.com/sitemap4.gz
http://www.A.com/sitemap5.gz
http://www.A.com/sitemap6.gz
http://www.A.com/sitemap7.gz
http://www.A.com/sitemap8.gz

With awk you could do:

$ awk -F'[<>]' '{print $3}' file
http://www.A.com/sitemap1.gz
http://www.A.com/sitemap2.gz
http://www.A.com/sitemap3.gz
http://www.A.com/sitemap4.gz
http://www.A.com/sitemap5.gz
http://www.A.com/sitemap6.gz
http://www.A.com/sitemap7.gz
http://www.A.com/sitemap8.gz
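
The field-splitting approach above relies on the URL always being the third field. If you would rather have awk return whatever a regex matches, wherever it sits on the line, the match() function can be used. The following is a minimal sketch, assuming input in the same shape as the question's data; the variable names data and result are only for illustration.

```shell
# Sample input in the same shape as the question's data (two lines for brevity).
data='<loc>http://www.A.com/sitemap1.gz</loc>
<loc>http://www.A.com/sitemap2.gz</loc>'

# match() sets RSTART and RLENGTH to the start position and length of the
# leftmost-longest match; substr() then extracts exactly that piece.
# Lines that do not match the regex print nothing.
result=$(printf '%s\n' "$data" |
  awk 'match($0, /http[^<]*/) { print substr($0, RSTART, RLENGTH) }')

printf '%s\n' "$result"
```

This only uses match(), RSTART, RLENGTH and substr(), which are part of POSIX awk, so it should not depend on gawk extensions.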

Upvotes: 4

anubhava

Reputation: 785246

This sed should work:

sed 's/^.*\(http.*gz\).*$/\1/' file
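
One thing to keep in mind with the substitution above: sed prints every line by default, so a line that does not match the pattern passes through unchanged. A possible variation, sketched here under the assumption that the real file also contains lines without a URL (the <urlset> wrapper lines below are hypothetical), is to combine -n with the p flag so that only lines where the substitution succeeded are printed.

```shell
# Hypothetical input: the question's <loc> lines plus wrapper lines
# that contain no URL.
data='<urlset>
<loc>http://www.A.com/sitemap1.gz</loc>
<loc>http://www.A.com/sitemap2.gz</loc>
</urlset>'

# -n suppresses the default printing; the trailing p prints a line only
# if the substitution was applied. [^<]* stops at the closing tag instead
# of relying on a greedy .* followed by "gz".
result=$(printf '%s\n' "$data" |
  sed -n 's/.*<loc>\(http[^<]*\)<\/loc>.*/\1/p')

printf '%s\n' "$result"
```

With this form the wrapper lines are dropped rather than echoed back.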

Or, with GNU grep, the -P (--perl-regexp) option can also do the job using lookarounds:

grep -Po '(?<=<loc>).*?(?=</loc>)' file

Upvotes: 2
