Reputation: 19628
Hi, how can I use sed or awk to extract the substring that matches a regular expression?
I have seen several examples that modify or replace a substring, but I just want to get the matching part.
My data looks like this:
<loc>http://www.A.com/sitemap1.gz</loc>
<loc>http://www.A.com/sitemap2.gz</loc>
<loc>http://www.A.com/sitemap3.gz</loc>
<loc>http://www.A.com/sitemap4.gz</loc>
<loc>http://www.A.com/sitemap5.gz</loc>
<loc>http://www.A.com/sitemap6.gz</loc>
<loc>http://www.A.com/sitemap7.gz</loc>
<loc>http://www.A.com/sitemap8.gz</loc>
The output should look like this:
http://www.A.com/sitemap1.gz
http://www.A.com/sitemap2.gz
http://www.A.com/sitemap3.gz
....
I tried
cat data | sed 's/'http.*gz'//'
but this command removes exactly the part that I want to keep. Thanks.
Upvotes: 0
Views: 2414
Reputation: 85795
A simple grep with the -o option will do:
$ grep -o 'http[^<]*' file
http://www.A.com/sitemap1.gz
http://www.A.com/sitemap2.gz
http://www.A.com/sitemap3.gz
http://www.A.com/sitemap4.gz
http://www.A.com/sitemap5.gz
http://www.A.com/sitemap6.gz
http://www.A.com/sitemap7.gz
http://www.A.com/sitemap8.gz
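To see why this works, here is a minimal sketch. It assumes a small temporary file standing in for the asker's data (the extra <urlset> lines are hypothetical, added only to show that lines without a match produce no output) and a grep that supports -o, such as GNU or BSD grep.

```shell
# Hypothetical sample file standing in for the asker's data.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<urlset>
<loc>http://www.A.com/sitemap1.gz</loc>
<loc>http://www.A.com/sitemap2.gz</loc>
</urlset>
EOF

# -o prints only the matched text instead of the whole line.
# [^<]* stops at the first '<', so the closing </loc> tag is never included.
grep_out=$(grep -o 'http[^<]*' "$sample")
printf '%s\n' "$grep_out"

rm -f "$sample"
```

Note that the pattern matches any text starting with http, so an attribute value containing a URL on another line would be printed as well.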
With awk you could do:
$ awk -F'[<>]' '{print $3}' file
http://www.A.com/sitemap1.gz
http://www.A.com/sitemap2.gz
http://www.A.com/sitemap3.gz
http://www.A.com/sitemap4.gz
http://www.A.com/sitemap5.gz
http://www.A.com/sitemap6.gz
http://www.A.com/sitemap7.gz
http://www.A.com/sitemap8.gz
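The reason $3 is the URL can be shown with a short sketch that prints every field. The loop and the $2 == "loc" guard are illustrative additions, not part of the answer above; the guard assumes you want to skip lines that are not <loc> elements.

```shell
line='<loc>http://www.A.com/sitemap1.gz</loc>'

# With -F'[<>]' every '<' or '>' is a field separator, so the line splits
# into five fields: an empty one, "loc", the URL, "/loc", and an empty one.
fields=$(printf '%s\n' "$line" |
  awk -F'[<>]' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')
printf '%s\n' "$fields"

# Guarding on $2 prints the URL only for <loc> lines and skips the rest.
url=$(printf '<urlset>\n%s\n' "$line" | awk -F'[<>]' '$2 == "loc" { print $3 }')
printf '%s\n' "$url"
```

Without the guard, awk prints $3 for every input line, which is an empty line whenever a line has fewer than three fields.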
Upvotes: 4
Reputation: 785246
This sed should work:
sed 's/^.*\(http.*gz\).*$/\1/' file
Alternatively, grep -P (--perl-regexp, a GNU grep option) can also do the job:
grep -Po '(?<=<loc>).*?(?=</loc>)' file
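The attempt in the question, s/http.*gz//, replaces the match with nothing, which is why it deletes the wanted part. Capturing the match and replacing the whole line with \1 keeps it instead. As a possible refinement, and assuming the file may also contain lines without a URL (the <urlset> lines below are hypothetical), adding -n and the p flag prints only the lines where the substitution succeeded:

```shell
# Hypothetical sample file with lines that contain no URL.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<urlset>
<loc>http://www.A.com/sitemap1.gz</loc>
</urlset>
EOF

# \(http.*gz\) captures the URL; the surrounding .* consume the rest of the
# line, so the whole line is replaced by the captured text (\1).
# -n suppresses automatic printing, and the trailing p prints only the
# lines on which the substitution actually happened.
sed_out=$(sed -n 's/^.*\(http.*gz\).*$/\1/p' "$sample")
printf '%s\n' "$sed_out"

rm -f "$sample"
```

Without -n and p, sed would pass the <urlset> lines through unchanged.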
Upvotes: 2