jefflovejapan

Reputation: 2121

Grab all instances of pattern from single line of text, edit, pipe out to line-separated text file

I have a block of text (a single line) that is a list of URLs separated by tags and a bunch of other junk. I want to find every URL in that block that matches 'http.*">RSS', trim each match (to get rid of everything after the URL itself), and write the results to a file, one entry per line.

I thought I could do this with grep (then edit and add newlines with sed), but grep prints matching lines, not just the matched text. Is there a different command I should be using? I've also tried using sed to insert a newline (\n) ahead of the pattern wherever it occurs, but that's not working either.

Edit: Here's an example of the data that I'm working with:

OUT</a> (<a href="https://evilcakes.wordpress.com/rss">RSS</a>)</li><li><a href="http://eater.com/" title="">Eater National</a> (<a href="http://feeds.feedburner.com/EaterNational">RSS</a>)</li><li><a href="http://www.foodtechconnect.com" title="">Food+Tech Connect</a> (<a href="http://feeds.feedburner.com/foodtechconnect">RSS</a>)</li><li><a href="http://www.innatthecrossroads.com" title="">Inn at the Crossroads</a> (<a href="http://innatthecrossroads.com/feed/">RSS</a>)</li><li><a href="http://www.seriouseats.com/" title="">Serious Eats</a> (<a href="http://feeds.seriouseats.com/seriouseatsfeaturesvideos">RSS</a>)</li><li><a href="http://www.thatsnerdalicious.com" title="">That's Nerdalicious!</a> (<a href="http://www.thatsnerdalicious.com/feed/">RSS</a>)</li><li><a href="http://thedrunkenmoogle.com/" title="">The Drunken Moogle</a> (<a href="http://www.thedrunkenmoogle.com/rss">RSS</a>)</li></ul></li><li><h2 class="entry-title">Comics</h2><ul class="opmlGroup"><li><a

Upvotes: 0

Views: 3696

Answers (4)

Thor

Reputation: 47189

Here's one way that works with GNU and BSD grep:

<infile grep -Eo 'https?://[^"]+">RSS' | grep -Eo '^[^"]+'

Output:

https://evilcakes.wordpress.com/rss
http://feeds.feedburner.com/EaterNational
http://feeds.feedburner.com/foodtechconnect
http://innatthecrossroads.com/feed/
http://feeds.seriouseats.com/seriouseatsfeaturesvideos
http://www.thatsnerdalicious.com/feed/
http://www.thedrunkenmoogle.com/rss
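
The question also asks for the result as a line-separated file, so here is a minimal sketch of the same two-stage pipeline with a redirect added. The file names infile and feeds.txt are placeholders of mine, and the sample line is a trimmed copy of the data from the question.

```shell
# Hypothetical sample input: one long line, trimmed from the question's data.
printf '%s\n' 'OUT</a> (<a href="https://evilcakes.wordpress.com/rss">RSS</a>)</li><li><a href="http://eater.com/" title="">Eater National</a> (<a href="http://feeds.feedburner.com/EaterNational">RSS</a>)</li>' > infile

# Stage 1 keeps each URL together with its trailing ">RSS marker;
# stage 2 keeps only the part before the first double quote.
# -o already prints one match per line, so the file is line-separated.
grep -Eo 'https?://[^"]+">RSS' < infile | grep -Eo '^[^"]+' > feeds.txt

cat feeds.txt
```

With this sample, feeds.txt should hold the two feed URLs and skip http://eater.com/, since that URL is not followed by ">RSS.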

Upvotes: 1

potong

Reputation: 58483

This might work for you (GNU sed):

sed '/https\?:[^"]*/!d;s//\n&\n/;s/^[^\n]*\n//;P;D' file
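
The one-liner is dense, so here is how I read it, spread over several lines with comments. As far as I can tell it prints every http(s) URL on the line, not only the ones followed by ">RSS, so the second function is a variant that narrows it down. The function names all_urls and rss_urls and the sample line are mine, the variant is my guess at the intent, and both rely on GNU sed extensions (\? and \n).

```shell
# Hypothetical helper names; both read text on stdin. GNU sed assumed.

# The original script, one command per line: prints every http(s) URL.
all_urls() {
  sed '
    # no URL left in the pattern space: delete it and end this cycle
    /https\?:[^"]*/!d
    # empty regex reuses the last one: wrap the first URL in newlines
    s//\n&\n/
    # drop everything up to and including the first newline
    s/^[^\n]*\n//
    # print up to the newline (the URL)
    P
    # delete what was printed and rerun the script on the remainder
    D
  '
}

# Variant (my assumption about the intent): only URLs followed by ">RSS.
rss_urls() {
  sed '
    /https\?:[^"]*">RSS/!d
    s//\n&\n/
    s/^[^\n]*\n//
    # strip the marker that the stricter regex pulled into the match
    s/">RSS\n/\n/
    P
    D
  '
}

sample='<a href="http://eater.com/" title="">Eater National</a> (<a href="http://feeds.feedburner.com/EaterNational">RSS</a>)'
printf '%s\n' "$sample" | all_urls
printf '%s\n' "$sample" | rss_urls
```

On this sample, all_urls should print both URLs and rss_urls only the feedburner one. Redirect either to a file with > to get the line-separated output the question asks for.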

Upvotes: 3

Steve

Reputation: 54512

Here's one way using GNU grep:

grep -oP 'http[^"]*(?=">RSS)' file

Results:

https://evilcakes.wordpress.com/rss
http://feeds.feedburner.com/EaterNational
http://feeds.feedburner.com/foodtechconnect
http://innatthecrossroads.com/feed/
http://feeds.seriouseats.com/seriouseatsfeaturesvideos
http://www.thatsnerdalicious.com/feed/
http://www.thedrunkenmoogle.com/rss

The options:

-o, --only-matching
    Print only the matched (non-empty) parts of a matching line, with each such 
    part on a separate output line.
-P, --perl-regexp
    Interpret PATTERN as a Perl regular expression. This is highly experimental
    and grep -P may warn of unimplemented features.

Also, you may like to read up on lookaround assertions. HTH.
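
To make the lookahead concrete, here is a small side-by-side sketch on one link from the sample data. The variable names are only for illustration, and it needs a grep built with -P support.

```shell
sample='(<a href="http://innatthecrossroads.com/feed/">RSS</a>)'

# Without the lookahead, the ">RSS marker is part of the printed match.
with_marker=$(printf '%s\n' "$sample" | grep -oP 'http[^"]*">RSS')

# With (?=...), the marker must follow the URL but is not consumed,
# so only the URL itself is printed.
url_only=$(printf '%s\n' "$sample" | grep -oP 'http[^"]*(?=">RSS)')

printf '%s\n' "$with_marker" "$url_only"
```

The first line of output still carries the trailing ">RSS, the second is the bare URL. That is why the lookahead version needs no second grep or sed pass to trim the match.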

EDIT:

Here's another way using awk:

awk -F\" '{ for(i=1;i<NF;i++) if ($(i+1) ~ /RSS/) print $i }' file

Results:

https://evilcakes.wordpress.com/rss
http://feeds.feedburner.com/EaterNational
http://feeds.feedburner.com/foodtechconnect
http://innatthecrossroads.com/feed/
http://feeds.seriouseats.com/seriouseatsfeaturesvideos
http://www.thatsnerdalicious.com/feed/
http://www.thedrunkenmoogle.com/rss
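
If it isn't obvious why the awk version works, it may help to dump the fields awk sees once the double quote is the separator. This is just a diagnostic sketch on a shortened sample line of my own choosing, not part of the solution.

```shell
sample='<a href="http://eater.com/" title="">Eater National</a> (<a href="http://feeds.feedburner.com/EaterNational">RSS</a>)'

# Print every field with its index, splitting the line on double quotes.
fields=$(printf '%s\n' "$sample" | awk -F\" '{ for (i = 1; i <= NF; i++) printf "%d: %s\n", i, $i }')

printf '%s\n' "$fields"
```

Each quoted attribute value lands in its own field, and the text that follows it lands in the next one. Here field 6 is the feed URL and field 7 starts with >RSS, which is exactly the pair the loop tests for; field 2 (the site URL) is followed by title= and is skipped.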

Upvotes: 3

ddoxey

Reputation: 2063

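I put your sample data in urls.dat. This splits the line on double quotes and prints every piece that starts with http, so it returns the site URLs as well as the RSS feeds:

awk '{ n = split($0, a, "\""); for (i = 1; i <= n; i++) if (a[i] ~ /^http/) print a[i] }' urls.dat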

Upvotes: 1
