battlepope

Reputation: 300

Extract info out of html via bash

I am trying to write a bash script that can extract information from a large HTML file. I need this to automatically download the latest newspaper every morning :). To download the latest newspaper I have to know its ID, and to get that I have to parse the link that points to it. I managed to extract the line that holds the ID with awk:

awk '/show.php\?id=/' index.html

and get

<a href="show.php?id=914826">Latest Newspaper</a>

So what I need out of this line is "914826". This is where I am stuck... I don't know how to get awk to print only that fragment instead of the whole line.

Looking forward to your answers. Thanks in advance, Simon

Upvotes: 0

Views: 1446

Answers (2)

Birei

Reputation: 36282

This complete awk command should work. For lines that match the regexp, split the line on the = and " characters. Splitting your example line this way gives:

  • First field would be: <a href
  • Second field: [blank]
  • Third field: show.php?id
  • Fourth field: 914826
  • And fifth field: >Latest Newspaper</a>

So print the fourth one (arr[4]):

awk '
    /show.php\?id=/ { 
        split( $0, arr, /[="]/ ); 
        print arr[4] 
    }
' index.html
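As a quick sanity check, you can feed the sample anchor tag from the question straight into the same awk program:

```shell
# split() breaks the line on '=' and '"', leaving the ID in arr[4].
echo '<a href="show.php?id=914826">Latest Newspaper</a>' |
    awk '/show.php\?id=/ { split($0, arr, /[="]/); print arr[4] }'
# prints: 914826
```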

Upvotes: 1

Igor Chubin

Reputation: 64613

Use grep:

grep -o 'id=[0-9]*'

Example:

$ echo '<a href="show.php?id=914826">Latest Newspaper</a>' | grep -o 'id=[0-9]*'
id=914826

You can do the same with perl or sed:

$ echo '<a href="show.php?id=914826">Latest Newspaper</a>' | perl -pe 's/.*id=([0-9]*).*/$1/'
914826
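Since only the perl version is shown, here is what the sed equivalent of that same substitution might look like (BRE syntax, so the capture group needs escaped parentheses):

```shell
# Capture the digits after "id=" and replace the whole line
# with just the captured group.
echo '<a href="show.php?id=914826">Latest Newspaper</a>' |
    sed 's/.*id=\([0-9]*\).*/\1/'
# prints: 914826
```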

Upvotes: 3
