battlepope

Reputation: 300

Extract info out of html via bash

I am trying to write a bash script that can extract information from a large HTML file. I need this to automatically download the latest newspaper every morning :). To download the latest newspaper I have to know its ID, and to get that I have to parse the link that points to it. I managed to extract the line that holds the ID with awk:

awk '/show.php\?id=/' index.html

and get

<a href="show.php?id=914826">Latest Newspaper</a>

So what I need out of this line is "914826". This is where I am stuck... I don't know how to get awk to print only that fragment instead of the whole line.

Looking forward to your answers. Thanks in advance, Simon

Upvotes: 0

Views: 1446

Answers (2)

Birei

Reputation: 36282

This complete awk command should work. For lines that match the regexp, split the line on the = and " characters. Splitting your example line this way gives:

  • First field would be: <a href
  • Second field: [blank]
  • Third field: show.php?id
  • Fourth field: 914826
  • And fifth field: >Latest Newspaper</a>

So print the fourth one (arr[4]):

awk '
    /show.php\?id=/ { 
        split( $0, arr, /[="]/ ); 
        print arr[4] 
    }
' index.html
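As a quick sanity check, you can feed the sample anchor tag from the question straight into the same awk program:

```shell
# split() breaks the line on '=' and '"', leaving the ID in arr[4].
echo '<a href="show.php?id=914826">Latest Newspaper</a>' |
    awk '/show.php\?id=/ { split($0, arr, /[="]/); print arr[4] }'
# prints: 914826
```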

Upvotes: 1

Igor Chubin

Reputation: 64613

Use grep:

grep -o 'id=[0-9]*'

Example:

$ echo '<a href="show.php?id=914826">Latest Newspaper</a>' | grep -o 'id=[0-9]*'
id=914826

You can do the same with perl or sed:

$ echo '<a href="show.php?id=914826">Latest Newspaper</a>' | perl -pe 's/.*id=([0-9]*).*/$1/'
914826
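Since only the perl version is shown, here is what the sed equivalent of that same substitution might look like (BRE syntax, so the capture group needs escaped parentheses):

```shell
# Capture the digits after "id=" and replace the whole line
# with just the captured group.
echo '<a href="show.php?id=914826">Latest Newspaper</a>' |
    sed 's/.*id=\([0-9]*\).*/\1/'
# prints: 914826
```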

Upvotes: 3
