Reputation: 300
I am trying to write a bash script that extracts information from a large HTML file. I need it to automatically download the latest newspaper every morning :). To download the latest newspaper I have to know its ID, and to get the ID I have to parse the link that points to it. I managed to extract the line that holds the ID with awk:
awk '/show.php\?id=/' index.html
and get
<a href="show.php?id=914826">Latest Newspaper</a>
So what I need out of this line is "914826". This is where I am stuck... I don't think I can use awk to extract just a fragment of the line rather than the whole line.
Looking forward to your answers. Thanks in advance, Simon
Upvotes: 0
Views: 1446
Reputation: 36282
This complete awk command should work. For lines that match the regexp, it splits on = and ". Splitting your example line this way gives five pieces:
<a href
(an empty string, between = and ")
show.php?id
914826
>Latest Newspaper</a>
So print the fourth one (arr[4]):
awk '
/show.php\?id=/ {
split( $0, arr, /[="]/ );
print arr[4]
}
' index.html
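To see why the ID ends up in arr[4], it can help to print every piece of the split for the sample line. This is only a minimal sketch that assumes the exact line from the question; the variable names `line` and `fields` are made up for illustration.

```shell
# Sample line taken from the question (assumed to be the only match).
line='<a href="show.php?id=914826">Latest Newspaper</a>'

# Split on = and " exactly as in the answer, then print each piece with its index.
fields=$(printf '%s\n' "$line" | awk '{
    n = split($0, arr, /[="]/)
    for (i = 1; i <= n; i++) printf "%d: [%s]\n", i, arr[i]
}')

printf '%s\n' "$fields"
```

The second piece is empty because = and " sit next to each other in href=", which is why the ID is the fourth piece and not the third.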
Upvotes: 1
Reputation: 64613
Use grep:
grep -o 'id=[0-9]*'
Example:
$ echo '<a href="show.php?id=914826">Latest Newspaper</a>' | grep -o 'id=[0-9]*'
id=914826
You can do the same with perl or sed. This perl version also strips the id= prefix, so only the number is printed:
$ echo '<a href="show.php?id=914826">Latest Newspaper</a>' | perl -pe 's/.*id=([0-9]*).*/$1/'
914826
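The sed variant is not spelled out above, so here is one possible sketch of it, together with a way to drop the id= prefix from the grep output. It assumes the sample line from the question; the variable names `line`, `id_sed` and `id_grep` are illustrative only.

```shell
# Sample line taken from the question.
line='<a href="show.php?id=914826">Latest Newspaper</a>'

# sed: -n suppresses the default output, and the p flag prints only lines
# where the substitution succeeded. \( \) captures the digits after id=.
id_sed=$(printf '%s\n' "$line" | sed -n 's/.*show\.php?id=\([0-9][0-9]*\).*/\1/p')

# grep -o prints "id=914826"; cut keeps only the part after the "=".
id_grep=$(printf '%s\n' "$line" | grep -o 'id=[0-9]*' | cut -d= -f2)

echo "$id_sed"
echo "$id_grep"
```

When run against a whole page instead of a single line, anchoring the pattern on show\.php?id= as in the sed command is likely the safer choice, since a bare id= can also match unrelated HTML id attributes.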
Upvotes: 3