Reputation: 607

Extract text from between HTML tags with a specific id using sed or grep

What command should I use to extract the text from the following HTML, which sits in a file named "test.html": <span id="imAnID">extractme</span> ?

The real file will be larger, so I need to point grep or sed at an id and have it extract only the text from the tag with that id. I run the terminal from the directory where the file resides and do this:

cat test.html | sed -n 's/.*<span id="imAnID">\(.*\)<\/span>.*/\1/p'

What am I doing wrong? I get empty output. I'm not opposed to using grep for this if it's easier.

Upvotes: 0

Answers (4)

Nik O'Lai

Reputation: 3694

Using grep -o:

echo "<span id=\"imAnID\" hello>extractme</span> <span id='imAnID'>extractmetoo</span>" | grep -oE 'id=.?imAnID[^<>]*>[^<>]+' | cut -d'>' -f2

will find:

#=>extractme
#=>extractmetoo

This works as long as the text you want immediately follows the opening tag that carries the id attribute and sits on the same line as that tag. Note that it matches any tag with that id, not only span.
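
To run the same pipeline against a file instead of an echoed string, it can be wrapped in a small function. This is only a sketch: the name extract_by_id and its two arguments are made up for illustration, and it assumes the id contains no regex metacharacters and that the text follows the opening tag on the same line.

```shell
#!/bin/sh
# Hypothetical helper (not from the answer): print the text that directly
# follows any tag whose id attribute is $1, reading from file $2.
# Assumes the id has no regex metacharacters and the text is on the same line.
extract_by_id() {
    # ["']? tolerates double-quoted, single-quoted, or unquoted attribute values.
    grep -oE "id=[\"']?$1[\"']?[^<>]*>[^<>]+" "$2" | cut -d'>' -f2
}
```

For the file in the question this would be called as: extract_by_id imAnID test.html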

Upvotes: 0

user1277476

Reputation: 2909

awk, sed and grep are line-oriented tools. XML and HTML are based on tags. The two don't combine that well, though you can get by with awk, sed and grep on XML and HTML by using a pretty formatter on the XML or HTML before resorting to your line-oriented tools.

There's a program called xmlgawk that is supposed to be quite gawk-like, while still working on XML.

I personally prefer to do this sort of thing in Python using the lxml module, so that the XML/HTML can be fully understood without getting too wordy.
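
If a real formatter or parser is not at hand, a crude stand-in is to join the whole file into one line before matching, so that a span broken across lines still matches. The sketch below is an assumption-laden illustration, not a substitute for an HTML parser: the function name extract_flattened is hypothetical, the opening tag must look exactly like <span id="ID">, and the span must not contain nested tags.

```shell
#!/bin/sh
# Hypothetical helper: flatten file $2 to a single line, then print the text
# that follows <span id="$1">. Newlines inside the text become spaces.
extract_flattened() {
    tr '\n' ' ' < "$2" |
        grep -oE "<span id=\"$1\">[^<]*" |   # opening tag plus text up to the next tag
        cut -d'>' -f2                         # drop the tag, keep the text
}
```

Called as: extract_flattened imAnID test.html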

Upvotes: 0

djhaskin987

Reputation: 10087

It is awkward to use awk, sed, or grep for this since these tools are line-based (they process one line at a time). Is it guaranteed that the span you are trying to extract is all on the same line? Is there any possibility of other tags being used within the span (e.g. em tags)? If the span can cross lines or contain other tags, then this sounds like a job for Perl.
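
A minimal sketch of what the Perl route could look like, wrapped in a shell function. The function name extract_span_perl is hypothetical, and the regex is an assumption about the markup: it expects id to be the first attribute, written with double quotes, and it stops at the first closing span tag, so nested spans would be cut short. Perl's -0777 switch reads the whole file as one string, which lets the match cross line breaks.

```shell
#!/bin/sh
# Hypothetical helper: print the contents of every <span id="imAnID" ...>
# in file $1, including inner tags and line breaks.
extract_span_perl() {
    # -0777 slurps the file; /s lets . match newlines; .*? is non-greedy.
    perl -0777 -ne 'print "$1\n" while /<span\s+id="imAnID"[^>]*>(.*?)<\/span>/gs' "$1"
}
```

Called as: extract_span_perl test.html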

Upvotes: 0

sampson-chen

Reputation: 47357

You can try doing it with awk instead:

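  #!/bin/bash

  # Tag contents to match, without the surrounding angle brackets.
  start_tag="span id=\"imAnID\""
  end_tag="/span"

  # Split each line on < and >; print any field that sits between the two tags.
  awk -F'[<>]' -v taga="$start_tag" -v tagb="$end_tag" '{
    for (i = 1; i <= NF; i++)
      if ($i == taga && $(i+2) == tagb)
        print $(i+1)
  }'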

Run it like this:

$ ./script < infile > outfile

Upvotes: 0
