Said Savci

Reputation: 858

Parse HTML Using AWK

I have the following HTML structure and want to extract data from it using awk.

<body>
<div>...</div>
<div>...</div>
<div class="body-content">
    <div>...</div>
    <div class="product-list" class="container">
        <div class="w3-row" id="product-list-row">
            <div class="w3-col m2 s4">
                <div class="product-cell">
                    <div class="product-title">Product A</div>
                    <div class="product-price">100,56</div>
                </div>
            </div>
            <div class="w3-col m2 s4">
                <div class="product-cell">
                    <div class="product-title">Product B</div>
                    <div class="product-price">200,56</div>
                </div>
            </div>
            <div class="w3-col m2 s4">
                <div class="product-cell">
                    <div class="product-title">Product C</div>
                    <div class="product-price">300,56</div>
                </div>
            </div>
            <div class="w3-col m2 s4">
                <div class="product-cell">
                    <div class="product-title">Product D</div>
                    <div class="product-price">400,56</div>
                </div>
            </div>
        </div>
    </div>
</div>
</body>

The result I want to have is as follows.

100,56
200,56
300,56
400,56

I was experimenting with the following awk script (I know it makes no sense to select product-price twice; I was about to modify the script):

awk -F '<[^>]+>' 'found { sub(/^[[:space:]]*/,";"); print title $0; found=0 } /<div class="product-price">/ { title=$2 } /<div class="product-price">/  { found=1 }'

but it gives me the result

100,56                </div>
200,56                </div>
300,56                </div>
400,56                </div>

I have never used awk before, so I can't figure out what is wrong here or how to modify the above code. How would you do this?

Upvotes: 2

Views: 2917

Answers (5)

Reino

Reputation: 3423

It baffles me that, time and time again, people try to parse HTML not with an HTML parser but with a tool that doesn't understand HTML at all, and with regex in particular!
With an HTML parser like xidel it's as simple as:

$ xidel -s "<url> or input.html" -e '//div[@class="product-price"]'

Upvotes: 2

Daweo

Reputation: 36370

How would you do this?

If possible, use a tool designed for dealing with HTML, which GNU AWK is not.

If you are allowed to install software, use hxselect: it processes standard input and understands a subset of CSS selectors, so in this case something like:

hxselect -i -c -s '\n' div.product-price < file.html

should give you the desired result (disclaimer: I am not able to test it).

Upvotes: 2

RavinderSingh13

Reputation: 133428

If someone is looking for a Python solution, I would suggest the BeautifulSoup library. The following was written and tested in Python 3.8. To keep it separate from my previous answer, I am adding another answer here.

#!/usr/bin/env python3
## Import the library.
from bs4 import BeautifulSoup
## Read Input_file and get all of its contents (the with block closes the file).
with open('Input_file', 'r') as f:
    contents = f.read()
## Parse the contents into the soup variable, using lxml as the parser.
soup = BeautifulSoup(contents, 'lxml')
## Get only the div elements with class product-price, as the OP needs.
vals = soup.find_all("div", {"class": "product-price"})
## Print the text inside each matching tag.
for val in vals:
    print(val.text)

NOTE:

  • BeautifulSoup and lxml must be installed, using pip3 or pip depending on your system.
  • Input_file is the file the program reads your data from.
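
If installing BeautifulSoup and lxml is not an option, roughly the same extraction can be sketched with Python's standard-library html.parser. This is only a minimal sketch: it assumes each product-price div contains plain text and no nested div, and the names PriceExtractor, extract_prices and the trimmed sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class PriceExtractor(HTMLParser):
    """Collect the text of every <div class="product-price"> element."""

    def __init__(self):
        super().__init__()
        self.prices = []
        self._in_price = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; class may hold several names.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "product-price" in classes:
            self._in_price = True

    def handle_endtag(self, tag):
        # Assumption: a price div has no nested div, so any </div> ends it.
        if tag == "div":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


def extract_prices(html):
    parser = PriceExtractor()
    parser.feed(html)
    parser.close()
    return parser.prices


# Trimmed stand-in for the markup in the question.
sample = """
<div class="product-cell">
    <div class="product-title">Product A</div>
    <div class="product-price">100,56</div>
</div>
<div class="product-cell">
    <div class="product-title">Product B</div>
    <div class="product-price">200,56</div>
</div>
"""

for price in extract_prices(sample):
    print(price)
```

To run it against a real file, the sample string could be replaced with the contents read from Input_file.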

Upvotes: 3

RavinderSingh13

Reputation: 133428

With your shown samples and attempts, please try the following awk code.

awk -F"[><]" '{gsub(/\r/,"")} /^[ \t]+<div[ \t]+class="product-price">.*<\/div>/{print $3}' Input_file

Explanation: a detailed breakdown of the above. It is for explanation purposes only; to run the code, please use the one-liner above.

awk -F"[><]" '      ##Start the awk program and set the field separator to > or <.
{gsub(/\r/,"")}     ##Remove carriage returns (Control-M characters) from the line.
/^[ \t]+<div[ \t]+class="product-price">.*<\/div>/{ ##Check whether the line starts with whitespace
                    ##followed by <div class="product-price"> and runs up to the closing div tag.
  print $3          ##Print the 3rd field, which holds the price.
}
' Input_file        ##Name of the input file.

Changed the regex to /^[ \t]+<div[ \t]+class as per Ed's suggestion in the comments. Also, experts always recommend using xmlstarlet or other XML-aware tools if they are available on your system.
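
To see why the price ends up in $3, the field splitting can be mimicked outside awk. The following is only an illustrative Python sketch (not part of the awk solution): re.split on the same character class shows which piece of a sample line lands in which field, with the sample line copied from the question.

```python
import re

# One price line from the question, including its leading indentation.
line = '                    <div class="product-price">100,56</div>'

# awk -F"[><]" splits each line on every "<" or ">" character.
fields = re.split(r"[><]", line)

# awk fields are 1-based, so awk's $3 corresponds to fields[2] here.
for number, field in enumerate(fields, start=1):
    print(f"${number} = {field!r}")

price = fields[2]
print(price)
```

$1 is the indentation before the opening tag, $2 is the inside of the opening tag, and $3 is the text between the opening tag's > and the closing tag's <.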

Upvotes: 3

Ed Morton

Reputation: 203209

The result of a quick google for "xmlstarlet print div contents" and then a few secs of trial and error:

$ xmlstarlet sel -t -m "//*[@class='product-price']" -v "." -n file
100,56
200,56
300,56
400,56

For an explanation - ask google :-).
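
For readers who want a rough picture of what that command does: -m selects every node matching the XPath expression, -v "." prints each match's text value, and -n adds a newline. A comparable sketch in Python's standard-library ElementTree is shown here, assuming the input is well-formed XML. The trimmed sample markup is a stand-in, not the asker's full page; note that the duplicated class attribute on the product-list div in the question would be rejected by a strict XML parser.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for the question's markup (an assumption:
# the real page must parse as XML for this approach to work).
sample = """
<body>
  <div class="product-cell">
    <div class="product-title">Product A</div>
    <div class="product-price">100,56</div>
  </div>
  <div class="product-cell">
    <div class="product-title">Product B</div>
    <div class="product-price">200,56</div>
  </div>
</body>
"""

root = ET.fromstring(sample)

# ".//*[@class='product-price']" matches any descendant element whose
# class attribute is exactly "product-price".
prices = [el.text for el in root.findall(".//*[@class='product-price']")]

for price in prices:
    print(price)
```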

Upvotes: 3
