Prashant

Reputation: 402

Skip specific text inside a <td> tag with Python BeautifulSoup

This is my HTML file:

<tr>
<td>1</td>
<td style="font-weight: bold;"><a href="#" onclick="javascript:TollPlazaPopup(272);"> Kherki Daula </a></td> 
<td style="font-weight: bold;">60 <a onclick="return popitup(" https:="" www.google.co.in="" maps="" @28.395604,76.98176,17.52z="" data="!5m1!1e1?hl=en')'" href="https://www.google.co.in/maps/@28.395604,76.98176,17.52z/data=!3m1!1e3!5m1!1e1?hl=en" target="_Blank"> (Live Traffic)</a> &nbsp;&nbsp; - &nbsp;&nbsp; <a href="#" title="Click here to get estimated travel time." id="0-232X" onclick="javascript:TollPlazaTrafficTime(272,this);">ET</a>
</td>
</tr>
<tr>
<td>2</td>
<td style="font-weight: bold;"><a href="#" onclick="javascript:TollPlazaPopup(213);"> Shahjahanpur </a></td>
<td style="font-weight: bold;">125 <a onclick="return popitup(" https:="" www.google.co.in="" maps="" @27.99978,76.430522,17.52z="" data="!5m1!1e1?hl=en')'" href="https://www.google.co.in/maps/@27.99978,76.430522,17.52z/data=!3m1!1e3!5m1!1e1?hl=en" target="_Blank"> (Live Traffic)</a> &nbsp;&nbsp; - &nbsp;&nbsp; <a href="#" title="Click here to get estimated travel time." id="1-179X" onclick="javascript:TollPlazaTrafficTime(213,this);">ET</a>
</td>
</tr>

When I scrape it, the result looks like this:

Sr No.  Toll Plaza  Car/Jeep/Van(Rs.)
1   Kherki Daula    60 (Live Traffic)    -    ET
2   Shahjahanpur    125 (Live Traffic)    -    ET
                 Total Charges(Rs.) 90

I want to skip the text "(Live Traffic) - ET" in each row.

My Python code is:

tbody = soup('table' ,{"class":"tollinfotbl"})[0].find_all('tr')[3:]
for row in tbody:
    cols = row.findChildren(recursive=False)
    cols = [ele.text.contents[0] for ele in cols]
    if cols:
        sno = str(cols[0])
        Toll_plaza = str(cols[1])
        cost = str(cols[2])

        query = "INSERT INTO tryroute (sno,Toll_plaza, cost) VALUES (%s, %s, %s);"

When I use .contents[0], I get this error:

    cols = [ele.text.content[0] for ele in cols]
AttributeError: 'str' object has no attribute 'content'

Any help would be appreciated.

Upvotes: 2

Views: 1069

Answers (2)

chad

Reputation: 838

You can use the re module to extract the number from the raw cell text. You don't need contents[0]: hard-coding an index is error-prone and inflexible.

The code below needs import re at the top of your script.

# Leading digits followed by "(", e.g. "60 (Live Traffic) ..." -> "60"
cost_pattern = re.compile(r'^(\d+)\s*\(')

for row in tbody:
    cols = row.findChildren(recursive=False)
    cols = [ele.text for ele in cols]
    if cols:
        sno = str(cols[0])
        Toll_plaza = str(cols[1])
        cost_raw = str(cols[2])

        # Fall back to the stripped raw text when the pattern does not match,
        # so cost is always defined for this row
        match = cost_pattern.search(cost_raw)
        cost = match.group(1) if match else cost_raw.strip()

        query = "INSERT INTO tryroute (sno,Toll_plaza, cost) VALUES (%s, %s, %s);"
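To check the pattern without running the scraper, here is a minimal standalone sketch. The helper name extract_cost and the sample strings are assumptions made up for illustration, modelled on the output shown in the question. The regex is the one used in the loop above.

```python
import re

# Same pattern as in the loop above: digits at the start, then "(".
COST_PATTERN = re.compile(r'^(\d+)\s*\(')

def extract_cost(cell_text):
    """Return the leading number as a string, or None if the pattern does not match."""
    match = COST_PATTERN.search(cell_text)
    return match.group(1) if match else None

# Sample cell texts, assumed to resemble what ele.text returns for the cost column.
samples = [
    "60 (Live Traffic)    -    ET",
    "125 (Live Traffic)    -    ET",
    "90",  # no "(" after the number, so the pattern does not match
]

results = [extract_cost(s) for s in samples]
for text, cost in zip(samples, results):
    print(repr(text), "->", cost)
```

A cell without the "(Live Traffic)" link, such as a totals row, returns None here, which is why the loop above needs a fallback.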

Let me know if you need clarification.

Upvotes: 1

tirthbal

Reputation: 181

You are getting this error because you are calling contents on a str object, i.e. ele.text.

ele.text # returns a string (in your case, all of the text inside that tag)

To get the children of the tag, use contents on the tag itself:

ele.contents # inside your list comprehension; returns a list of all the children of that tag
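As a rough sketch of how this could be applied to the table in the question: the HTML below is a simplified stand-in for one row, and first_text is a hypothetical helper name, not part of BeautifulSoup. It takes the first child of each cell and keeps only that child's text, which drops the "(Live Traffic) - ET" links that follow the number.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for one row of the table in the question.
html = """
<table><tr>
<td>1</td>
<td><a href="#"> Kherki Daula </a></td>
<td>60 <a href="#"> (Live Traffic)</a> - <a href="#">ET</a></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find("tr")

def first_text(cell):
    """Return the stripped text of the cell's first child, or "" for an empty cell."""
    if not cell.contents:
        return ""
    # The first child is either a plain string ("60 ") or a tag (<a> Kherki Daula </a>).
    first = cell.contents[0]
    text = first if isinstance(first, str) else first.get_text()
    return text.strip()

cols = [first_text(td) for td in row.find_all("td", recursive=False)]
print(cols)
```

This assumes the value you want is always the first child of the cell. If the page puts whitespace or another tag in front of it, the helper would need adjusting.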

Upvotes: 2
