jjyoh

Reputation: 446

Scrape an HTML table with BeautifulSoup

I'm trying to scrape a website that is built entirely with tables. Here is a link to an example page: http://www.rc2.vd.ch/registres/hrcintapp-pub/companyReport.action?rcentId=5947621600000055031025&lang=FR&showHeader=false

My goal is to get the first and last name: Lass Christian (screenshot below).

[![Screenshot of the table row containing the name][1]][1]

[1]: https://i.sstatic.net/q3nMb.png

I've already scraped many websites, but with this one I have absolutely no idea how to proceed. There are only tables without any ID or class attributes, and I can't figure out where I'm supposed to start.

Here's an example of the HTML code:

<table border="1" cellpadding="1" cellspacing="0" width="100%">
            <tbody><tr bgcolor="#f0eef2">
                
                <th colspan="3">Associés, gérants et personnes ayant qualité pour signer</th>
            </tr>
            <tr bgcolor="#f0eef2">
                
                <th>
                    <a class="hoverable" onclick="document.forms[0].rcentId.value='5947621600000055031025';document.forms[0].lang.value='FR';document.forms[0].searchLang.value='FR';document.forms[0].order.value='N';document.forms[0].rad.value='N';document.forms[0].goToAdm.value='true';document.forms[0].showHeader.value=false;document.forms[0].submit();event.returnValue=false; return false;">
                        Nom et Prénoms, Origine, Domicile, Part sociale
                    </a>
                    
                </th>
                <th>
                    <a class="hoverable" onclick="document.forms[0].rcentId.value='5947621600000055031025';document.forms[0].lang.value='FR';document.forms[0].searchLang.value='FR';document.forms[0].order.value='F';document.forms[0].rad.value='N';document.forms[0].goToAdm.value='true';document.forms[0].showHeader.value=false;document.forms[0].submit();event.returnValue=false; return false;">
                        Fonctions
                    </a>
                    
                        <img src="/registres/hrcintapp-pub/img/down_r.png" align="bottom" border="0" alt="">
                    
                </th>
                <th>Mode Signature</th>
            </tr>
            
                <tr bgcolor="#ffffff">
                    
                    
                    <td>
                        <span style="text-decoration: none;">
                            Lass Christian, du Danemark, à Yverdon-les-Bains, avec 200 parts de CHF 100
                        </span>
                    </td>
                    <td><span style="text-decoration: none;">associé gérant </span>&nbsp;</td>
                    
                    
                        <td><span style="text-decoration: none;">signature individuelle</span>&nbsp;</td>                   
                    
                    
                </tr>
            
            
            
            
        </tbody></table>

Upvotes: 1

Views: 943

Answers (2)

Padraic Cunningham

Reputation: 180542

This will get the name from the page. The table comes right after the anchor with the id adm; once you have that table, there are numerous ways to get what you need:

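from bs4 import BeautifulSoup
import requests

r = requests.get('http://www.rc2.vd.ch/registres/hrcintapp-pub/companyReport.action?rcentId=5947621600000055031025&lang=FR&showHeader=false')

soup = BeautifulSoup(r.content, "lxml")
# The table we want is the first one after the anchor with id="adm".
table = soup.select_one("#adm").find_next("table")
# Quote the attribute value: it contains a colon, so it is not a valid bare CSS identifier.
# The name is the text before the first comma in the first matching cell.
name = table.select_one('td span[style^="text-decoration:"]').text.split(",", 1)[0].strip()
print(name)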

Output:

Lass Christian

Or:

table = soup.select_one("#adm").find_next("table")
name = table.find("tr",bgcolor="#ffffff").td.span.text.split(",", 1)[0].strip()
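If the page can list several people, the same idea extends to a loop over the rows. The following is only a minimal sketch that runs offline against a trimmed copy of the table from the question. The `<a id="adm">` anchor in the sample is an assumption taken from the description above (it does not appear in the posted snippet), and `SAMPLE_HTML` and `extract_names` are hypothetical names used for illustration. It uses the built-in `html.parser` backend so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question. The <a id="adm"> anchor is an
# assumption based on the answer above; it is not part of the posted snippet.
SAMPLE_HTML = """
<a id="adm"></a>
<table border="1">
  <tr bgcolor="#f0eef2">
    <th>Nom et Prénoms, Origine, Domicile, Part sociale</th>
    <th>Fonctions</th>
    <th>Mode Signature</th>
  </tr>
  <tr bgcolor="#ffffff">
    <td><span style="text-decoration: none;">
      Lass Christian, du Danemark, à Yverdon-les-Bains, avec 200 parts de CHF 100
    </span></td>
    <td><span style="text-decoration: none;">associé gérant </span>&nbsp;</td>
    <td><span style="text-decoration: none;">signature individuelle</span>&nbsp;</td>
  </tr>
</table>
"""


def extract_names(html):
    """Return the name (text before the first comma) from each data row."""
    soup = BeautifulSoup(html, "html.parser")
    anchor = soup.find(id="adm")
    if anchor is None:
        return []
    table = anchor.find_next("table")
    if table is None:
        return []
    names = []
    for row in table.find_all("tr"):
        cell = row.find("td")  # header rows only contain <th>, so they are skipped
        if cell is None:
            continue
        names.append(cell.get_text().split(",", 1)[0].strip())
    return names


print(extract_names(SAMPLE_HTML))  # ['Lass Christian']
```

Returning an empty list when the anchor or table is missing avoids the `AttributeError` that a chained call on `None` would raise if the page layout differs from the sample.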

Upvotes: 2

John

Reputation: 16007

Something like this?

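# Assumes `soup` is the BeautifulSoup object for the page.
results = soup.find_all("tr", {"bgcolor": "#ffffff"})
for result in results:
    # The name is the text before the first comma in the row's first cell.
    the_name = result.td.span.get_text().split(",")[0].strip()
    print(the_name)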

Upvotes: 0
