Forge
Forge

Reputation: 1677

how to preserve links when scraping a table with beautiful soup and pandas

Scraping a web to get a table, using Beautiful soup and Pandas. One of the columns got some urls. When I pass html to pandas, href are lost.

is there any way of preserving the url link just for that column?

Example data (edited for better suit ral case):

  <html>
        <body>
          <table>
              <tr>
               <td>customer</td>
               <td>country</td>
               <td>area</td>
               <td>website link</td>
             </tr>
             <tr>
               <td>IBM</td>
               <td>USA</td>
               <td>EMEA</td>
               <td><a href="http://www.ibm.com">IBM site</a></td>
            </tr>
          <tr>
            <td>CISCO</td>
            <td>USA</td>
            <td>EMEA</td>
            <td><a href="http://www.cisco.com">cisco site</a></td>
         </tr>
           <tr>
            <td>unknown company</td>
            <td>USA</td>
            <td>EMEA</td>
            <td></td>
         </tr>
       </table>
     </body>
  </html>

My python code:

    file = open(url,"r")

    soup = BeautifulSoup(file, 'lxml')

    parsed_table = soup.find_all('table')[1] 

    df = pd.read_html(str(parsed_table),encoding='utf-8')[0]

 df

Output (exported to CSV):

customer;country;area;website
IBM;USA;EMEA;IBM site
CISCO;USA;EMEA;cisco site
unknown company;USA;EMEA;

df output is ok but the link is lost. I need to preserve the link. The URL at least.

any hint?

Upvotes: 11

Views: 10319

Answers (3)

PythonMan
PythonMan

Reputation: 897

Here is another way to do it if you have more than one link to grab from html table. Instead of making list comprehension I would rather go with separeated for loops so code is more readable to those who are new to python and it is easier to adjust code ore handle errors if they emerge. I hope it will help someone.

soup = BeautifulSoup(html, "lxml")
table = table.find('table')
thead = table.find('thead')
column_names = [th.text.strip() for th in thead.find_all('th')]

data = []
for row in table.find_all('tr'):
    row_data = []
    for td in row.find_all('td'):
        td_check = td.find('a')
        if td_check is not None:
            link = td.a['href']
            row_data.append(link)
        else:
            not_link = ''.join(td.stripped_strings)
            if not_link == '':
                 not_link = None
            row_data.append(not_link)
    data.append(row_data)
df = pd.DataFrame(data[1:], columns=column_names)
df_dict = df.to_dict('records')

for row in df_dict:
    print(row)

Upvotes: 4

useRj
useRj

Reputation: 1312

Just check if tag exists this way:

 import numpy as np

 with open(url,"r") as f:
     sp = bs.BeautifulSoup(f, 'lxml')
     tb = sp.find_all('table')[56] 
     df = pd.read_html(str(tb),encoding='utf-8', header=0)[0]
     df['href'] = [np.where(tag.has_attr('href'),tag.get('href'),"no link") for tag in tb.find_all('a')]

Upvotes: 7

unutbu
unutbu

Reputation: 879251

pd.read_html assumes the data you are interested in is in the text, not the tag attributes. However, it isn't hard to scrape the table yourself:

import bs4 as bs
import pandas as pd

with open(url,"r") as f:
    soup = bs.BeautifulSoup(f, 'lxml')
    parsed_table = soup.find_all('table')[1] 
    data = [[td.a['href'] if td.find('a') else 
             ''.join(td.stripped_strings)
             for td in row.find_all('td')]
            for row in parsed_table.find_all('tr')]
    df = pd.DataFrame(data[1:], columns=data[0])
    print(df)  

yields

          customer country  area          website link
0              IBM     USA  EMEA    http://www.ibm.com
1            CISCO     USA  EMEA  http://www.cisco.com
2  unknown company     USA  EMEA                      

Upvotes: 16

Related Questions