Venkateshwaran Selvaraj

Reputation: 1785

Scrape a table with no class using BS4 in Python

I have the following code, which tries to scrape data from a table that has no class, on a webpage that contains many other unnecessary tables.

from bs4 import BeautifulSoup
import urllib2
import re
wiki = "http://www.maditssia.com/members/list.php?p=1&id=Engineering%20Industries"
header = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
title = ""
address = ""
contact = ""
phone = ""
description=""
email=""
table = soup.find("table")
#print table.text
#print re.sub(r'\s+',' ',''.join(table.text).encode('utf-8'))
for row in table.findAll("tr"):
   cells = row.findAll("td")
   if len(cells) >= 7:
    title = cells[0].find(text=True)
    address = cells[1].find(text=True)
    contact = cells[2].find(text=True)
    phone = cells[3].find(text=True)
    email= cells[4].find(text=True)
    description= cells[5].find(text=True)
    data = title + "," + address + "," + contact + "," + phone + "\n"
    print data

I am trying to iterate through the table rows and cells and save the contents of each td separately so that my output is not jumbled. Right now the code above does not display any data. The HTML structure of the webpage is quite difficult for me to parse.

My desired output is

Accurate Engineers | S.BALASUNDAR | Kadir Complex,4/153-1,Thilagar St. Melamadai Main Road, Thasildhar Nagar,Madurai - 625 020 | 2520049,RE:2534603,98652-40049 | [email protected] | Mfg.and Export of Machine for Mfg of Match Box

Upvotes: 1

Views: 3029

Answers (1)

Birei

Reputation: 36262

The information you want to extract is inside <table> elements whose class attribute has the value tableborder, so you can begin the search there. Then use CSS selectors to pick the <tr> and <td> elements that hold the data you want to extract.

An example with Python 3:

from bs4 import BeautifulSoup
import urllib.request as urllib2

wiki = "http://www.maditssia.com/members/list.php?p=1&id=Engineering%20Industries"
header = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

# Only the tables with class "tableborder" hold the data of interest.
for table in soup.find_all('table', attrs={'class': 'tableborder'}):
    data = []
    # Pick each field by row and column position.
    # .string is None for an empty cell, hence the "or ''".
    data.append(table.select('tr:nth-of-type(1) > td:nth-of-type(2)')[0].string or '')
    data.append(table.select('tr:nth-of-type(1) > td:nth-of-type(4)')[0].string or '')
    data.append(table.select('tr:nth-of-type(2) > td:nth-of-type(2)')[0].string or '')
    data.append(table.select('tr:nth-of-type(2) > td:nth-of-type(6)')[0].string or '')
    data.append(table.select('tr:nth-of-type(3) > td:nth-of-type(6)')[0].string or '')
    data.append(table.select('tr:nth-of-type(4) > td:nth-of-type(2)')[0].string or '')
    print(' | '.join(data))

Run it like:

python3 script.py

That yields:

Aali Industries | A.Yobu | 1, Ayyanar Koil 4th Street Sellur Madurai - 625 002 | 2530132,9345204255,9345204256 | [email protected] | 
Accurate Engineers, | S.BALASUNDAR | Kadir Complex,4/153-1,Thilagar St. Melamadai Main Road, Thasildhar Nagar | 2520049,RE:2534603,98652-40049 | [email protected] | 
Akber Ali Industries | S. Abuthalif | 73/15/1, East Anna Thoppu Street Madurai - 625 001  | 2343526,2341100 |  | 
Alagu Wire Products | Rm.Meiyappan | 193/4-A, Trichy Road Pudukottai - 2  | 236624,98424-44624,98428-44624 |  | 
Allwin Fasetners | S.Joseph Vasudevan | XXXXX6,Parasakthi Nagar XXXXXXXXAvaniapuram XXXXXXXMadurai - 625 012. | 2670577 |  | 
Allwinraj Metals Centre, | P.SELVARAJ NADAR | 180 and 181 East Veli Street, Madurai - 625 001.  | 2622181, RES:2626914 |  | 
Amirtham Engineering Works | G.Amirtharaj | 7A,Govindan Chetty Street, Simmakkal, Madurai - 625 001. | 2622417 |  | 
Amudha Wire Products, | A.M.P. Shanmugavel | Swahath Residency C-1, 3rd Floor, 708-A, 17th East Street, Anna Nagar | 6534939,99449-55199 |  | 
Ananthasiva Engg.Works, | P.KALUVAN | 19-B New Ramnad Road, Madurai - 625 009.  | 2337434,RES:2532598,2311910 | 98421-22200 | 
Angalamman Industries, | Palanivel | 40/C, Chinnandan Koil Road, Near Angalamman Koil, Karur - 639 001. | NIL |  |
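
If the site is unreachable, the same selector approach can be tried offline. The sketch below is only an illustration: SAMPLE_HTML is made-up markup (not the real page layout), extract_rows and cell are hypothetical helper names, and the row and column positions are chosen to fit the sample. It uses select_one so that a missing cell gives an empty string instead of an IndexError, and it names the html.parser parser explicitly.

```python
from bs4 import BeautifulSoup

# Made-up markup imitating one entry; the real page layout may differ.
SAMPLE_HTML = """
<table class="layout"><tr><td>navigation</td></tr></table>
<table class="tableborder">
  <tr><td>Name</td><td>Example Works</td><td>Contact</td><td>A. Person</td></tr>
  <tr><td>Address</td><td>1 Sample Street</td></tr>
  <tr><td>Email</td><td></td></tr>
</table>
"""


def extract_rows(html):
    """Return one list of field strings per table with class 'tableborder'."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for table in soup.find_all("table", class_="tableborder"):

        def cell(selector):
            # select_one returns None when nothing matches the selector.
            node = table.select_one(selector)
            if node is None:
                return ""
            # .string is None for an empty cell, so fall back to ''.
            return (node.string or "").strip()

        rows.append([
            cell("tr:nth-of-type(1) > td:nth-of-type(2)"),
            cell("tr:nth-of-type(1) > td:nth-of-type(4)"),
            cell("tr:nth-of-type(2) > td:nth-of-type(2)"),
            cell("tr:nth-of-type(3) > td:nth-of-type(2)"),
        ])
    return rows


for row in extract_rows(SAMPLE_HTML):
    print(" | ".join(row))
```

The table with class layout is skipped because find_all only matches tableborder, which is how the unrelated tables on the page are filtered out.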

Upvotes: 4
