chip

Reputation: 1789

Extracting data in table using BeautifulSoup

I'm scraping this page for my Android app. I'd like to extract the data from the table of cities and area codes.

Here's my code:

from bs4 import BeautifulSoup
import urllib2
import re

base_url = "http://www.howtocallabroad.com/taiwan/"
html_page = urllib2.urlopen(base_url)
soup = BeautifulSoup(html_page)
codes = soup.select("#codes tbody > tr > td")
for area_code in codes:
    # print td city and area code

I'd like to know which function in Python or BeautifulSoup gets the value out of <td>value</td>.

Sorry, I'm just an Android dev learning to write Python.

Upvotes: 3

Views: 2503

Answers (1)

TerryA

Reputation: 60024

You can use findAll(), along with a function that breaks a list into chunks:

>>> areatable = soup.find('table',{'id':'codes'})
>>> def chunks(l, n):
...     return [l[i:i+n] for i in range(0, len(l), n)]
...
>>> dict(chunks([i.text for i in areatable.findAll('td')], 2))
{u'Chunan': u'36', u'Penghu': u'69', u'Wufeng': u'4', u'Fengyuan': u'4', u'Kaohsiung': u'7', u'Changhua': u'47', u'Pingtung': u'8', u'Keelung': u'2', u'Hsinying': u'66', u'Chungli': u'34', u'Suao': u'39', u'Yuanlin': u'48', u'Yungching': u'48', u'Panchiao': u'2', u'Taipei': u'2', u'Tainan': u'62', u'Peikang': u'5', u'Taichung': u'4', u'Yungho': u'2', u'Hsinchu': u'35', u'Tsoying': u'7', u'Hualien': u'38', u'Lukang': u'47', u'Talin': u'5', u'Chiaochi': u'39', u'Fengshan': u'7', u'Sanchung': u'2', u'Tungkang': u'88', u'Taoyuan': u'33', u'Hukou': u'36'}

Explanation:

.find() finds a table with an id of codes. The chunks function is used to split up a list into evenly sized chunks.

As findAll returns a list, we use chunks on the list to create something like:

[[u'Changhua', u'47'], [u'Keelung', u'2'], etc]

i.text for i in... gets the text of each td tag; without it, the <td> and </td> markup would remain.

Finally, dict() is called to convert the list of lists into a dictionary, which you can use to look up a city's area code.
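
If you want to try the chunk-and-dict step on its own, without fetching the page, here is a minimal sketch. It assumes Python 3, and cell_texts is a made-up stand-in for the list that [i.text for i in areatable.findAll('td')] would give you, so the values are only illustrative:

```python
def chunks(items, n):
    """Split items into consecutive slices of length n (the last may be shorter)."""
    return [items[i:i + n] for i in range(0, len(items), n)]

# Hypothetical stand-in for the td texts scraped from the table:
# city, code, city, code, ...
cell_texts = ["Changhua", "47", "Keelung", "2", "Taipei", "2"]

# Group the flat list into [city, code] pairs.
pairs = chunks(cell_texts, 2)

# dict() accepts any sequence of two-item sequences as key/value pairs.
area_codes = dict(pairs)

print(area_codes["Keelung"])
```

Note that this relies on the cells alternating strictly between city and code. If the list has an odd number of items, the last chunk has only one element and dict() raises a ValueError.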

Upvotes: 4
