find tags except those with attributes: BeautifulSoup

Question

On the page I'm trying to scrape, I want to exclude the tags that have attributes.

Click here for a comprehensive area code list for Argentina

I'd like to know which function(s) to use to exclude this tag with attributes.

My code gets all cities and area codes:

from bs4 import BeautifulSoup
import urllib2
import re

url = "http://www.howtocallabroad.com/argentina"
html_page = urllib2.urlopen(url)
soup = BeautifulSoup(html_page)

areatable = soup.find('table',{'id':'codes'})
if areatable is None:
    print "areatable is None"
else:
    d = {}

    def chunks(l, n):
        return [l[i : i + n] for i in range(0, len(l), n)]

    all_td = areatable.findAll('td')
    print all_td

    li = dict(chunks([i.text for i in all_td], 2))
    print li

But when I try to print li, it throws an exception:

Traceback (most recent call last):
  File "extract_table.py", line 21, in <module>
    li = dict(chunks([i.text for i in all_td], 2))
ValueError: dictionary update sequence element #30 has length 1; 2 is required

This is what I get when I call areatable.findAll('td'):

[
Buenos Aires,
11,
La Rioja,
380,
Salta,
387,
Bahia Blanca,
291,
Mar del Plata,
223,
San Juan,
264,
Catamarca
,
383,
Mendoza,
261,
San Luis,
266,
Comodoro Rivadavia,
297,
Mercedes/Prov. B.A.,
2324,
San Nicolas,
336,
Concordia,
345,
Neuquen,
299,
San Rafael,
260,
Cordoba,
351,
Parana,
343,
Santa Fe,
342,
Corrientes,
379,
Posadas,
376,
Santiago del Estero,
385,
Formosa,
370,
Resistencia,
362,
Santo Tome,
3756,
Jesus Maria,
3525,
Rio Cuarto,
358,
Tandil,
249,
La Plata,
221,
Rosario,
341,
Trelew,
280,
Click here for a comprehensive area code list for Argentina
]

TerryA · Accepted Answer

The problem is that all_td has an odd number of elements, so the last chunk returned by chunks contains only one item and dict() cannot turn it into a key/value pair. Here is a simple lambda function that checks whether a tag has no attributes, which you can use to keep only the td tags without attributes:

>>> all_td = filter(lambda x: x.attrs == {}, all_td)
# all_td now contains [Buenos Aires, 11, La Rioja, 380, Salta, 387, Bahia Blanca, 291, Mar del Plata, 223, San Juan, 264, Catamarca, 383, Mendoza, 261, San Luis, 266, Comodoro Rivadavia, 297, Mercedes/Prov. B.A., 2324, San Nicolas, 336, Concordia, 345, Neuquen, 299, San Rafael, 260, Cordoba, 351, Parana, 343, Santa Fe, 342, Corrientes, 379, Posadas, 376, Santiago del Estero, 385, Formosa, 370, Resistencia, 362, Santo Tome, 3756, Jesus Maria, 3525, Rio Cuarto, 358, Tandil, 249, La Plata, 221, Rosario, 341, Trelew, 280]

Put simply, the lambda function returns True if the tag has no attributes. filter() goes through each element in all_td and calls the lambda function on it. If the lambda function returns False for a given tag, that tag is left out of the result. A new list is returned; the original is not modified. (In Python 3, filter() returns an iterator instead, so wrap the call in list().)
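As a self-contained illustration of the same idea, here is a minimal sketch that runs against a small inline HTML fragment instead of the live page. The fragment, including the colspan attribute on the last cell, is made up for the example and is not the real markup of the page. It is written for Python 3 and uses a list comprehension in place of filter(), which should be equivalent here.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real table; the colspan attribute is an assumption.
html = """
<table id="codes">
  <tr><td>Buenos Aires</td><td>11</td></tr>
  <tr><td>La Rioja</td><td>380</td></tr>
  <tr><td colspan="2">Click here for a comprehensive area code list</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
areatable = soup.find("table", {"id": "codes"})

# Keep only the <td> tags whose attribute dict is empty.
plain_td = [td for td in areatable.find_all("td") if not td.attrs]

# Extract the visible text of each remaining cell.
texts = [td.get_text(strip=True) for td in plain_td]
print(texts)
```

The cell carrying colspan is dropped because its attrs dict is not empty, so only the city and area code cells remain.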

Now when you call chunks, the list has an even number of elements, so every chunk is a complete (city, code) pair and no error should appear.
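To see why the length matters, the following sketch reuses the chunks function from the question on two short lists. The sample values are placeholders chosen for the example; the last entry of the odd list stands in for the extra cell.

```python
def chunks(l, n):
    """Split list l into consecutive pieces of length n (the last may be shorter)."""
    return [l[i:i + n] for i in range(0, len(l), n)]

# Placeholder data: an even-length list and the same list with one extra item.
even = ["Buenos Aires", "11", "La Rioja", "380"]
odd = even + ["Click here"]

# Every chunk has two items, so dict() can treat each one as a key/value pair.
codes = dict(chunks(even, 2))
print(codes)

# The final chunk has a single item, so dict() raises ValueError.
try:
    dict(chunks(odd, 2))
    error = None
except ValueError as exc:
    error = str(exc)
print(error)
```

With the even list, dict() builds a city-to-code mapping; with the odd list, the trailing one-item chunk triggers the same kind of ValueError shown in the question.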
