Hick

Reputation: 36404

How do I use Beautiful Soup to extract information from this HTML in Python?

<font class="detDesc">Uploaded 10-29&nbsp;18:50, Size 4.36&nbsp;GiB, ULed by <a class="detDesc" href="/user/NLUPPER002/" title="Browse NLUPPER002">NLUPPER002</a></font>

I need "Uploaded 10-29 18:50, Size 4.36 GiB" and "NLUPPER002" in two separate arrays. How do I do it?

Edit:

This is part of an HTML page that has many of these font tags with different values. I need a generic solution, ideally using Beautiful Soup. Otherwise, as suggested, I will look into regex.

Edit2:

I have a related doubt. If we use "class" as a key to traverse a soup, won't it clash with the Python keyword class and throw an error?

Upvotes: 1

Views: 470

Answers (2)

Shawn Chin

Reputation: 86864

soup = BeautifulSoup(your_data)
uploaded = []
link_data = []
for f in soup.findAll("font", {"class":"detDesc"}):
    uploaded.append(f.contents[0]) 
    link_data.append(f.a.contents[0])  

For example, using the following data:

your_data = """
<font class="detDesc">Uploaded 10-29&nbsp;18:50, Size 4.36&nbsp;GiB, ULed by <a class="detDesc" href="/user/NLUPPER002/" title="Browse NLUPPER002">NLUPPER002</a></font>
<div id="meow">test</div>
<font class="detDesc">Uploaded 10-26&nbsp;19:23, Size 1.16&nbsp;GiB, ULed by <a class="detDesc" href="/user/NLUPPER002/" title="Browse NLUPPER002">NLUPPER003</a></font>
"""

running the code above gives you:

>>> print uploaded
[u'Uploaded 10-29&nbsp;18:50, Size 4.36&nbsp;GiB, ULed by ', u'Uploaded 10-26&nbsp;19:23, Size 1.16&nbsp;GiB, ULed by ']
>>> print link_data
[u'NLUPPER002', u'NLUPPER003']

To get the text in the exact form as you mentioned, you can either post-process the list or parse the data within the loop itself. For example:

>>> [",".join(x.split(",")[:2]).replace("&nbsp;", " ") for x in uploaded]
[u'Uploaded 10-29 18:50, Size 4.36 GiB', u'Uploaded 10-26 19:23, Size 1.16 GiB']
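The other option mentioned, cleaning each string inside the loop, could be sketched as a small helper. This is only an illustration, written for Python 3; the name clean_upload_text is hypothetical, and it assumes the text always has the "Uploaded ..., Size ..., ULed by" shape shown in the question.

```python
def clean_upload_text(raw):
    """Keep the 'Uploaded ..., Size ...' part and normalise non-breaking spaces."""
    # Depending on the parser, the entity may survive as the literal "&nbsp;"
    # or be decoded to the character U+00A0, so both are replaced here.
    text = raw.replace("&nbsp;", " ").replace("\xa0", " ")
    # Keep the first two comma-separated fields, dropping ", ULed by ".
    return ",".join(text.split(",")[:2]).strip()


print(clean_upload_text("Uploaded 10-29&nbsp;18:50, Size 4.36&nbsp;GiB, ULed by "))
# Uploaded 10-29 18:50, Size 4.36 GiB
```

Inside the loop, the append would then become uploaded.append(clean_upload_text(f.contents[0])).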

P.S. If you're a fan of list comprehensions, the solution can be expressed as a one-liner:

output = [(f.contents[0], f.a.contents[0]) for f in soup.findAll("font", {"class":"detDesc"})]

This gives you:

>>> output  # list of tuples
[(u'Uploaded 10-29&nbsp;18:50, Size 4.36&nbsp;GiB, ULed by ', u'NLUPPER002'), (u'Uploaded 10-26&nbsp;19:23, Size 1.16&nbsp;GiB, ULed by ', u'NLUPPER003')]

>>> uploaded, link_data = zip(*output)  # split into two separate tuples
>>> uploaded
(u'Uploaded 10-29&nbsp;18:50, Size 4.36&nbsp;GiB, ULed by ', u'Uploaded 10-26&nbsp;19:23, Size 1.16&nbsp;GiB, ULed by ')
>>> link_data
(u'NLUPPER002', u'NLUPPER003')
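On the "class" concern from the second edit of the question: the string "class" used as a dictionary key, as in the findAll call above, is ordinary data and does not clash with the keyword. Only the bare word class used as a keyword argument name is rejected by the parser, which is why Beautiful Soup 4 also accepts a class_ keyword argument. The sketch below checks syntax only, using the standard library; is_valid_syntax is a hypothetical helper, and findAll is never actually called.

```python
def is_valid_syntax(source):
    """Return True if the expression parses, without evaluating it."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False


# "class" as a string key in a dict: fine.
dict_form = is_valid_syntax('findAll("font", {"class": "detDesc"})')
# class as a bare keyword argument: a syntax error.
kwarg_form = is_valid_syntax('findAll("font", class="detDesc")')
# class_ with a trailing underscore: an ordinary identifier.
underscore_form = is_valid_syntax('find_all("font", class_="detDesc")')

print(dict_form, kwarg_form, underscore_form)
# True False True
```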

Upvotes: 2

Acorn

Reputation: 50497

The expression you need to find the elements you're interested in depends on what is unique about those elements compared to the other elements in the document. Without the surrounding context of the element, it's difficult to give a precise answer.

Are the elements you're interested in the only ones in the document that are font elements and have a class of detDesc?

If so, here is a solution using lxml:

import lxml.html as lh

html = '''
<font class="detDesc">Uploaded 10-29&nbsp;18:50, Size 4.36&nbsp;GiB, ULed by <a class="detDesc" href="/user/NLUPPER002/" title="Browse NLUPPER002">NLUPPER002</a></font>
'''

tree = lh.fromstring(html)

results = []

# iterate over all <font> elements in the document whose class is exactly "detDesc"
for el in tree.xpath("//font[@class='detDesc']"):

    # extract text from the font element
    first = el.text

    # extract text from the first <a> within the font element
    second = el.xpath("a")[0].text

    results.append((first, second))

print results

Result:

[(u'Uploaded 10-29\xa018:50, Size 4.36\xa0GiB, ULed by ', 'NLUPPER002')]
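The first item of each tuple still contains non-breaking spaces (\xa0) and the trailing ", ULed by " fragment. A minimal sketch of turning such tuples into the two separate lists asked for, written for Python 3 and with no lxml needed: the results value is copied from the output above, and the names details and uploaders are only illustrative.

```python
# Sample value copied from the lxml output shown above.
results = [("Uploaded 10-29\xa018:50, Size 4.36\xa0GiB, ULed by ", "NLUPPER002")]

details = []
uploaders = []
for first, second in results:
    # Replace non-breaking spaces with ordinary spaces.
    text = first.replace("\xa0", " ")
    # Cut everything from the last comma onwards (", ULed by ").
    text = text.rsplit(",", 1)[0].strip()
    details.append(text)
    uploaders.append(second)

print(details)
# ['Uploaded 10-29 18:50, Size 4.36 GiB']
print(uploaders)
# ['NLUPPER002']
```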

Upvotes: 1
