Reputation: 36404
<font class="detDesc">Uploaded 10-29 18:50, Size 4.36 GiB, ULed by <a class="detDesc" href="/user/NLUPPER002/" title="Browse NLUPPER002">NLUPPER002</a></font>
I need "Uploaded 10-29 18:50, Size 4.36 GiB" and "NLUPPER002" in two separate arrays. How do I do it?
Edit:
This is part of an HTML page that contains many of these font tags with different values. I need a generic solution, using BeautifulSoup if possible. Otherwise, as suggested, I will look into regex.
Edit2:
I have a related question. If we use "class" as a key to traverse a soup, won't it clash with the Python keyword class and raise an error?
Upvotes: 1
Views: 470
Reputation: 86864
soup = BeautifulSoup(your_data)
uploaded = []
link_data = []
for f in soup.findAll("font", {"class":"detDesc"}):
    uploaded.append(f.contents[0])
    link_data.append(f.a.contents[0])
For example, using the following data:
your_data = """
<font class="detDesc">Uploaded 10-29 18:50, Size 4.36 GiB, ULed by <a class="detDesc" href="/user/NLUPPER002/" title="Browse NLUPPER002">NLUPPER002</a></font>
<div id="meow">test</div>
<font class="detDesc">Uploaded 10-26 19:23, Size 1.16 GiB, ULed by <a class="detDesc" href="/user/NLUPPER002/" title="Browse NLUPPER002">NLUPPER003</a></font>
"""
running the code above gives you:
>>> print uploaded
[u'Uploaded 10-29 18:50, Size 4.36 GiB, ULed by ', u'Uploaded 10-26 19:23, Size 1.16 GiB, ULed by ']
>>> print link_data
[u'NLUPPER002', u'NLUPPER003']
To get the text in the exact form as you mentioned, you can either post-process the list or parse the data within the loop itself. For example:
>>> [",".join(x.split(",")[:2]).replace("&nbsp;", " ") for x in uploaded]
[u'Uploaded 10-29 18:50, Size 4.36 GiB', u'Uploaded 10-26 19:23, Size 1.16 GiB']
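The other option mentioned above, parsing inside the loop itself, might look like the following sketch. It is written for Python 3 and uses hard-coded pairs as stand-ins for (f.contents[0], f.a.contents[0]), so it runs without a parser installed. The helper name trim_description is hypothetical, and the code assumes the first two comma-separated fields are always the upload date and the size, as in the sample data. It also assumes the non-breaking space may show up either as a literal &nbsp; or as \xa0, depending on the parser and its settings.

```python
def trim_description(raw):
    """Keep only the 'Uploaded ..., Size ...' part of the font text."""
    # Normalise both forms a non-breaking space may take (assumption:
    # which one appears depends on the parser and its entity handling).
    text = raw.replace("&nbsp;", " ").replace("\xa0", " ")
    # Assumption: the first two comma-separated fields are date and size.
    fields = [part.strip() for part in text.split(",")[:2]]
    return ", ".join(fields)


# Stand-ins for (f.contents[0], f.a.contents[0]) produced by the loop.
pairs = [
    ("Uploaded 10-29\xa018:50, Size 4.36\xa0GiB, ULed by ", "NLUPPER002"),
    ("Uploaded 10-26&nbsp;19:23, Size 1.16&nbsp;GiB, ULed by ", "NLUPPER003"),
]

uploaded = []
link_data = []
for text, uploader in pairs:
    uploaded.append(trim_description(text))
    link_data.append(uploader)

print(uploaded)
print(link_data)
```

In the real loop, the two append calls would take trim_description(f.contents[0]) and f.a.contents[0] instead of the stand-in values.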
P.S. If you're a fan of list comprehensions, the solution can be expressed as a one-liner:
output = [(f.contents[0], f.a.contents[0]) for f in soup.findAll("font", {"class":"detDesc"})]
This gives you:
>>> output # list of tuples
[(u'Uploaded 10-29 18:50, Size 4.36 GiB, ULed by ', u'NLUPPER002'), (u'Uploaded 10-26 19:23, Size 1.16 GiB, ULed by ', u'NLUPPER003')]
>>> uploaded, link_data = zip(*output) # split into two separate tuples
>>> uploaded
(u'Uploaded 10-29 18:50, Size 4.36 GiB, ULed by ', u'Uploaded 10-26 19:23, Size 1.16 GiB, ULed by ')
>>> link_data
(u'NLUPPER002', u'NLUPPER003')
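On the keyword question raised in Edit2: the code above passes "class" as a quoted string inside a dictionary, and a string is never treated as a keyword. Only a bare class used as an identifier, for example as a keyword argument name, is rejected. The following small Python 3 sketch illustrates the difference without needing BeautifulSoup. The variable names are made up for the illustration.

```python
# "class" inside quotes is an ordinary string, so it is a valid dict key.
attrs = {"class": "detDesc"}
print(attrs["class"])

# Using class as a bare keyword-argument name is what fails, and it fails
# when the source is compiled, before any code runs.
try:
    compile('dict(class="detDesc")', "<example>", "eval")
    keyword_form_compiles = True
except SyntaxError:
    keyword_form_compiles = False

print(keyword_form_compiles)
```

For the same reason, the newer BeautifulSoup 4 accepts the spelling class_ when the attribute is passed as a keyword argument, as in soup.find_all("font", class_="detDesc").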
Upvotes: 2
Reputation: 50497
The expression you need to use to find the elements you're interested in depends on what is unique about those elements compared to other elements in the document. Therefore without the context of the element, it's difficult to help.
Are the elements you're interested in the only ones in the document that are font elements and have a class of detDesc?
If so, here is a solution using lxml:
import lxml.html as lh
html = '''
<font class="detDesc">Uploaded 10-29 18:50, Size 4.36 GiB, ULed by <a class="detDesc" href="/user/NLUPPER002/" title="Browse NLUPPER002">NLUPPER002</a></font>
'''
tree = lh.fromstring(html)
results = []
# iterate over all font elements in the document that have a class of "detDesc"
for el in tree.xpath("//font[@class='detDesc']"):
    # extract the text that precedes the first child of the font element
    first = el.text
    # extract text from the first <a> within the font element
    second = el.xpath("a")[0].text
    results.append((first, second))
print results
Result:
[(u'Uploaded 10-29\xa018:50, Size 4.36\xa0GiB, ULed by ', 'NLUPPER002')]
Upvotes: 1