Reputation: 41
I am trying to extract link titles that appear between two bolded tags on an HTML page using Python/Beautiful Soup.
The HTML snippet of what I am trying to extract is as follows:
<B>Heading Title 1:</B> <a href="link1">Title1</a>
<a href="link2">Title2</a>
<B>Heading Title 2:</B> <a href="link3">Title3</a>
<a href="link4">Title4</a>
<a href="link5">Title5</a>
...
I am specifically looking to concatenate Title1 and Title2 (separated by a delimiter) into one entry in a list-like object, and likewise for Title3, Title4, and Title5, and so on. (One issue I foresee is that the number of titles is not the same under each Heading Title.)
I've tried various approaches, including:
import requests, bs4, csv

res = requests.get('WEBSITE.html')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
soupy4 = soup.select('a')

categories = []
with open('output.csv', 'w') as f:
    writer = csv.writer(f, delimiter=',', lineterminator='\n')
    for line in soupy4:
        if 'common_element_link' in line['href']:
            categories.append(line.next_element)
            # this writes the whole (growing) list on every iteration
            writer.writerow([categories])
However, while this writes all titles to a file, it does so by appending each additional title to the same growing list, like so:
['Title1']
['Title1', 'Title2']
['Title1', 'Title2', 'Title3']
['Title1', 'Title2', 'Title3', 'Title4']
...
Ideally, I want this code to do the following:
['Title1', 'Title2']
['Title3', 'Title4', 'Title5']
...
I am very much a newbie with regard to Python lists and programming in general, and I am at a loss for how to proceed. I would appreciate any and all feedback.
Thank you!
Upvotes: 4
Views: 608
Reputation: 71471
You can use itertools.groupby to combine all link text between headings:
import itertools, re
from bs4 import BeautifulSoup as soup

# content holds the HTML snippet from the question
d = [[i.name, i] for i in soup(content, 'html.parser').find_all(re.compile('b|a'))]
new_d = [[a, list(b)] for a, b in itertools.groupby(d, key=lambda x: x[0] == 'b')]
final_result = [[c.text for _, c in b] for a, b in new_d if not a]
Output:
[['Title1', 'Title2'], ['Title3', 'Title4', 'Title5']]
The original find_all call works as a "flattener" and creates a list of lists with the target tag names and content. itertools.groupby has a key that groups based on whether the tag name is for bold content. Thus, a final pass can be made over new_d, ignoring the b groups and extracting the text from the links.
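To make the grouping step concrete, here is a minimal standalone sketch (the tags list below is illustrative, not taken from the page) of how groupby with a boolean key splits a flat sequence into alternating groups:

import itertools

# consecutive items that yield the same key value land in the same group
tags = ['b', 'a', 'a', 'b', 'a', 'a', 'a']
groups = [(is_b, list(g)) for is_b, g in itertools.groupby(tags, key=lambda t: t == 'b')]
print(groups)
# [(True, ['b']), (False, ['a', 'a']), (True, ['b']), (False, ['a', 'a', 'a'])]

Dropping the True groups and keeping the False ones is exactly what the final list comprehension does with the real tag objects, and final_result can then be passed straight to csv.writer's writerows() if you want the CSV output from the question.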
Upvotes: 3
Reputation: 84475
You could use the nth-of-type and :not pseudo-classes with the general sibling combinator ~. As the a tags are all siblings in the shown html, I believe, I use the b tags with nth-of-type to split the a tags between them into blocks, and I use :not to remove the later a siblings from the current block.
from bs4 import BeautifulSoup as bs

html = '''
<B>Heading Title 1:</B> <a href="link1">Title1</a>
<a href="link2">Title2</a>
<B>Heading Title 2:</B> <a href="link3">Title3</a>
<a href="link4">Title4</a>
<a href="link5">Title5</a>
'''
soup = bs(html, 'lxml')
items = soup.select('b:has(~a)')
length = len(items)

if length == 1:
    row = [item.text for item in soup.select('b ~ a')]
    print(row)
elif length > 1:
    for i in range(1, length + 1):
        row = [item.text for item in soup.select('b:nth-of-type(' + str(i) + ') ~ a:not(b:nth-of-type(' + str(i + 1) + ') ~ a)')]
        print(row)
Output:

['Title1', 'Title2']
['Title3', 'Title4', 'Title5']
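For reference, these are the selector strings the loop builds for the sample HTML (shown for illustration):

b:nth-of-type(1) ~ a:not(b:nth-of-type(2) ~ a)
b:nth-of-type(2) ~ a:not(b:nth-of-type(3) ~ a)

The first keeps the a siblings that follow the first b while excluding those that also follow the second b; the second keeps everything after the second b, since there is no third b to exclude against.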
Upvotes: 3
Reputation: 164
Your issue is that you're looping through all 'a' tags without any grouping pattern. Is it every 3 links you want to concatenate? If so, you can put a for loop inside, stepping through the links in chunks of three:
mainlist = []
for i in range(0, len(alllinks), 3):     # step through the links three at a time
    maintitle = ''
    for link in alllinks[i:i + 3]:       # inner loop over one chunk of three links
        maintitle += link.text
    mainlist.append(maintitle)
Look for parent blocks, then loop through the nested children:
sp = sp.find('div', id='whatever')
a = sp.select('a')     # select() is recursive: it finds all a tags inside that div
for tag in a:
    title = tag.text.strip()
    url = tag['href']
I recommend looking for parent HTML tags around the links you want to group together, instead of grouping them arbitrarily by the order of all the links.
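For instance, here is a minimal sketch assuming a hypothetical structure in which each heading and its links are wrapped in a parent <div class="group"> container (this is not the structure shown in the question):

from bs4 import BeautifulSoup

html = '''
<div class="group"><B>Heading 1:</B> <a href="link1">Title1</a> <a href="link2">Title2</a></div>
<div class="group"><B>Heading 2:</B> <a href="link3">Title3</a> <a href="link4">Title4</a> <a href="link5">Title5</a></div>
'''
soup = BeautifulSoup(html, 'html.parser')
# one row per parent container, however many links it holds
rows = [[a.text for a in div.select('a')] for div in soup.select('div.group')]
print(rows)   # [['Title1', 'Title2'], ['Title3', 'Title4', 'Title5']]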
P.S. You can also make find() recursive by using the recursive=True option, though that's not recommended here.
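For reference, a small sketch of that recursive flag (find() and find_all() both accept it; recursive=False restricts the search to direct children, while the default searches all descendants):

# restrict the search to direct children of sp only
direct_links = sp.find_all('a', recursive=False)
# default behaviour: search every descendant of sp
all_links = sp.find_all('a')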
Adding strings together: str3 = str1 + str2
llist = []
for z in range(10):
    llist.append('bob' + str(z))
# each list item has an index
print(llist[1])   # prints 'bob1'
Read up on lists, strings, and dictionaries.
Upvotes: 2