loop_de_loop

Reputation: 41

How do I use Python/Beautiful Soup to extract text between two different tags?

I am trying to extract the link titles that sit between two bolded tags on an HTML page using Python/Beautiful Soup.

The HTML snippet of what I am trying to extract is as follows:

<B>Heading Title 1:</B>&nbsp;<a href="link1">Title1</a>&nbsp;
<a href="link2">Title2</a>&nbsp;

&nbsp;

<B>Heading Title 2:</B>&nbsp;<a href="link3">Title3</a>&nbsp;
<a href="link4">Title4</a>&nbsp;
<a href="link5">Title5</a>&nbsp;

...

I am specifically looking to concatenate Title1 and Title2 (separated by a delimiter) into one entry in a list-like object, and likewise for Title3, Title4, and Title5, and so on. (One issue I foresee is that the number of titles is not the same under each Heading Title.)

I've tried various approaches, including:

import csv
import requests
from bs4 import BeautifulSoup

res = requests.get('WEBSITE.html')

soup = BeautifulSoup(res.text, 'html.parser')

soupy4 = soup.select('a')

categories = []  # accumulates titles across the whole page

with open('output.csv', 'w') as f:
    writer = csv.writer(f, delimiter=',', lineterminator='\n')
    for line in soupy4:
        if 'common_element_link' in line['href']:
            categories.append(line.next_element)
            writer.writerow([categories])

However, while this writes all titles to a file, it does so by directly appending each additional title like so:

['Title1']
['Title1', 'Title2']
['Title1', 'Title2', 'Title3']
['Title1', 'Title2', 'Title3', 'Title4']
...

Ideally, I want this code to do the following:

['Title1', 'Title2']
['Title3', 'Title4', 'Title5']
...

I am very much a newbie when it comes to Python lists and programming in general, and I am at a loss for how to proceed. I would appreciate any feedback anyone may have on this.

Thank you!

Upvotes: 4

Views: 608

Answers (3)

Ajax1234

Reputation: 71471

You can use itertools.groupby to combine all link text between headings:

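import itertools
from bs4 import BeautifulSoup as soup

# content holds the page's HTML as a string (e.g. res.text in the question)
# Collect every <b> and <a> tag in document order, paired with its tag name
d = [[i.name, i] for i in soup(content, 'html.parser').find_all(['b', 'a'])]
# Split the flat list into alternating runs of heading (<b>) and link (<a>) tags
new_d = [[a, list(b)] for a, b in itertools.groupby(d, key=lambda x: x[0] == 'b')]
# Keep only the link runs and extract their text
final_result = [[c.text for _, c in b] for a, b in new_d if not a]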

Output:

[['Title1', 'Title2'], ['Title3', 'Title4', 'Title5']]

The original find_all call works as a "flattener" and creates a list of lists with the target tag names and content. itertools.groupby has a key that groups based on whether the tag name is for bold content. Thus, a final pass can be made over new_d, ignoring b groups, and extracting the text from the links.
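For the delimiter-joined entries the question asks for, the same grouping idea can be sketched as a small self-contained function. The names group_links and delimiter are hypothetical, and the markup is the snippet from the question; treat this as a sketch under those assumptions rather than a drop-in solution for the real page.

```python
import itertools
from bs4 import BeautifulSoup

# Sample markup copied from the question
content = '''
<B>Heading Title 1:</B>&nbsp;<a href="link1">Title1</a>&nbsp;
<a href="link2">Title2</a>&nbsp;

&nbsp;

<B>Heading Title 2:</B>&nbsp;<a href="link3">Title3</a>&nbsp;
<a href="link4">Title4</a>&nbsp;
<a href="link5">Title5</a>&nbsp;
'''

def group_links(html, delimiter='; '):
    """Return one delimiter-joined string of link texts per heading (hypothetical helper)."""
    tags = BeautifulSoup(html, 'html.parser').find_all(['b', 'a'])
    rows = []
    # groupby yields alternating runs: heading tags (True), then link tags (False)
    for is_heading, run in itertools.groupby(tags, key=lambda t: t.name == 'b'):
        if not is_heading:
            rows.append(delimiter.join(t.get_text(strip=True) for t in run))
    return rows

rows = group_links(content)
print(rows)  # ['Title1; Title2', 'Title3; Title4; Title5']
```

Note that two headings with no links between them would fall into a single heading run, so no empty entry is produced for the first of them.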

Upvotes: 3

QHarr

Reputation: 84475

You could combine the :nth-of-type and :not pseudo-classes with the general sibling combinator (~). As far as I can tell, the a tags in the HTML shown are all siblings, so I use the b tags with :nth-of-type to split the a tags into blocks, and :not to exclude the a siblings that follow the next b tag.

from bs4 import BeautifulSoup as bs

html = '''
<B>Heading Title 1:</B>&nbsp;<a href="link1">Title1</a>&nbsp;
<a href="link2">Title2</a>&nbsp;

&nbsp;

<B>Heading Title 2:</B>&nbsp;<a href="link3">Title3</a>&nbsp;
<a href="link4">Title4</a>&nbsp;
<a href="link5">Title5</a>&nbsp;
'''
soup = bs(html, 'lxml')
items = soup.select('b:has(~a)')
length = len(items)
if length == 1:
    row = [item.text for item in soup.select('b ~ a')]
    print(row)
elif length > 1:
    for i in range(1, length + 1):
        row = [item.text for item in soup.select('b:nth-of-type(' + str(i) + ') ~ a:not(b:nth-of-type(' + str(i + 1) + ') ~ a)')]
        print(row)

Output:

[screenshot of the printed rows; image not preserved]
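If the selector string gets hard to read, the same sibling relationship can be walked directly. This is a different technique from the CSS selectors above: a minimal sketch using next_siblings, again assuming the b and a tags share one parent as in the question's snippet.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '''
<B>Heading Title 1:</B>&nbsp;<a href="link1">Title1</a>&nbsp;
<a href="link2">Title2</a>&nbsp;

&nbsp;

<B>Heading Title 2:</B>&nbsp;<a href="link3">Title3</a>&nbsp;
<a href="link4">Title4</a>&nbsp;
<a href="link5">Title5</a>&nbsp;
'''

soup = BeautifulSoup(html, 'html.parser')
rows = []
for heading in soup.find_all('b'):
    row = []
    for sibling in heading.next_siblings:
        if not isinstance(sibling, Tag):
            continue  # skip text nodes such as &nbsp; and newlines
        if sibling.name == 'b':
            break  # reached the next heading, so this block is finished
        if sibling.name == 'a':
            row.append(sibling.get_text(strip=True))
    rows.append(row)

print(rows)  # [['Title1', 'Title2'], ['Title3', 'Title4', 'Title5']]
```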

Upvotes: 3

Edo Edo

Reputation: 164

Your issue is that you are looping through all 'a' tags without any grouping logic. Is it every 3 links you want to concatenate? If so, you can put a for loop inside:

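mainlist = []
# alllinks is the list of <a> tags, e.g. soup.select('a')
for start in range(0, len(alllinks), 3):
    maintitle = ''
    # concatenate the text of the next three links
    for line in alllinks[start:start + 3]:
        maintitle += line.text
    mainlist.append(maintitle)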

Look for parent blocks, then loop through the nested children:

sp = sp.find('div', id='whatever')
a = sp.select('a')  # searches recursively: finds all <a> tags inside that div
for tag in a:
    title = tag.text.strip()
    url = tag['href']

I recommend looking for the parent HTML tags of the links you want to group together, instead of grouping them arbitrarily by the order of all links.

P.S. find() and find_all() already search recursively by default; pass recursive=False if you want to restrict the search to direct children.
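To make the recursive option concrete: as far as I know, find() and find_all() search all descendants unless recursive=False is passed, which limits the search to direct children. A small sketch with made-up markup (the div id and link texts are placeholders):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link directly inside the div, one nested deeper
html = '<div id="whatever"><a href="x">Direct</a><p><a href="y">Nested</a></p></div>'
div = BeautifulSoup(html, 'html.parser').find('div', id='whatever')

# Default behaviour: every descendant <a> is found
all_links = [a.text for a in div.find_all('a')]
# recursive=False: only <a> tags that are direct children of the div
direct_links = [a.text for a in div.find_all('a', recursive=False)]

print(all_links)     # ['Direct', 'Nested']
print(direct_links)  # ['Direct']
```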

Adding strings together: str3 = str1 + str2

llist = []
for z in range(10):
    llist.append('bob' + str(z))

Each list item has an index:

print(llist[1])

Read up on lists, strings, and dictionaries.
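As a rough illustration of how those three types fit the original question, here is a sketch that keeps titles in a dictionary keyed by heading, joins each list into one string, and writes one CSV row per heading. The dictionary contents are hard-coded stand-ins for scraped data, and the '|' delimiter and io.StringIO buffer are arbitrary choices for the example.

```python
import csv
import io

# Hypothetical data: heading -> list of link titles, as if already scraped
titles_by_heading = {
    'Heading Title 1:': ['Title1', 'Title2'],
    'Heading Title 2:': ['Title3', 'Title4', 'Title5'],
}

buffer = io.StringIO()  # stands in for open('output.csv', 'w', newline='')
writer = csv.writer(buffer, lineterminator='\n')
for heading, titles in titles_by_heading.items():
    # str.join turns the list of titles into a single delimited string
    writer.writerow([heading, '|'.join(titles)])

csv_text = buffer.getvalue()
print(csv_text)
```

Joining with a character other than the CSV delimiter keeps each group of titles in a single cell.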

Upvotes: 2
