bill999

Reputation: 2529

python - beautifulsoup - find variable amount of text in between tags

I am using Python + BeautifulSoup to parse HTML. My problem is that the number of text items varies from page to page. In this case, for example, I want to extract 'Text 1', 'Text 2', ... 'Text 4'. On other webpages there may be only 'Text 1', or perhaps two, and so on. If each 'Text x' were wrapped in its own tag, my life would be easier, but they are not. I can reach individual items using next and previous (or maybe nextSibling and previousSibling), but I don't know how to collect all of them. Assuming the maximum number I would ever encounter is four, the idea is to write 'Text 1' to a file, then proceed through 'Text 4'. That covers this case. In a case where there is only 'Text 1', I would write 'Text 1' to the file and leave blanks for 2-4. Any suggestions on what I should do?

<div id="DIVID" style="display: block; margin-left: 1em;">
  <b>Header 1</b>
  <br/>
  Text 1
  <br/>
  Text 2
  <br/>
  Text 3
  <br/>
  Text 4
 <br/>
 <b>Header 2</b>
</div>

While I'm at it, I have a not-so-related question. Say I have a website with a variable number of links, each pointing to HTML exactly like the snippet above. That is not what this application actually is, but think craigslist: a central page with a number of links. I need to visit each of those pages in order to do my parsing. What would be a good approach?

Thanks!

extra: The next webpage might look like this:

<div id="DIVID2" style="display: block; margin-left: 1em;">
  <b>Header 1</b>
  <br/>
  Different Text 1
  <br/>
  Different Text 2
 <br/>
 <b>Header 2</b>
</div>

Note the differences:

  1. DIVID is now DIVID2. I can work out the suffix on DIVID from other parsing of the page, so this is not a problem.

  2. I only have two items of text instead of four.

  3. The text now is different.

Note the key similarity:

  1. Header 1 and Header 2 are the same. These don't change.
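For reference, that key similarity is enough to anchor a sibling walk: here is a sketch (my own, not from the answers; `extract_texts` and the `max_items` padding are made-up names matching the "blanks for 2-4" requirement, and it assumes bs4) that collects the bare strings between the two `<b>` headers:

```python
from bs4 import BeautifulSoup, NavigableString

def extract_texts(html, max_items=4):
    # Hypothetical helper: collect the loose strings between the two <b> headers.
    soup = BeautifulSoup(html, 'html.parser')
    div = soup.find('div', id=lambda i: i and i.startswith('DIVID'))
    header = div.find('b')  # <b>Header 1</b>
    texts = []
    for sib in header.next_siblings:
        if getattr(sib, 'name', None) == 'b':  # reached <b>Header 2</b>, stop
            break
        if isinstance(sib, NavigableString):
            s = sib.strip()
            if s:
                texts.append(s)
    # Pad with blanks up to the maximum expected count.
    return texts + [''] * (max_items - len(texts))

html = """<div id="DIVID2" style="display: block; margin-left: 1em;">
  <b>Header 1</b>
  <br/>
  Different Text 1
  <br/>
  Different Text 2
 <br/>
 <b>Header 2</b>
</div>"""
print(extract_texts(html))  # ['Different Text 1', 'Different Text 2', '', '']
```

Because it keys off the headers rather than the `<br/>` count, the same function handles both example pages.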

Upvotes: 0

Views: 3903

Answers (3)

justhalf

Reputation: 9107

You can just combine everything using get_text:

from bs4 import BeautifulSoup

test = """<div id='DIVID'>
<b>Header 1</b>
<br/>
Text 1
<br/>
Text 2
<br/>
Text 3
<br/>
Text 4
<br/>
<b>Header 2</b>
</div>"""

def divid(tag):
    return tag.name=='div' and tag.has_attr('id') and tag['id'].startswith('DIVID')

soup = BeautifulSoup(test, 'html.parser')
print(soup.find(divid).get_text())

which will give you


Header 1

Text 1

Text 2

Text 3

Text 4

Header 2

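From there, turning that text into the file columns the question asks for is plain string work. A small sketch (the four-slot padding is my own addition, matching the question's requirement; the literal reproduces the output shown above):

```python
# The text produced by get_text() above, reproduced here as a literal.
text = """
Header 1

Text 1

Text 2

Text 3

Text 4

Header 2
"""

lines = [line.strip() for line in text.splitlines() if line.strip()]
items = lines[1:-1]               # drop 'Header 1' and 'Header 2'
items += [''] * (4 - len(items))  # pad with blanks up to four slots
print(items)  # ['Text 1', 'Text 2', 'Text 3', 'Text 4']
```

On the two-item page, the same slicing would yield two texts followed by two blanks.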
Upvotes: 2

Vorsprung

Reputation: 34327

Here is a different solution: next_sibling retrieves the part of the parsed document that immediately follows a given tag (the attribute was called nextSibling in BeautifulSoup 3).

from bs4 import BeautifulSoup

text = """
<b>Header 1</b>
<br/>
Text 1
<br/>
Text 2
<br/>
Text 3
<br/>
Text 4
<br/>
<b>Header 2</b>
"""

soup = BeautifulSoup(text, 'html.parser')

for br in soup.find_all('br'):
    following = br.next_sibling
    print(following.strip())

Upvotes: 1

erewok

Reputation: 7835

You might try something like this:

>>> test ="""<b>Header 1</b>
<br/>
Text 1
<br/>
Text 2
<br/>
Text 3
<br/>
Text 4
<br/>
<b>Header 2</b>"""
>>> soup = BeautifulSoup(test)

>>> test = soup.find('b')
>>> desired_text = [x.strip() for x in str(test.parent).split('<br />')]
>>> desired_text
['<b>Header 1</b>', 'Text 1', 'Text 2', 'Text 3', 'Text 4', '<b>Header 2</b>']

Now you just need to separate the list by your 'Header' blocks, which is doable and should get you started in the right direction.

As for your other question: assemble a list of links, then iterate through them, opening each one and processing it however you like. This is a much broader question, though, so you should attempt something first and come back with a new, specific question once you need help on a particular issue.
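To sketch the link-gathering half of that (assuming bs4; the listing HTML and `http://example.com` base are made up for illustration, and the actual fetching is left as a comment):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# A hypothetical central page holding a variable number of links.
listing = """<ul>
  <li><a href="/post/1.html">First post</a></li>
  <li><a href="/post/2.html">Second post</a></li>
</ul>"""

base = "http://example.com/"
soup = BeautifulSoup(listing, 'html.parser')
# Resolve each href against the base so relative links become absolute URLs.
urls = [urljoin(base, a['href']) for a in soup.find_all('a', href=True)]
print(urls)  # ['http://example.com/post/1.html', 'http://example.com/post/2.html']

# Each page would then be fetched and parsed in turn, e.g.:
# import urllib.request
# for url in urls:
#     html = urllib.request.urlopen(url).read()
#     # ... parse html with BeautifulSoup as above ...
```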


Explanation on last line of code:

[x.strip() for x in str(test.parent).split('<br />')]

This takes my "test" node that I assigned above and grabs its parent. By turning the parent into a string, I can "split" on the <br /> tags, which makes those tags disappear and separates all the text we want separated. This creates a list where each list item has the text we want plus some '\n's.

Finally, what is probably most confusing is the list comprehension syntax, which looks like this:

some_list = [item for item in some_iterable]

This simply produces a list of "item"s taken from "some_iterable". In my list comprehension, I run through the list, take each item, and strip off the surrounding whitespace (the x.strip() part). There are many ways to do this, by the way.

Upvotes: 1
