user5498164

Python: parsing only text from HTML using bs4 and RegEx

I am building a Python 3 web crawler/scraper using bs4. Some parts may need regular expressions. I only want to scrape the text content. How should I parse something like this:

<p> This is blah blah
<a class="wordpresslink" href="https://wordpress.com/" rel="generator nofollow">WordPress.com</a>
<a href="http://www.whatever.com/"><span class="s1">Example</span></a>
Like blah blah
</p>

I want output:

This is blah blah WordPress.com Example Like blah blah

My code so far:

import urllib.request
from bs4 import BeautifulSoup

u='https://en.wikipedia.org/wiki/Adivasi'
r=urllib.request.urlopen(u)
soup=BeautifulSoup(r.read(),'html.parser')

# Collect the text of every <p>, turning newlines into spaces
res = [i.text.replace('\n', ' ').strip() for i in soup.find_all('p')]
for p in res:
    print(p)

Upvotes: 2

Views: 299

Answers (1)

Avinash Raj

Reputation: 174706

Use the BeautifulSoup parser to parse the HTML, then take the text of the <p> tag (here s holds the HTML from the question):

>>> soup = BeautifulSoup(s, 'html.parser')
>>> soup.find('p').text
' This is blah blah\nWordPress.com\nExample\nLike blah blah\n'
>>> soup.find('p').text.replace('\n', ' ').strip()
'This is blah blah WordPress.com Example Like blah blah'

If there is more than one <p> element, use find_all:

[i.text.replace('\n', ' ').strip() for i in soup.find_all('p')]
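
Since the question mentions regex: replacing only '\n' leaves tabs and runs of spaces untouched, so one option is to collapse all whitespace with re.sub. This is only a sketch, not the one right way to do it. The helper name paragraph_texts and the html variable are made up for illustration, and 'html.parser' is assumed as the parser (as in the question's code).

```python
import re
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = """<p> This is blah blah
<a class="wordpresslink" href="https://wordpress.com/" rel="generator nofollow">WordPress.com</a>
<a href="http://www.whatever.com/"><span class="s1">Example</span></a>
Like blah blah
</p>"""


def paragraph_texts(markup):
    """Return the text of every <p>, with whitespace runs collapsed to one space.

    Hypothetical helper for illustration; it is not part of bs4.
    """
    soup = BeautifulSoup(markup, "html.parser")
    # get_text() concatenates all nested text nodes; the regex then replaces
    # any run of spaces, tabs or newlines with a single space.
    return [re.sub(r"\s+", " ", p.get_text()).strip() for p in soup.find_all("p")]


print(paragraph_texts(html))
# ['This is blah blah WordPress.com Example Like blah blah']
```

On a real page you would pass r.read() from the question's code instead of the sample string.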

Upvotes: 1
