Python: parsing only text from HTML using bs4 and RegEx

Question

I am building a Python 3 web crawler/scraper using bs4. Some parts may need regex. I only want to scrape the text content. How should I parse something like this:

 This is blah blah
WordPress.com
Example
Like blah blah

I want output:

This is blah blah WordPress.com Example Like blah blah

My code so far:

import urllib.request
from bs4 import BeautifulSoup

u='https://en.wikipedia.org/wiki/Adivasi'
r=urllib.request.urlopen(u)
soup=BeautifulSoup(r.read(),'html.parser')

res = [i.text.replace('\n', ' ').strip() for i in soup.find_all('p')]
for p in res:
    print(p)

Avinash Raj · Accepted Answer

Use the BeautifulSoup parser to parse the HTML.

>>> soup = BeautifulSoup(s)
>>> soup.find('p').text
u' This is blah blah\nWordPress.com\nExample\nLike blah blah\n'
>>> soup.find('p').text.replace('\n', ' ').strip()
u'This is blah blah WordPress.com Example Like blah blah'

If there is more than one p element, use find_all:

[i.text.replace('\n', ' ').strip() for i in soup.find_all('p')]
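
Since the question mentions regex: replacing only newlines leaves tabs and runs of spaces untouched, so one possible variation is to collapse all whitespace with re.sub. The sketch below is not part of the original answer. The sample markup is hypothetical (the question's actual HTML is not shown), and the helper names paragraph_texts and paragraph_texts_sep are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real HTML from the question is not shown.
html = """<p> This is blah blah
<a href="https://wordpress.com">WordPress.com</a>
<b>Example</b>
Like blah blah
</p>"""


def paragraph_texts(markup):
    """Return the text of every <p>, with any whitespace run collapsed to one space."""
    soup = BeautifulSoup(markup, "html.parser")
    return [re.sub(r"\s+", " ", p.get_text()).strip() for p in soup.find_all("p")]


def paragraph_texts_sep(markup):
    """Regex-free variant: strip each text fragment and join the fragments with a space."""
    soup = BeautifulSoup(markup, "html.parser")
    return [p.get_text(" ", strip=True) for p in soup.find_all("p")]


print(paragraph_texts(html))
# ['This is blah blah WordPress.com Example Like blah blah']
```

Both helpers give the same result on this sample. They can differ when inline tags sit directly next to text with no whitespace between them, because the separator variant inserts a space at every tag boundary.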
