Reputation: 3671
I am trying to access the article content from a website using BeautifulSoup, with the code below:
import urllib2
from bs4 import BeautifulSoup

site = 'http://www.example.com'
req = urllib2.Request(site)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
content = soup.find_all('p')
content = str(content)
The content object contains all of the main text from the page that is within the 'p' tags, however other tags are still present in the output, as can be seen in the image below. I would like to remove all characters enclosed in matching pairs of < > tags, along with the tags themselves, so that only the text remains.
I have tried the following method, but it does not seem to work.
' '.join(item for item in content.split() if not (item.startswith('<') and item.endswith('>')))
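I think the reason is that split() breaks the string on whitespace, so a tag with attributes turns into several tokens, none of which both starts with < and ends with >. For example (illustration only):
print('<a href="/quote/AAPL.html">AAPL</a>'.split())
# ['<a', 'href="/quote/AAPL.html">AAPL</a>']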
What is the best way to remove substrings from a string that begin and end with a certain pattern, such as < and >?
Upvotes: 3
Views: 42788
Reputation: 51
A simple algorithm that will work in any language, with no modules or additional libraries imported. The code is self-documenting:
def removetags_fc(data_str):
    appendingmode_bool = True
    output_str = ''
    for char_str in data_str:
        if char_str == '<':
            # a tag is opening: stop copying characters
            appendingmode_bool = False
        elif char_str == '>':
            # the tag has closed: resume copying from the next character
            appendingmode_bool = True
            continue
        if appendingmode_bool:
            output_str += char_str
    return output_str
As a small refinement, the '<' and '>' literals could be bound to names once before the loop starts rather than written inline.
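For example, on a made-up snippet it behaves like this:
sample = '<p><strong>Apple</strong> is waking up the echoes.</p>'
print(removetags_fc(sample))
# Apple is waking up the echoes.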
Upvotes: 0
Reputation: 63
If you are restricted from using any libraries, you can simply use the code below to remove HTML tags.
I just corrected what you tried; thanks for the idea.
content="<h4 style='font-size: 11pt; color: rgb(67, 67, 67); font-family: arial, sans-serif;'>Sample text for display.</h4> <p> </p>"
' '.join([word for line in [item.strip() for item in content.replace('<',' <').replace('>','> ').split('>') if not (item.strip().startswith('<') or (item.strip().startswith('&') and item.strip().endswith(';')))] for word in line.split() if not (word.strip().startswith('<') or (word.strip().startswith('&') and word.strip().endswith(';')))])
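For readability, the same filtering rules can be written out as a small function (a sketch of the same idea, with a hypothetical name):
def strip_tags_by_split(content):
    # Pad the brackets, split on '>', then drop any token that starts with '<'
    # or that looks like an HTML entity (&...;).
    words = []
    for item in content.replace('<', ' <').replace('>', '> ').split('>'):
        item = item.strip()
        if item.startswith('<') or (item.startswith('&') and item.endswith(';')):
            continue
        for word in item.split():
            if word.startswith('<') or (word.startswith('&') and word.endswith(';')):
                continue
            words.append(word)
    return ' '.join(words)

print(strip_tags_by_split(content))
# Sample text for display.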
Upvotes: 1
Reputation: 63782
Pyparsing makes it easy to write an HTML stripper by defining a pattern matching all opening and closing HTML tags, and then transforming the input using that pattern as a suppressor. This still leaves the &xxx;
HTML entities to be converted - you can use xml.sax.saxutils.unescape
to do that:
source = """
<p><strong>Editors' Pick: Originally published March 22.<br /> <br /> Apple</strong> <span class=" TICKERFLAT">(<a href="/quote/AAPL.html">AAPL</a> - <a href="http://secure2.thestreet.com/cap/prm.do?OID=028198&ticker=AAPL">Get Report</a><a class=" arrow" href="/quote/AAPL.html"><span class=" tickerChange" id="story_AAPL"></span></a>)</span> is waking up the echoes with the reintroduction of a 4-inch iPhone, a model its creators hope will lead the company to victory not just in emerging markets, but at home as well.</p>
<p>"There's significant pent-up demand within Apple's base of iPhone owners who want a smaller iPhone with up-to-date specs and newer features," Jackdaw Research Chief Analyst Jan Dawson said in e-mailed comments.</p>
<p>The new model, dubbed the iPhone SE, "should unleash a decent upgrade cycle over the coming months," Dawson said. Prior to the iPhone 6 and 6 Plus, introduced in 2014, Apple's iPhones were small, at 3.5 inches and 4 inches tall, respectively, compared with models by Samsung and others that approached 6 inches.</p>
<div class=" butonTextPromoAd">
<div class=" ym" id="ym_44444440"></div>"""
from pyparsing import anyOpenTag, anyCloseTag
from xml.sax.saxutils import unescape as unescape
unescape_xml_entities = lambda s: unescape(s, {"&apos;": "'", "&quot;": '"', "&nbsp;": " "})
stripper = (anyOpenTag | anyCloseTag).suppress()
print(unescape_xml_entities(stripper.transformString(source)))
gives:
Editors' Pick: Originally published March 22. Apple (AAPL - Get Report) is waking up the echoes with the reintroduction of a 4-inch iPhone, a model its creators hope will lead the company to victory not just in emerging markets, but at home as well.
"There's significant pent-up demand within Apple's base of iPhone owners who want a smaller iPhone with up-to-date specs and newer features," Jackdaw Research Chief Analyst Jan Dawson said in e-mailed comments.
The new model, dubbed the iPhone SE, "should unleash a decent upgrade cycle over the coming months," Dawson said. Prior to the iPhone 6 and 6 Plus, introduced in 2014, Apple's iPhones were small, at 3.5 inches and 4 inches tall, respectively, compared with models by Samsung and others that approached 6 inches.
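As an aside, unescape on its own only converts &amp;, &lt; and &gt;; any other entities have to be supplied via its second argument, which is why the snippet above passes a small mapping. A quick illustration:
from xml.sax.saxutils import unescape
print(unescape("Fish &amp; chips &ndash; &quot;to go&quot;", {"&quot;": '"', "&ndash;": "-"}))
# Fish & chips - "to go"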
(And in future, please do not provide sample text or code as non-copy-pasteable images.)
Upvotes: 1
Reputation: 28277
Using regEx:
re.sub('<[^<]+?>', '', text)
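For example, on a small sample string (text is a placeholder here):
import re

text = '<p><strong>Apple</strong> is waking up the echoes.</p>'
print(re.sub('<[^<]+?>', '', text))
# Apple is waking up the echoes.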
Using BeautifulSoup: (solution from here)
import urllib
from bs4 import BeautifulSoup
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
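The splitlines/split post-processing above is plain string cleanup; on its own it behaves like this (made-up input):
text = "  Headline one  Headline two  \n\n   Body text   "
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
print('\n'.join(chunk for chunk in chunks if chunk))
# Headline one
# Headline two
# Body text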
Using NLTK:
import nltk
from urllib import urlopen
url = "https://stackoverflow.com/questions/tagged/python"
html = urlopen(url).read()
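# Note: nltk.clean_html() was removed in NLTK 3.x; newer releases point users to BeautifulSoup instead.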
raw = nltk.clean_html(html)
print(raw)
Upvotes: 26
Reputation: 12623
You could use get_text() on each element returned by find_all() (before the str() conversion):
for i in content:
    print i.get_text()
Example below is from the docs:
>>> markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
>>> soup = BeautifulSoup(markup)
>>> soup.get_text()
u'\nI linked to example.com\n'
Upvotes: 10
Reputation: 174708
You need to use the strings generator:
for text in content.strings:
    print(text)
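For instance, a minimal sketch (.strings is available on the soup or on a single tag, so iterate those rather than the str()-converted result):
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>I linked to <i>example.com</i></p>', 'html.parser')
for text in soup.strings:
    print(text)
# I linked to
# example.com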
Upvotes: 1