Gokberk

Reputation: 65

Get the text of a specific <h1> in Python

I want to get only the title of the page, <h1>This is Title</h1>, in Python.

I tried a few methods but couldn't get the desired result.

import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.strawpoll.me/20321563/r")
html_content = response.content
soup = BeautifulSoup(html_content, "html.parser")

for i in soup.get_text("p", {"class": "result-list"}):
    print(i)

Upvotes: 0

Views: 1853

Answers (4)

Rpatel

Reputation: 11

You could use BeautifulSoup as follows:

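from bs4 import BeautifulSoup

data = "html as text(Source)"  # placeholder: the page's HTML source as a string

soup = BeautifulSoup(data, "html.parser")

# 'titleClass' is a placeholder for the actual class of the <h1>
p = soup.find('h1', attrs={'class': 'titleClass'})
if p.a is not None:
    p.a.extract()  # drop a nested <a> so its text is not included
print(p.text.strip())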
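
To try the lookup without fetching anything, you can run it on an inline snippet first. This is only a minimal sketch: the sample HTML and the helper name first_h1_text are made up for illustration, and the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Made-up sample markup standing in for the real page source.
SAMPLE_HTML = """
<html>
  <head><title>This is Title - Some Site</title></head>
  <body>
    <h1>  This is Title </h1>
    <p class="result-list">Option A</p>
  </body>
</html>
"""


def first_h1_text(html):
    """Return the stripped text of the first <h1>, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.find("h1")
    if h1 is None:
        return None
    return h1.get_text(strip=True)


print(first_h1_text(SAMPLE_HTML))
```

Returning None when no <h1> exists avoids the AttributeError you would otherwise get from calling a method on a failed find().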

Upvotes: 1

Ransaka Ravihara

Reputation: 1994

Try this method if you still can't get the result you want.

from urllib.request import urlopen
from bs4 import BeautifulSoup

my_url = 'https://www.strawpoll.me/20321563/r'
# Download the page; the context manager closes the connection afterwards
with urlopen(my_url) as client:
    page_html = client.read()
page_soup = BeautifulSoup(page_html, "html.parser")
# Find the <div id="result-list">, then every <h1> inside it
_div = page_soup.find('div', id='result-list')
title = _div.find_all('h1')

print(title)

Output : [<h1>This is Title</h1>]
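
If you want the text rather than a list of tags, a CSS selector is a shorter way to express "the <h1> inside that div". A rough sketch, assuming the container really has id="result-list"; the markup below is invented for the example and the variable name poll_title is mine:

```python
from bs4 import BeautifulSoup

# Made-up markup; the real page's structure may differ.
SAMPLE_HTML = """
<div id="header"><h1>Site Name</h1></div>
<div id="result-list"><h1>This is Title</h1></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# "#result-list h1" matches an <h1> anywhere inside the element with that id.
h1 = soup.select_one("#result-list h1")
poll_title = h1.get_text(strip=True) if h1 is not None else None
print(poll_title)
```

Scoping the selector to the container keeps an unrelated <h1> elsewhere on the page (here, the one in the header) from being picked up.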

Upvotes: 0

Gokberk

Reputation: 65

I added the following to my code.

title = soup.title
print(title.string[:-24])  # The last 24 characters of the title are always the same, so drop them.
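
One caveat with slicing by a fixed count: if the site ever changes that trailing text, the slice will silently cut into the real title. A slightly safer sketch removes the suffix only when it is actually there. The suffix string and the function name below are hypothetical, since the actual 24 characters are not shown here.

```python
# Hypothetical constant suffix; substitute the text the site actually appends.
SITE_SUFFIX = " - Example Poll Site"


def strip_site_suffix(page_title, suffix=SITE_SUFFIX):
    """Remove the suffix only if the title really ends with it."""
    if suffix and page_title.endswith(suffix):
        return page_title[:-len(suffix)]
    return page_title


print(strip_site_suffix("This is Title - Example Poll Site"))
print(strip_site_suffix("Unrelated title"))
```

A title that does not end with the suffix comes back unchanged instead of losing its last characters.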

Upvotes: 0

Tim Anthony

Reputation: 159

Use lxml for such tasks. You could use BeautifulSoup as well.

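from urllib.request import urlopen
import lxml.html

url = "https://www.strawpoll.me/20321563/r"  # the page from the question
# Pass an open response: lxml's own URL loader does not handle https
t = lxml.html.parse(urlopen(url))
print(t.find(".//title").text)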

(This is from "How can I retrieve the page title of a webpage using Python?" by Peter Hoffmann.)

Upvotes: 4
