Karan

Reputation: 498

How to get all the text of a website without the source code?

Is there any way to get all the text of a website, without the HTML source code?

For example, like opening a website in a browser and pressing Ctrl+A to select everything on the page.

import requests

content = requests.get('any url')
print(content.text)

This prints the page's source code as text, but I want only the text I would get with the Ctrl+A approach described above. How can I achieve that?

Upvotes: 0

Views: 2286

Answers (2)

Arun Soorya

Reputation: 484

Step 1: Get the HTML of a web page.

Step 2: Use the Beautiful Soup package to parse the HTML (if you are new to Beautiful Soup, see https://pypi.org/project/beautifulsoup4/).

Step 3: List the elements whose text you do not want (e.g. header, meta, script).

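import requests
from bs4 import BeautifulSoup

url = 'https://www.zzz.com/yyy/'  # give any url
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')

# Collect every text node in the document.
# (string=True is the current name of the older text=True argument.)
text = soup.find_all(string=True)

# Text nodes whose parent tag is listed here are skipped.
blacklist = [
    '[document]',
    'noscript',
    'header',
    'html',
    'meta',
    'head',
    'input',
    'script',
    'style',  # CSS rules are not visible text either
    # add any other elements whose text you do not want
]

output = ''
for t in text:
    if t.parent.name not in blacklist:
        output += '{} '.format(t)

print(output)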
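
A variation on the steps above, as a rough sketch rather than a definitive version: instead of checking each text node's parent, remove the unwanted elements from the tree first and then let Beautiful Soup join what is left. The function name visible_text, the HIDDEN_TAGS list and the sample HTML are assumptions made for illustration. It is run on an inline string here so it works without a network request.

```python
from bs4 import BeautifulSoup

# Assumed list of tags whose contents a browser does not display; extend as needed.
HIDDEN_TAGS = ["script", "style", "noscript", "head"]


def visible_text(html):
    """Return the text of an HTML string with the HIDDEN_TAGS contents removed."""
    soup = BeautifulSoup(html, "html.parser")
    # Detach each unwanted element (and everything inside it) from the tree.
    for tag in soup(HIDDEN_TAGS):
        tag.extract()
    # Join the remaining text nodes with single spaces, trimming whitespace.
    return soup.get_text(separator=" ", strip=True)


sample = (
    "<html><head><title>T</title><script>var x = 1;</script></head>"
    "<body><h1>Hello</h1><p>World <b>again</b></p></body></html>"
)
print(visible_text(sample))
```

For a live page, the same function could be called as visible_text(requests.get(url).text). Text that the page fills in with JavaScript after loading will not be present in either case, because requests only downloads the initial HTML.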

Upvotes: 1

ph140

Reputation: 475

For this you have to install beautifulsoup4 and lxml (for example with pip install beautifulsoup4 lxml); after that it will work.

from bs4 import BeautifulSoup
import requests

source = requests.get('your_url').text
# .text returns all the text in the parsed document, with the tags stripped
text = BeautifulSoup(source, 'lxml').text
print(text)
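
If installing extra packages is not an option, a similar result can be approximated with Python's standard library alone. This is a minimal sketch that swaps Beautiful Soup for html.parser.HTMLParser. The class name TextExtractor, the function extract_text, the SKIP set and the sample HTML are all hypothetical names chosen for illustration. It assumes reasonably well-formed HTML, since an unclosed skipped tag would hide everything after it.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside the tags listed in SKIP."""

    # Assumed list of non-visible containers; adjust as needed.
    SKIP = {"script", "style", "noscript", "head"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


sample = (
    "<html><head><title>T</title><script>var x = 1;</script></head>"
    "<body><h1>Hello</h1><p>World <b>again</b></p></body></html>"
)
print(extract_text(sample))
```

The HTML string passed to extract_text could come from requests.get('your_url').text, as in the snippet above.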

Upvotes: 0
