1man

Reputation: 5634

BeautifulSoup and Large HTML

I was trying to scrape a number of large Wikipedia pages like this one.

Unfortunately, BeautifulSoup cannot handle such large content and truncates the page.

Upvotes: 1

Views: 2459

Answers (2)

Cornel Ghiban

Reputation: 902

I suggest you fetch the HTML content yourself and then pass it to BeautifulSoup:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://en.wikipedia.org/wiki/Talk:Game_theory')
if r.ok:
    # Pass an explicit parser; without one, BeautifulSoup picks whatever is installed
    soup = BeautifulSoup(r.content, 'html.parser')
    # get the div with links at the bottom of the page
    links_div = soup.find('div', id='catlinks')
    for a in links_div.find_all('a'):
        print(a.text)
else:
    print(r.status_code)

Upvotes: 1

1man

Reputation: 5634

I found a solution to this problem using BeautifulSoup at beautifulsoup-where-are-you-putting-my-html; I think it is easier than switching to lxml.

The only thing you need to do is install the html5lib parser:

pip install html5lib

and pass it as the parser argument to BeautifulSoup:

soup = BeautifulSoup(htmlContent, 'html5lib')
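To see why the parser choice matters, here is a minimal sketch (the `broken` snippet is an illustrative fragment, not content from the Wikipedia page) showing how html5lib follows the HTML5 error-recovery rules and keeps content that stricter parsers might drop:

```python
from bs4 import BeautifulSoup

# A deliberately malformed fragment: the first <p> is never closed.
broken = "<html><body><p>first<p>second</body></html>"

# html5lib parses this the way a browser would, closing the first <p>
# implicitly when the second one opens.
soup = BeautifulSoup(broken, "html5lib")
paragraphs = [p.text for p in soup.find_all("p")]
print(paragraphs)  # → ['first', 'second']
```

Because html5lib builds the same tree a browser would, it is the most forgiving option for large, messy pages, at the cost of being slower than `lxml` or `html.parser`.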

However, if you prefer, you can also use lxml as follows:

import lxml.html

doc = lxml.html.parse('https://en.wikipedia.org/wiki/Talk:Game_theory')
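For comparison, the same kind of link extraction with lxml — a sketch that parses a local HTML string rather than fetching the page (the `catlinks` id matches the div used in the other answer):

```python
import lxml.html

# Parse from a string here; lxml.html.parse() also accepts a URL or file.
doc = lxml.html.fromstring(
    '<div id="catlinks"><a href="/wiki/A">A</a><a href="/wiki/B">B</a></div>'
)

# XPath query for the text of every link inside the catlinks div.
links = doc.xpath('//div[@id="catlinks"]//a/text()')
print(links)  # → ['A', 'B']
```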

Upvotes: 2
