Alg_D

Reputation: 2390

python - web scraping BeautifulSoup and urllib

I am using Python 3.4 and my script looks like this:

import urllib
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
from bs4 import BeautifulSoup

url = "http://www.embassy-worldwide.com/"

headers={'User-Agent': 'Mozilla/5.0'}
#req = Request(url, headers)

try:
    req = urllib.request.Request(url, headers)
    #print (req)
except HTTPError as e:
    print('Error code: ', e.code)
except URLError as e:
    print('Reason: ', e.reason)
else:
    print('good!')

print (req)

#html = urllib.request.urlopen(req)
with urllib.request.urlopen(req) as response:
    html = response.read()
print(html)

The code above results in an error:

ValueError: Content-Length should be specified for iterable data of type {'User-Agent': 'Mozilla/5.0'}

How can I get the HTML and then iterate over the tags to get a list of all countries?

Upvotes: 0

Views: 1345

Answers (1)

user1467267

Try this style in urllib3:

import sys
import re
import time
import pprint
import codecs
import unicodedata
import urllib3
import json

urllib3.disable_warnings()

cookie = '_session_id=29913b5f1b8836d2a8387ef4db00745e'
header = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/536.26.17 (KHTML, like Gecko) Version/6.0.2 Safari/536.26.17'
url = 'https://yoururl.com/'
m = urllib3.PoolManager(num_pools = 15)

# Pass headers by keyword; the positional order of request() differs between urllib3 versions.
r = m.request('GET', url, headers={'User-Agent': header, 'Cookie': cookie})

print(r.data)

The snippet has more imports than it needs; it is cut from a larger scraper I use. Mine uses regular expressions because, for the tiny fragments I need, regex is faster in my case than a full BeautifulSoup parse.
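
If you would rather stay with the standard library, the ValueError in the question most likely comes from passing the headers dict as the second positional argument of Request, which is `data` (the request body), not `headers`. Below is a minimal sketch of that fix plus a BeautifulSoup pass over the result. The helper names (`build_request`, `fetch_html`, `link_texts`) and the sample markup are made up for illustration, and the choice of `<a>` tags is only an assumption, since I have not inspected how that page marks up its country list.

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

from bs4 import BeautifulSoup


def build_request(url):
    # Pass headers by keyword: the second positional parameter of Request is
    # `data` (the request body), which is what triggered the ValueError.
    return Request(url, headers={'User-Agent': 'Mozilla/5.0'})


def fetch_html(url):
    # urlopen() is the call that can raise HTTPError/URLError, so it is the
    # one that belongs inside the try block.
    try:
        with urlopen(build_request(url)) as response:
            return response.read()
    except HTTPError as e:
        print('Error code: ', e.code)
    except URLError as e:
        print('Reason: ', e.reason)
    return None


def link_texts(html):
    # Assumption: the country names are the text of <a> tags. Inspect the
    # real page and narrow the selector accordingly.
    soup = BeautifulSoup(html, 'html.parser')
    return [a.get_text(strip=True) for a in soup.find_all('a')]


# Hypothetical markup, used here so the example runs without network access.
sample = '<ul><li><a href="/fr">France</a></li><li><a href="/jp">Japan</a></li></ul>'
print(link_texts(sample))  # ['France', 'Japan']
```

Against the real site you would call `html = fetch_html(url)` and, if it is not None, pass it to `link_texts(html)`, then tighten the `find_all` selector once you know which tags actually hold the countries.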

Upvotes: 2
