tardos93

Reputation: 233

Python 2.7 web-scraping from a LOG IN website

I want to scrape data from an HTTPS website where I have to log in to get the information.

Here is (the first part of) my code:

import requests
from lxml import html
import urllib2
from bs4 import BeautifulSoup
import unicodecsv as csv
import os
import sys
import io
import time
import datetime
import pandas as pd
import MySQLdb

url = 'https://www.opten.hu/'
values = {'user': 'MYUSERNAME',
          'password': 'MYPASSWORD'}

r = requests.post(url, data=values)

params = {'Category': 6, 'deltreeid': 6, 'do': 'Delete Tree'}
url = 'https://www.opten.hu/cegtar/cegkivonat/0910000511'

result = requests.get(url, data=params, cookies=r.cookies)

print result

If I run it and print the result, I get "Response [200]", so it's OK: the server successfully answered the HTTP request.

After that I want to navigate to another menu item on this website, where I can find the information that is valuable to me (the variable called url).

How can I scrape this page? What am I doing wrong in my code?

import requests
from lxml import html
import urllib2
from bs4 import BeautifulSoup
import unicodecsv as csv
import os
import sys
import io
import time
import datetime
import pandas as pd
import MySQLdb

url = 'https://www.opten.hu/'
values = {'user': 'MYUSERNAME',
          'password': 'MYPASSWORD'}

r = requests.post(url, data=values)

params = {'Category': 6, 'deltreeid': 6, 'do': 'Delete Tree'}
url = 'https://www.opten.hu/cegtar/cegkivonat/0910000511'

result = requests.get(url, data=params, cookies=r.cookies)

print result

page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)

print soup

Upvotes: 2

Views: 725

Answers (1)

swapnilsm

Reputation: 309

You are using urllib2 to read the content. It makes another request to the URL, but it does not send the cookies you obtained in the previous request.
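To illustrate the point, here is a small offline sketch (the cookie name `sessionid` is made up) showing that a `requests.Session` attaches its stored cookies to every request it prepares, which plain `urllib2.urlopen` does not do:

```python
import requests

# A Session keeps a single cookie jar, so cookies set at login time are
# attached automatically to every later request made through it.
session = requests.Session()
session.cookies.set('sessionid', 'abc123')  # pretend the login response set this

# Prepare (but do not send) a request to inspect the outgoing headers.
req = requests.Request('GET', 'https://www.opten.hu/cegtar/cegkivonat/0910000511')
prepared = session.prepare_request(req)
print(prepared.headers.get('Cookie'))  # sessionid=abc123
```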

Try the following code. I have used requests.Session to persist the cookies, so you don't need urllib2 any more.

# Author: Swapnil Mahajan
import requests
from bs4 import BeautifulSoup

url = 'https://www.opten.hu/ousers/loginuser'
values = {'user': 'MYUSERNAME',
          'password': 'MYPASSWORD'}

session = requests.Session()

r = session.post(url, data=values)

params = {'Category': 6, 'deltreeid': 6, 'do': 'Delete Tree'}
url = 'https://www.opten.hu/cegtar/cegkivonat/0910000511'

result = session.get(url, params=params)  # send as query string, not a request body

soup = BeautifulSoup(result.text, "lxml")
print soup
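One caveat worth adding (a general tip, not something from the answer): an HTTP 200 on the login POST does not prove the login worked, since many sites return 200 with the login form served again. A hypothetical helper like the following can sanity-check the response before you start scraping:

```python
# Hypothetical helper (not from the answer): heuristics for whether a login
# POST actually succeeded, since "Response [200]" alone proves nothing.
def login_looks_successful(status_code, body, cookies):
    if status_code != 200:
        return False
    if 'type="password"' in body.lower():  # the login form was served again
        return False
    return bool(cookies)  # the server set at least one session cookie

# Offline check with canned data instead of a live request:
print(login_looks_successful(200, '<a href="/logout">Logout</a>', {'sid': 'x'}))  # True
print(login_looks_successful(200, '<input type="password">', {}))  # False
```

With a real response you would call it as `login_looks_successful(r.status_code, r.text, session.cookies)`.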

Upvotes: 1
