Vivek K. Singh

Reputation: 170

Web scraping with Python requests POST request

I am trying to scrape this website, which gives the license number information on submitting the form. I am trying to simulate the POST request, but every time it sends the response "No data found". I looked at the Network tab and tried every request header and payload, but it is still not working. I am completely lost.

Here is my code

import json
import requests
from requests import Session
session = Session()

headers = {
    "Accept":"*/*",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9,hi;q=0.8",
    "Connection": "keep-alive",
    "Content-Length": "39",
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    #"Cookie": "PHPSESSID=l98qcansg3shogj29o14mt1opi; _ga=GA1.2.905847747.1646590237; _gid=GA1.2.153711160.1646590237",
    #"DNT": "1",
    "Host": "vahaninfos.com",
    "Origin": "https://vahaninfos.com",
    "Referer": "https://vahaninfos.com/vehicle-details-by-number-plate",
    #"sec-ch-ua": """ " Not A;Brand";v="99", "Chromium";v="98", "Google Chrome";v="98" """,
    #"sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "Windows",
    "X-Requested-With": "XMLHttpRequest",


    #"Sec-Fetch-Mode" :"cors",
    #"Sec-Fetch-Dest": "empty",
    "num":"NDQ4MTg3MjMw",
    #"Sec-Fetch-Site":"same-origin"
    }
payload = {
    "number":"UP32AT5472",
    "g-recaptcha-response":""
    }

url = "https://vahaninfos.com/getdetails.php"
res = requests.post(url, data=json.dumps(payload), headers=headers, verify=False)
print(res.content)

If you want to try the website yourself, you can input UP32AT5471 and look at the output.

Screenshots from Chrome DevTools: the Network tab request headers, the request payload, and the General tab.

Thanks in advance.

Upvotes: 1

Views: 4915

Answers (1)

furas

Reputation: 142651

This page sends a cookie named PHPSESSID, and in the HTML it embeds a token like this

<script>token = "NDQ4MTg3MjMw"

and it uses JavaScript to read this value and add it to the request headers as

num: NDQ4MTg3MjMw,

And the server needs both PHPSESSID and num to send the data.

Every connection creates a new PHPSESSID and token, so you could hardcode some values in your code, but the session ID may only be valid for a few minutes. It is better to get fresh values with a GET request before the POST request.


So you have to use requests.Session to work with cookies: first send a GET request to https://vahaninfos.com/vehicle-details-by-number-plate to get the PHPSESSID cookie and the HTML containing <script>token = "..."

Next you have to extract this token from the HTML, e.g. with a regex, and add it as the header num: ... in the POST request.


It seems the other headers are not important, not even X-Requested-With.

This page expects the data to be sent as a form, so you need data=payload instead of data=json.dumps(payload). requests then creates the Content-Type and Content-Length headers automatically with the correct values.

import requests
import re

session = requests.Session()

# --- GET ---

headers = {
#    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0",
}

url = "https://vahaninfos.com/vehicle-details-by-number-plate"
res = session.get(url, headers=headers, verify=False)

# extract the token embedded in the page as <script>token = "..."
number = re.findall('token = "([^"]*)"', res.text)[0]

# --- POST ---

headers = {
#    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0",
#    "X-Requested-With": "XMLHttpRequest",
    'num': number,
}

payload = {
    "number": "UP32AT5472",
    "g-recaptcha-response": "",
}

url = "https://vahaninfos.com/getdetails.php"
res = session.post(url, data=payload, headers=headers, verify=False)
print(res.text)

Result:

<tr><td>Registration Number</td><td>:</td><td>UP32AT5472</td></tr>
        <tr><td>Registration Authority</td><td>:</td><td>LUCKNOW</td></tr>
        <tr><td>Registration Date</td><td>:</td><td>2003-06-06</td></tr>
        <tr><td>Chassis Number</td><td>:</td><td>487530</td></tr>
        <tr><td>Engine Number</td><td>:</td><td>490062</td></tr>
        <tr><td>Fuel Type</td><td>:</td><td>PETROL</td></tr>
        <tr><td>Engine Capacity</td><td>:</td><td></td></tr>
        <tr><td>Model/Model Name</td><td>:</td><td>TVS VICTOR</td></tr>
        <tr><td>Color</td><td>:</td><td></td></tr>
        <tr><td>Owner Name</td><td>:</td><td>HARI MOHAN  PANDEY</td></tr>
        <tr><td>Ownership Type</td><td>:</td><td></td></tr>
        <tr><td>Financer</td><td>:</td><td>CENTRAL BANK OF INDIA</td></tr>
        <tr><td>Vehicle Class</td><td>:</td><td>M-CYCLE/SCOOTER(2WN)</td></tr>
        <tr><td>Fitness/Regn Upto</td><td>:</td><td></td></tr>
        <tr><td>Insurance Company</td><td>:</td><td>NATIONAL INSURANCE CO LTD.</td></tr>
        <tr><td>Insurance Policy No</td><td>:</td><td>4165465465465</td></tr>
        <tr><td>Insurance expiry</td><td>:</td><td>2004-06-05</td></tr>
        <tr><td>Vehicle Age</td><td>:</td><td></td></tr>
        <tr><td>Vehicle Type</td><td>:</td><td></td></tr>
        <tr><td>Vehicle Category</td><td>:</td><td></td></tr>

Now you can use BeautifulSoup or lxml (or another module) to get the values from the HTML; an equivalent lxml sketch is shown after the result below.

from bs4 import BeautifulSoup

soup = BeautifulSoup(res.text, 'html.parser')

for row in soup.find_all('tr'):
    cols = row.find_all('td')

    key = cols[0].text
    val = cols[-1].text

    print(f'{key:22} | {val}')

Result:

Registration Number    | UP32AT5472
Registration Authority | LUCKNOW
Registration Date      | 2003-06-06
Chassis Number         | 487530
Engine Number          | 490062
Fuel Type              | PETROL
Engine Capacity        | 
Model/Model Name       | TVS VICTOR
Color                  | 
Owner Name             | HARI MOHAN  PANDEY
Ownership Type         | 
Financer               | CENTRAL BANK OF INDIA
Vehicle Class          | M-CYCLE/SCOOTER(2WN)
Fitness/Regn Upto      | 
Insurance Company      | NATIONAL INSURANCE CO LTD.
Insurance Policy No    | 4165465465465
Insurance expiry       | 2004-06-05
Vehicle Age            | 
Vehicle Type           | 
Vehicle Category       | 
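The same extraction with lxml looks very similar. This is just a minimal sketch, assuming the lxml package is installed:

from lxml import html

tree = html.fromstring(res.text)

for row in tree.xpath('//tr'):
    cols = row.xpath('./td')

    key = cols[0].text_content()
    val = cols[-1].text_content()

    print(f'{key:22} | {val}')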

EDIT:

After running the code a few times, the POST started sending me only the value R. Maybe it needs some other headers to hide the bot (e.g. User-Agent), or maybe sometimes it needs a correct ReCaptcha response.

At least in Chrome, it stops sending R when I solve the ReCaptcha.

But Firefox still gets only R.

Originally I was using the User-Agent from my Firefox, and the server may remember it.


EDIT:

If I use a User-Agent different from my Firefox's, then the code gets correct values again, while Firefox still gets only R.

headers = {
    "User-Agent": "Mozilla/5.0",
}

So it seems the code may need to use a random User-Agent in every request to hide the bot.
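A minimal sketch of that idea, with a few example User-Agent strings picked arbitrarily (any reasonably current browser strings should do), reusing session, number, url and payload from the code above:

import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.3 Safari/605.1.15",
]

headers = {
    "User-Agent": random.choice(USER_AGENTS),  # pick a different browser string for each request
    "num": number,                             # token extracted from the GET response, as above
}

res = session.post(url, data=payload, headers=headers, verify=False)
print(res.text)

There are also packages like fake-useragent that can generate such strings for you.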

Upvotes: 5
