Reputation: 1459
I am trying to programmatically download (open) data from a website using BeautifulSoup.
The website uses a PHP form: you submit input data, and the resulting links are then apparently rendered within the same form.
My approach was as follows:
Step 1: post the form data via requests
Step 2: parse the resulting links via BeautifulSoup
However, this does not seem to work, or I am doing something wrong: the POST does not appear to have any effect, so Step 2 is not even possible because no results are available.
Here is my code:
from bs4 import BeautifulSoup
import requests

def get_text_link(soup):
    'Returns list of links to individual legal texts'
    ergebnisse = soup.findAll(attrs={"class": "einErgebnis"})
    if ergebnisse:
        links = [el.find("a", href=True).get("href") for el in ergebnisse]
    else:
        links = []
    return links

url = "https://www.justiz.nrw.de/BS/nrwe2/index.php#solrNrwe"
# Post specific day to get one day of data
params = {'von': '01.01.2018',
          'bis': '31.12.2018',
          "absenden": "Suchen"}
response = requests.post(url, data=params)
content = response.content
soup = BeautifulSoup(content, "lxml")
resultlinks_to_parse = get_text_link(soup)  # is always an empty list
# proceed from here....
Can someone tell me what I am doing wrong? I am not really familiar with POST requests. For example, the form field for "bis" looks as follows:
<input id="bis" type="text" name="bis" size="10" value="">
If my approach is flawed, I would appreciate any hint on how to deal with this kind of site.
Thanks!
Upvotes: 5
Views: 13085
Reputation: 2445
I've found the issue in your request.
My investigation shows that the following params are available:
gerichtstyp:
gerichtsbarkeit:
gerichtsort:
entscheidungsart:
date:
von: 01.01.2018
bis: 31.12.2018
validFrom:
von2:
bis2:
aktenzeichen:
schlagwoerter:
q:
method: stem
qSize: 10
sortieren_nach: relevanz
absenden: Suchen
advanced_search: true
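One way to discover such a list without reading the browser's network tab is to collect every named input of the form together with its default value. The following is only a minimal sketch: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages, the names FormFieldCollector and collect_form_fields are made up for illustration, and SAMPLE_FORM is an invented excerpt (only the "bis" line is taken from the question), not the site's real markup.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect name -> default value for every named <input> element."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            # A missing or empty value attribute becomes an empty string.
            self.fields[name] = attributes.get("value") or ""


# Hypothetical excerpt of the search form, for illustration only.
SAMPLE_FORM = """
<form method="post" action="index.php">
  <input id="von" type="text" name="von" size="10" value="">
  <input id="bis" type="text" name="bis" size="10" value="">
  <input type="hidden" name="qSize" value="10">
  <input type="submit" name="absenden" value="Suchen">
</form>
"""


def collect_form_fields(html):
    """Parse the HTML and return a dict of input names and default values."""
    parser = FormFieldCollector()
    parser.feed(html)
    parser.close()
    return parser.fields


form_defaults = collect_form_fields(SAMPLE_FORM)
print(form_defaults)
```

On the real page you would feed in the text of a GET response for the form URL. Note that this sketch only looks at input elements; fields rendered as select or textarea would need the same treatment.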
I think the qSize parameter is mandatory for your POST request.
So you have to replace your params with:
params = {
    'von': '01.01.2018',
    'bis': '31.12.2018',
    'absenden': 'Suchen',
    'qSize': 10
}
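If you want to query several date ranges, the same dictionary can be built by a small helper. This is just a sketch under assumptions: build_payload and its page_size argument are hypothetical names, and reading qSize as a page size is my guess based on the value 10. urlencode is used here only to preview the form-encoded body offline.

```python
from urllib.parse import urlencode


def build_payload(von, bis, page_size=10):
    """Return the form data for a date-range search (dates as DD.MM.YYYY)."""
    return {
        "von": von,
        "bis": bis,
        "absenden": "Suchen",
        # Assumption: qSize must be present for the search to return results.
        "qSize": page_size,
    }


payload = build_payload("01.01.2018", "31.12.2018")
# Preview the URL-encoded body that a form POST would carry.
encoded_body = urlencode(payload)
print(encoded_body)
```

The resulting payload can be passed to requests.post(url, data=payload) exactly like params in your code.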
With this change, here are my results when I print resultlinks_to_parse:
print(resultlinks_to_parse)
OUTPUT:
[
'http://www.justiz.nrw.de/nrwe/lgs/detmold/lg_detmold/j2018/03_S_69_18_Urteil_20181031.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/10_Sa_1122_17_Urteil_20180126.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/13_TaBV_10_18_Beschluss_20181123.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/10_Sa_1810_17_Urteil_20180629.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/10_Sa_1811_17_Urteil_20180629.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/11_Sa_1196_17_Urteil_20180118.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/11_Sa_1775_17_Urteil_20180614.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/11_SaGa_9_18_Urteil_20180712.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/12_Sa_748_18_Urteil_20181009.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/12_Sa_755_18_Urteil_20181106.html'
]
Upvotes: 3