Reputation: 483
I am trying to scrape a website. There is no problem when there is only one opening and one closing form tag with the data in between. But when the data on the page is displayed next to a checked checkbox, it ends up in an awkward position in the markup. Has anybody run into the same problem?
Here is a basic example of the page markup that contains the data I want:
<div class="label"></div>
<input disabled="" type="checkbox" name="t_pow_ports:f_p_a:3486" class="forminput" id="ajaxField-76" checked="">
Airport
<div class="label"></div>
<input disabled="" type="checkbox" name="t_pow_ports:f_p_b:3486" checked="" class="forminput" id="ajaxField-77">
Bunkers
<div class="label"></div>
<input disabled="" type="checkbox" name="t_pow_ports:f_p_c:3486" class="forminput" id="ajaxField-78">
Containers
<div class="label"></div>
<input disabled="" type="checkbox" name="t_pow_ports:f_p_l:3486" class="forminput" id="ajaxField-79">
Cruise
<div class="label"></div>
....
I need to fetch the labels (Airport, Bunkers, etc.) whose input tag carries the checked="" attribute.
1st problem: making sure I only get the checked values.
2nd problem: fetching the data that sits between the tags, as in
<div>..</div><input...> data <div>...</div>
By using the following code:
import requests
import bs4
from bs4 import BeautifulSoup
import pandas
r = requests.get("http://directories.lloydslist.com/?p=1635")
c = r.content
soup = BeautifulSoup(c, "html.parser")
print(soup.prettify())
all = soup.find_all("div", {"id": "section-1785-body", "class": "sectionbody"})
I get the following format:
<div class="label"></div>
<input checked="" class="forminput" disabled="" id="ajaxField-115" name="t_pow_ports:f_p_a:5779" type="checkbox"/>
Airport
<div class="label"></div>
<input checked="" class="forminput" disabled="" id="ajaxField-116" name="t_pow_ports:f_p_b:5779" type="checkbox"/>
Bunkers
<div class="label"></div>
.....
....
<input checked="" class="forminput" disabled="" id="ajaxField-119" name="t_pow_ports:f_p_y:5779" type="checkbox"/> Dry Bulk
<div class="label"></div></div>
So if I use the following code:
abc = all[0].find_all("input", {"class":"forminput"},"checked")
I don't get the data I need, only the input tags:
<input class="forminput" disabled="" id="ajaxField-20" name="t_pow_ports:f_p_a:595" type="checkbox"/>,
<input class="forminput" disabled="" id="ajaxField-21" name="t_pow_ports:f_p_b:595" type="checkbox"/>,
<input class="forminput" disabled="" id="ajaxField-22" name="t_pow_ports:f_p_c:595" type="checkbox"/>,
....
Does anyone know a way around this problem?
Upvotes: 1
Views: 1789
Reputation: 2568
The label text is not inside the input tag; it is a NavigableString that follows it. So find the checked inputs and read each one's next sibling.
Try the following:
from bs4 import BeautifulSoup as Soup
html_str = """
<div>
<div class="label"></div>
<input disabled="" type="checkbox" name="t_pow_ports:f_p_a:3486" class="forminput" id="ajaxField-76" checked=""/>
Airport
<div class="label"></div>
<input disabled="" type="checkbox" name="t_pow_ports:f_p_b:3486" checked="" class="forminput" id="ajaxField-77"/>
Bunkers
<div class="label"></div>
<input disabled="" type="checkbox" name="t_pow_ports:f_p_c:3486" class="forminput" id="ajaxField-78"/>
Containers
<div class="label"></div>
<input disabled="" type="checkbox" name="t_pow_ports:f_p_l:3486" class="forminput" id="ajaxField-79"/>
Cruise
<div class="label"></div>
</div>
"""
soup = Soup(html_str, "html.parser")
forminput = soup.find_all("input", {"class":"forminput"})
for item in forminput:
    if item.get('checked') is not None:
        # now work with the NavigableString; watch out for empty lines
        name = item.next_sibling.strip()
        print(name)
The output of this snippet is:
Airport
Bunkers
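As a variation on the snippet above, the checked filter can be moved into the find_all call itself. In the question's call, the string "checked" lands in the third positional parameter of find_all, which is recursive, so it never acts as a filter. The sketch below passes checked as part of attrs instead, where the value True means "the attribute is present". The helper name checked_labels and the trimmed sample markup are only illustrative assumptions, not part of the original page or code.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Sample markup adapted from the question (trimmed to three checkboxes).
html_str = """
<div>
<div class="label"></div>
<input disabled="" type="checkbox" name="t_pow_ports:f_p_a:3486" class="forminput" id="ajaxField-76" checked=""/>
Airport
<div class="label"></div>
<input disabled="" type="checkbox" name="t_pow_ports:f_p_b:3486" checked="" class="forminput" id="ajaxField-77"/>
Bunkers
<div class="label"></div>
<input disabled="" type="checkbox" name="t_pow_ports:f_p_c:3486" class="forminput" id="ajaxField-78"/>
Containers
<div class="label"></div>
</div>
"""


def checked_labels(markup):
    """Hypothetical helper: return the text following each checked input.forminput."""
    soup = BeautifulSoup(markup, "html.parser")
    labels = []
    # "checked": True matches any tag that has the attribute, even with an empty value.
    for tag in soup.find_all("input", attrs={"class": "forminput", "checked": True}):
        sibling = tag.next_sibling
        # Only accept a text node that is not just whitespace; skip tags and None.
        if isinstance(sibling, NavigableString) and sibling.strip():
            labels.append(sibling.strip())
    return labels


labels = checked_labels(html_str)
print(labels)
```

The isinstance check guards against the case where the input is immediately followed by another tag or by nothing at all, in which case calling strip() directly on next_sibling would fail. A CSS selector such as soup.select("input.forminput[checked]") should select the same tags if you prefer that style.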
Upvotes: 1