Reputation: 587
I am scraping HTML that contains a <select> tag for a dropdown. I am looking for the selected item, i.e. the <option> tag that has the selected attribute.
As far as I understand, this code should do the trick:
from bs4 import BeautifulSoup as bs

# result_1 is the response returned by the HTTP request
soup_1 = bs(result_1.content, 'lxml')
title = soup_1.find('select', {'id': 'id_document'})
title2 = title.findAll('option')
for tit in title2:
    print(tit)
    if tit.has_attr('selected'):
        print("found")
        print(tit.getText())
Yet the list of <option> tags is very long (the console shows >5000 records) and bs/lxml seems to scan only 29 of them.
Is there a way to have it scan them all, or a more efficient way to perform the search? I have searched around, but apart from a vague similarity to old bugs, I could not find either a cause or a solution.
I also tried find('select', {'selected':""}), but all records seem to satisfy the condition, even though only one actually has that attribute in the HTML. I also could not work out whether searching with find checks all the entries or faces the same limitation.
Thanks
Edit: Here's a portion/sample of the html I am trying to extract info from:
<select name="document" id="id_document" required>
<option value="">---------</option>
<option value="294">Tutorials | Inkscape</option>
<option value="241">Traduzione testo Mean - Taylor wift</option>
<option value="243">http://www.angularjsbook.com/angular-basics/chapters/basics/</option>
<option value="2521">script WLF 101 - Google Docs</option>
<option value="290">LyX wiki | Layouts / Layouts</option>
<option value="257">10Part2Chap7</option>
<option value="296">Inkscape tutorial: Advanced | Inkscape</option>
<option value="261">http://www.bankofengland.co.uk/banknotes/Pages/about/faqs.aspx</option>
<option value="273">Nuvolaverde - Home</option>
<option value="240">BLACK EYED PEAS LYRICS - Where Is The Love?</option>
<option value="2527">How to Start a Blog In The Most Cluttered Marketplace In History</option>
<option value="2528">3 Simple Steps to Silencing Your Inner Critic – Matthew E. May – Medium</option>
.... (some 5K more lines)
<option value="4082">Lietuva - Prancūzija Tiesiogiai. Rugsėjo 7 d. 15:00 val. | TVPlay</option>
<option value="4083">Google Calendar - settembre 2019</option>
<option value="4084">Google Calendar - settembre 2019</option>
<option value="4085" selected>Estructura de datos</option>
</select>
(Interesting discrepancy between the source code of the page, where the last line of the option list has just the attribute selected, and the console, where the same attribute is shown as selected="")
Upvotes: 1
Views: 91
Reputation: 4763
We verified that the code works for the purposes of identifying the selected option, even with the high quantity of options, which were entered as a string for testing purposes.
from bs4 import BeautifulSoup

content = ''' <String of text sent via pastebin here>
'''
soup_1 = BeautifulSoup(content, 'lxml')
title = soup_1.find('select', {'id': 'id_document'})
title2 = title.findAll('option')
for tit in title2:
    if tit.has_attr('selected'):
        print("found")
        print(tit.getText())
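On the question of a more direct search: instead of looping over every option, the attribute filter can be handed to BeautifulSoup itself. The following is only a minimal sketch under some assumptions: the function name find_selected_option and the shortened SAMPLE_HTML are made up for illustration, and the built-in 'html.parser' is used so the snippet does not depend on lxml being installed.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the real page (hypothetical sample, not the full list).
SAMPLE_HTML = '''
<select name="document" id="id_document" required>
<option value="">---------</option>
<option value="294">Tutorials | Inkscape</option>
<option value="4085" selected>Estructura de datos</option>
</select>
'''


def find_selected_option(html, select_id, parser="html.parser"):
    """Return the first <option> carrying a selected attribute, or None."""
    soup = BeautifulSoup(html, parser)
    select = soup.find("select", id=select_id)
    if select is None:
        return None
    # selected=True matches any tag that has the attribute, whatever its value,
    # so both `selected` and `selected=""` are found.
    return select.find("option", selected=True)


option = find_selected_option(SAMPLE_HTML, "id_document")
if option is not None:
    print(option["value"], option.get_text())
```

The same lookup can also be written as a CSS selector, soup.select_one('select#id_document option[selected]'). Either form only helps if the option is present in the parsed document in the first place.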
In our chat, we determined that the problem is likely in scraping the tags from the website rather than in the processing of the data. Anyone else who stumbles upon this should check that their request.content or content actually contains the information which they wish to scrape.
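One rough way to run that check is to compare how many option tags appear in the raw text with how many the parser returns. This is a sketch only: compare_option_counts and the sample string are hypothetical names for illustration, and 'html.parser' stands in for whichever parser is actually in use.

```python
from bs4 import BeautifulSoup


def compare_option_counts(raw_html, parser="html.parser"):
    """Return (options in raw text, options after parsing, selected option found)."""
    # Crude textual count of opening <option tags in the downloaded markup.
    raw_count = raw_html.lower().count("<option")
    soup = BeautifulSoup(raw_html, parser)
    parsed_count = len(soup.find_all("option"))
    has_selected = soup.find("option", selected=True) is not None
    return raw_count, parsed_count, has_selected


# Hypothetical stand-in for the text of the downloaded page.
sample = '''
<select name="document" id="id_document" required>
<option value="">---------</option>
<option value="294">Tutorials | Inkscape</option>
<option value="4085" selected>Estructura de datos</option>
</select>
'''

raw_count, parsed_count, has_selected = compare_option_counts(sample)
print(raw_count, parsed_count, has_selected)
```

With a real response, the text of the response (for example result_1.text, assuming result_1 comes from requests) would be passed in. If the raw count is already small, the page that was downloaded differs from the one seen in the browser; if the raw count is large but the parsed count is not, the parser is the more likely suspect and trying a different one may be worthwhile.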
Upvotes: 1