Reputation: 95
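I'm trying to scrape a website to get some text. This is what I have so far:
import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.atrochatro.com/quiz_world.html'  # the quiz page being scraped
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, "lxml")
ans = soup.find_all("label")
print(ans)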
And this is the output:
[<label for="q8086-1"><input id="q8086-1" name="q8086" onclick="check_answer('q8086correct','q8086incorrect','1','1');" type="radio"/>Japan
</label>,
<label for="q8086-2"><input id="q8086-2" name="q8086" onclick="check_answer('q8086correct','q8086incorrect','2','1');" type="radio"/>France
</label>,
<label for="q8086-3"><input id="q8086-3" name="q8086" onclick="check_answer('q8086correct','q8086incorrect','3','1');" type="radio"/>Germany
</label>,
<label for="q8086-4"><input id="q8086-4" name="q8086" onclick="check_answer('q8086correct','q8086incorrect','4','1');" type="radio"/>England</label>,
<label for="q8085-1"><input id="q8085-1" name="q8085" onclick="check_answer('q8085correct','q8085incorrect','1','2');" type="radio"/>PAL
</label>,
<label for="q8085-2"><input id="q8085-2" name="q8085" onclick="check_answer('q8085correct','q8085incorrect','2','2');" type="radio"/>NTSC
</label>,
<label for="q8085-3"><input id="q8085-3" name="q8085" onclick="check_answer('q8085correct','q8085incorrect','3','2');" type="radio"/>SECAM
</label>,
<label for="q8085-4"><input id="q8085-4" name="q8085" onclick="check_answer('q8085correct','q8085incorrect','4','2');" type="radio"/>RGB</label>,
<label for="q8082-1"><input id="q8082-1" name="q8082" onclick="check_answer('q8082correct','q8082incorrect','1','1');" type="radio"/>Neon Lighting
</label>,
<label for="q8082-2"><input id="q8082-2" name="q8082" onclick="check_answer('q8082correct','q8082incorrect','2','1');" type="radio"/>High Pressure Sodium Lighting
</label>,
<label for="q8082-3"><input id="q8082-3" name="q8082" onclick="check_answer('q8082correct','q8082incorrect','3','1');" type="radio"/>Water Features
</label>,
<label for="q8082-4"><input id="q8082-4" name="q8082" onclick="check_answer('q8082correct','q8082incorrect','4','1');" type="radio"/>Hotel Rooms</label>,
<label for="q8079-1"><input id="q8079-1" name="q8079" onclick="check_answer('q8079correct','q8079incorrect','1','2');" type="radio"/>New Zealand
</label>,
<label for="q8079-2"><input id="q8079-2" name="q8079" onclick="check_answer('q8079correct','q8079incorrect','2','2');" type="radio"/>Australia
</label>,
<label for="q8079-3"><input id="q8079-3" name="q8079" onclick="check_answer('q8079correct','q8079incorrect','3','2');" type="radio"/>Argentina
</label>,
<label for="q8079-4"><input id="q8079-4" name="q8079" onclick="check_answer('q8079correct','q8079incorrect','4','2');" type="radio"/>United Kingdom</label>,
<label for="q8078-1"><input id="q8078-1" name="q8078" onclick="check_answer('q8078correct','q8078incorrect','1','3');" type="radio"/>Federated States of Micronesia
</label>,
<label for="q8078-2"><input id="q8078-2" name="q8078" onclick="check_answer('q8078correct','q8078incorrect','2','3');" type="radio"/>Palau
</label>,
<label for="q8078-3"><input id="q8078-3" name="q8078" onclick="check_answer('q8078correct','q8078incorrect','3','3');" type="radio"/>Northern Mariana Islands
</label>,
<label for="q8078-4"><input id="q8078-4" name="q8078" onclick="check_answer('q8078correct','q8078incorrect','4','3');" type="radio"/>Guam</label>,
<label for="q8077-1"><input id="q8077-1" name="q8077" onclick="check_answer('q8077correct','q8077incorrect','1','4');" type="radio"/>Germany
</label>,
<label for="q8077-2"><input id="q8077-2" name="q8077" onclick="check_answer('q8077correct','q8077incorrect','2','4');" type="radio"/>United Kingdom
</label>,
<label for="q8077-3"><input id="q8077-3" name="q8077" onclick="check_answer('q8077correct','q8077incorrect','3','4');" type="radio"/>France
</label>,
<label for="q8077-4"><input id="q8077-4" name="q8077" onclick="check_answer('q8077correct','q8077incorrect','4','4');" type="radio"/>Japan</label>,
<label for="q8076-1"><input id="q8076-1" name="q8076" onclick="check_answer('q8076correct','q8076incorrect','1','1');" type="radio"/>Indonesia
</label>,
<label for="q8076-2"><input id="q8076-2" name="q8076" onclick="check_answer('q8076correct','q8076incorrect','2','1');" type="radio"/>Iceland
</label>,
<label for="q8076-3"><input id="q8076-3" name="q8076" onclick="check_answer('q8076correct','q8076incorrect','3','1');" type="radio"/>Italy
</label>,
<label for="q8076-4"><input id="q8076-4" name="q8076" onclick="check_answer('q8076correct','q8076incorrect','4','1');" type="radio"/>India</label>,
<label for="q1758-1"><input id="q1758-1" name="q1758" onclick="check_answer('q1758correct','q1758incorrect','1','1');" type="radio"/>6
</label>,
<label for="q1758-2"><input id="q1758-2" name="q1758" onclick="check_answer('q1758correct','q1758incorrect','2','1');" type="radio"/>5
</label>,
<label for="q1758-3"><input id="q1758-3" name="q1758" onclick="check_answer('q1758correct','q1758incorrect','3','1');" type="radio"/>4
</label>,
<label for="q1758-4"><input id="q1758-4" name="q1758" onclick="check_answer('q1758correct','q1758incorrect','4','1');" type="radio"/>7</label>,
<label for="q1756-1"><input id="q1756-1" name="q1756" onclick="check_answer('q1756correct','q1756incorrect','1','2');" type="radio"/>Sumerians
</label>,
<label for="q1756-2"><input id="q1756-2" name="q1756" onclick="check_answer('q1756correct','q1756incorrect','2','2');" type="radio"/>Ottoman
</label>,
<label for="q1756-3"><input id="q1756-3" name="q1756" onclick="check_answer('q1756correct','q1756incorrect','3','2');" type="radio"/>Babylonian
</label>,
<label for="q1756-4"><input id="q1756-4" name="q1756" onclick="check_answer('q1756correct','q1756incorrect','4','2');" type="radio"/>Assyrians</label>,
<label for="q1755-1"><input id="q1755-1" name="q1755" onclick="check_answer('q1755correct','q1755incorrect','1','2');" type="radio"/>1922
</label>,
<label for="q1755-2"><input id="q1755-2" name="q1755" onclick="check_answer('q1755correct','q1755incorrect','2','2');" type="radio"/>1932
</label>,
<label for="q1755-3"><input id="q1755-3" name="q1755" onclick="check_answer('q1755correct','q1755incorrect','3','2');" type="radio"/>1912
</label>,
<label for="q1755-4"><input id="q1755-4" name="q1755" onclick="check_answer('q1755correct','q1755incorrect','4','2');" type="radio"/>1942</label>]
Now I want to get the text between the label tags, for example: Japan, England. But when I iterate over the result set, it skips some entries.
for i in ans:
    print(i)
</label>
</label>
</label>
<label for="q8086-4"><input id="q8086-4" name="q8086" onclick="check_answer('q8086correct','q8086incorrect','4','1');" type="radio"/>England</label>
</label>
</label>
</label>
<label for="q8085-4"><input id="q8085-4" name="q8085" onclick="check_answer('q8085correct','q8085incorrect','4','2');" type="radio"/>RGB</label>
</label>
</label>
</label>
<label for="q8082-4"><input id="q8082-4" name="q8082" onclick="check_answer('q8082correct','q8082incorrect','4','1');" type="radio"/>Hotel Rooms</label>
</label>
</label>
</label>
<label for="q8079-4"><input id="q8079-4" name="q8079" onclick="check_answer('q8079correct','q8079incorrect','4','2');" type="radio"/>United Kingdom</label>
</label>
</label>
</label>
<label for="q8078-4"><input id="q8078-4" name="q8078" onclick="check_answer('q8078correct','q8078incorrect','4','3');" type="radio"/>Guam</label>
</label>
</label>
</label>
<label for="q8077-4"><input id="q8077-4" name="q8077" onclick="check_answer('q8077correct','q8077incorrect','4','4');" type="radio"/>Japan</label>
</label>
</label>
</label>
<label for="q8076-4"><input id="q8076-4" name="q8076" onclick="check_answer('q8076correct','q8076incorrect','4','1');" type="radio"/>India</label>
</label>
</label>
</label>
<label for="q1758-4"><input id="q1758-4" name="q1758" onclick="check_answer('q1758correct','q1758incorrect','4','1');" type="radio"/>7</label>
</label>
</label>
</label>
<label for="q1756-4"><input id="q1756-4" name="q1756" onclick="check_answer('q1756correct','q1756incorrect','4','2');" type="radio"/>Assyrians</label>
</label>
</label>
</label>
<label for="q1755-4"><input id="q1755-4" name="q1755" onclick="check_answer('q1755correct','q1755incorrect','4','2');" type="radio"/>1942</label>
Can anyone tell me how to get all the entries?
Additional help, if possible: each tag also has an onclick attribute that encodes the correct option, e.g. onclick="check_answer('q1755correct','q1755incorrect','4','2');". Fetching that too would be good, but it is not the main priority here.
Upvotes: 1
Views: 107
Reputation: 195553
This script prints all questions and answers, and marks the correct answer with <-- CORRECT:
import requests
from bs4 import BeautifulSoup

url = 'https://www.atrochatro.com/quiz_world.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Each question sits in a <blockquote> that contains a <b> tag.
for question in soup.select('blockquote:has(b)'):
    # The question text is the first text node after the <b> tag.
    q = question.b.find_next_sibling(text=True).strip()
    print(q)
    # The last argument of check_answer(...) is the number of the correct option.
    correct = int(question.input['onclick'].split("'")[-2])
    for i, l in enumerate(question.select('label'), 1):
        print('{:<30} {}'.format(l.text.strip(), '<-- CORRECT' if i == correct else ''))
    print('-' * 80)
Prints:
The NTSC (National Television Standards Committee) is also used in the country of...?
Japan <-- CORRECT
France
Germany
England
--------------------------------------------------------------------------------
In the United States the television broadcast standard is...?
PAL
NTSC <-- CORRECT
SECAM
RGB
--------------------------------------------------------------------------------
In the UK, what type of installation requires a fireman's switch?
Neon Lighting <-- CORRECT
High Pressure Sodium Lighting
Water Features
Hotel Rooms
--------------------------------------------------------------------------------
Which country's Antarctic claim covers the greatest swath of longitude?
New Zealand
Australia <-- CORRECT
Argentina
United Kingdom
--------------------------------------------------------------------------------
Which Pacific entity is farthest north?
Federated States of Micronesia
Palau
Northern Mariana Islands <-- CORRECT
Guam
--------------------------------------------------------------------------------
Which country follows the United States and China in total number of Internet users?
Germany
United Kingdom
France
Japan <-- CORRECT
--------------------------------------------------------------------------------
Which country has the lowest rate of newspaper circulation per capita?
Indonesia <-- CORRECT
Iceland
Italy
India
--------------------------------------------------------------------------------
Iraq borders with how many countries?
6 <-- CORRECT
5
4
7
--------------------------------------------------------------------------------
In 1917-18 Iraq became independent from which Empire?
Sumerians
Ottoman <-- CORRECT
Babylonian
Assyrians
--------------------------------------------------------------------------------
In which year did the Republic of Iraq become independent?
1922
1932 <-- CORRECT
1912
1942
--------------------------------------------------------------------------------
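If you only need the label text and the values inside onclick, without the question text, the same splitting idea can be tried on a hard-coded snippet. This is a minimal sketch, not a tested solution for the live page: the HTML is copied from the output in the question, parse_label is a hypothetical helper name, and reading the third argument as the option number and the fourth as the correct option is an assumption based on the output above.

```python
from bs4 import BeautifulSoup

# Two labels copied from the output shown in the question.
html = """
<label for="q8085-1"><input id="q8085-1" name="q8085" onclick="check_answer('q8085correct','q8085incorrect','1','2');" type="radio"/>PAL
</label>
<label for="q8085-2"><input id="q8085-2" name="q8085" onclick="check_answer('q8085correct','q8085incorrect','2','2');" type="radio"/>NTSC
</label>
"""


def parse_label(label):
    """Return (answer text, option number, correct option number) for one <label>.

    Hypothetical helper; the meaning of the last two arguments is assumed.
    """
    # Splitting on single quotes leaves the four quoted arguments at odd indices.
    args = label.input["onclick"].split("'")[1::2]
    # get_text(strip=True) drops the trailing newline inside the label.
    return label.get_text(strip=True), int(args[2]), int(args[3])


soup = BeautifulSoup(html, "html.parser")
rows = [parse_label(label) for label in soup.find_all("label")]

for text, option, correct in rows:
    print(text, "<-- CORRECT" if option == correct else "")
```

Using get_text(strip=True) on each label removes the surrounding whitespace and line breaks, so every entry yields clean text regardless of whether the closing tag is on the same line.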
Upvotes: 1