Divyansh Tiwari

Reputation: 95

Getting text between two HTML tags: Python web scraping (text getting skipped when iterating the result set)

I'm trying to scrape a website to get some text. This is what I have done so far:

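import urllib.request
from bs4 import BeautifulSoup

# url is the address of the page being scraped
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, "lxml")
ans = soup.find_all("label")
print(ans)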

And this is the output:

[<label for="q8086-1"><input id="q8086-1" name="q8086" onclick="check_answer('q8086correct','q8086incorrect','1','1');" type="radio"/>Japan
 </label>,
 <label for="q8086-2"><input id="q8086-2" name="q8086" onclick="check_answer('q8086correct','q8086incorrect','2','1');" type="radio"/>France
 </label>,
 <label for="q8086-3"><input id="q8086-3" name="q8086" onclick="check_answer('q8086correct','q8086incorrect','3','1');" type="radio"/>Germany
 </label>,
 <label for="q8086-4"><input id="q8086-4" name="q8086" onclick="check_answer('q8086correct','q8086incorrect','4','1');" type="radio"/>England</label>,
 <label for="q8085-1"><input id="q8085-1" name="q8085" onclick="check_answer('q8085correct','q8085incorrect','1','2');" type="radio"/>PAL
 </label>,
 <label for="q8085-2"><input id="q8085-2" name="q8085" onclick="check_answer('q8085correct','q8085incorrect','2','2');" type="radio"/>NTSC
 </label>,
 <label for="q8085-3"><input id="q8085-3" name="q8085" onclick="check_answer('q8085correct','q8085incorrect','3','2');" type="radio"/>SECAM
 </label>,
 <label for="q8085-4"><input id="q8085-4" name="q8085" onclick="check_answer('q8085correct','q8085incorrect','4','2');" type="radio"/>RGB</label>,
 <label for="q8082-1"><input id="q8082-1" name="q8082" onclick="check_answer('q8082correct','q8082incorrect','1','1');" type="radio"/>Neon Lighting
 </label>,
 <label for="q8082-2"><input id="q8082-2" name="q8082" onclick="check_answer('q8082correct','q8082incorrect','2','1');" type="radio"/>High Pressure Sodium Lighting
 </label>,
 <label for="q8082-3"><input id="q8082-3" name="q8082" onclick="check_answer('q8082correct','q8082incorrect','3','1');" type="radio"/>Water Features
 </label>,
 <label for="q8082-4"><input id="q8082-4" name="q8082" onclick="check_answer('q8082correct','q8082incorrect','4','1');" type="radio"/>Hotel Rooms</label>,
 <label for="q8079-1"><input id="q8079-1" name="q8079" onclick="check_answer('q8079correct','q8079incorrect','1','2');" type="radio"/>New Zealand
 </label>,
 <label for="q8079-2"><input id="q8079-2" name="q8079" onclick="check_answer('q8079correct','q8079incorrect','2','2');" type="radio"/>Australia
 </label>,
 <label for="q8079-3"><input id="q8079-3" name="q8079" onclick="check_answer('q8079correct','q8079incorrect','3','2');" type="radio"/>Argentina
 </label>,
 <label for="q8079-4"><input id="q8079-4" name="q8079" onclick="check_answer('q8079correct','q8079incorrect','4','2');" type="radio"/>United Kingdom</label>,
 <label for="q8078-1"><input id="q8078-1" name="q8078" onclick="check_answer('q8078correct','q8078incorrect','1','3');" type="radio"/>Federated States of Micronesia
 </label>,
 <label for="q8078-2"><input id="q8078-2" name="q8078" onclick="check_answer('q8078correct','q8078incorrect','2','3');" type="radio"/>Palau
 </label>,
 <label for="q8078-3"><input id="q8078-3" name="q8078" onclick="check_answer('q8078correct','q8078incorrect','3','3');" type="radio"/>Northern Mariana Islands
 </label>,
 <label for="q8078-4"><input id="q8078-4" name="q8078" onclick="check_answer('q8078correct','q8078incorrect','4','3');" type="radio"/>Guam</label>,
 <label for="q8077-1"><input id="q8077-1" name="q8077" onclick="check_answer('q8077correct','q8077incorrect','1','4');" type="radio"/>Germany
 </label>,
 <label for="q8077-2"><input id="q8077-2" name="q8077" onclick="check_answer('q8077correct','q8077incorrect','2','4');" type="radio"/>United Kingdom
 </label>,
 <label for="q8077-3"><input id="q8077-3" name="q8077" onclick="check_answer('q8077correct','q8077incorrect','3','4');" type="radio"/>France
 </label>,
 <label for="q8077-4"><input id="q8077-4" name="q8077" onclick="check_answer('q8077correct','q8077incorrect','4','4');" type="radio"/>Japan</label>,
 <label for="q8076-1"><input id="q8076-1" name="q8076" onclick="check_answer('q8076correct','q8076incorrect','1','1');" type="radio"/>Indonesia
 </label>,
 <label for="q8076-2"><input id="q8076-2" name="q8076" onclick="check_answer('q8076correct','q8076incorrect','2','1');" type="radio"/>Iceland
 </label>,
 <label for="q8076-3"><input id="q8076-3" name="q8076" onclick="check_answer('q8076correct','q8076incorrect','3','1');" type="radio"/>Italy
 </label>,
 <label for="q8076-4"><input id="q8076-4" name="q8076" onclick="check_answer('q8076correct','q8076incorrect','4','1');" type="radio"/>India</label>,
 <label for="q1758-1"><input id="q1758-1" name="q1758" onclick="check_answer('q1758correct','q1758incorrect','1','1');" type="radio"/>6
 </label>,
 <label for="q1758-2"><input id="q1758-2" name="q1758" onclick="check_answer('q1758correct','q1758incorrect','2','1');" type="radio"/>5
 </label>,
 <label for="q1758-3"><input id="q1758-3" name="q1758" onclick="check_answer('q1758correct','q1758incorrect','3','1');" type="radio"/>4
 </label>,
 <label for="q1758-4"><input id="q1758-4" name="q1758" onclick="check_answer('q1758correct','q1758incorrect','4','1');" type="radio"/>7</label>,
 <label for="q1756-1"><input id="q1756-1" name="q1756" onclick="check_answer('q1756correct','q1756incorrect','1','2');" type="radio"/>Sumerians
 </label>,
 <label for="q1756-2"><input id="q1756-2" name="q1756" onclick="check_answer('q1756correct','q1756incorrect','2','2');" type="radio"/>Ottoman
 </label>,
 <label for="q1756-3"><input id="q1756-3" name="q1756" onclick="check_answer('q1756correct','q1756incorrect','3','2');" type="radio"/>Babylonian
 </label>,
 <label for="q1756-4"><input id="q1756-4" name="q1756" onclick="check_answer('q1756correct','q1756incorrect','4','2');" type="radio"/>Assyrians</label>,
 <label for="q1755-1"><input id="q1755-1" name="q1755" onclick="check_answer('q1755correct','q1755incorrect','1','2');" type="radio"/>1922
 </label>,
 <label for="q1755-2"><input id="q1755-2" name="q1755" onclick="check_answer('q1755correct','q1755incorrect','2','2');" type="radio"/>1932
 </label>,
 <label for="q1755-3"><input id="q1755-3" name="q1755" onclick="check_answer('q1755correct','q1755incorrect','3','2');" type="radio"/>1912
 </label>,
 <label for="q1755-4"><input id="q1755-4" name="q1755" onclick="check_answer('q1755correct','q1755incorrect','4','2');" type="radio"/>1942</label>]

Now I want to get the text between the label tags, for example Japan and England. But when I iterate over the result set, some entries are skipped:

for i in ans:
    print(i)

</label>
</label>
</label>
<label for="q8086-4"><input id="q8086-4" name="q8086" onclick="check_answer('q8086correct','q8086incorrect','4','1');" type="radio"/>England</label>
</label>
</label>
</label>
<label for="q8085-4"><input id="q8085-4" name="q8085" onclick="check_answer('q8085correct','q8085incorrect','4','2');" type="radio"/>RGB</label>
</label>
</label>
</label>
<label for="q8082-4"><input id="q8082-4" name="q8082" onclick="check_answer('q8082correct','q8082incorrect','4','1');" type="radio"/>Hotel Rooms</label>
</label>
</label>
</label>
<label for="q8079-4"><input id="q8079-4" name="q8079" onclick="check_answer('q8079correct','q8079incorrect','4','2');" type="radio"/>United Kingdom</label>
</label>
</label>
</label>
<label for="q8078-4"><input id="q8078-4" name="q8078" onclick="check_answer('q8078correct','q8078incorrect','4','3');" type="radio"/>Guam</label>
</label>
</label>
</label>
<label for="q8077-4"><input id="q8077-4" name="q8077" onclick="check_answer('q8077correct','q8077incorrect','4','4');" type="radio"/>Japan</label>
</label>
</label>
</label>
<label for="q8076-4"><input id="q8076-4" name="q8076" onclick="check_answer('q8076correct','q8076incorrect','4','1');" type="radio"/>India</label>
</label>
</label>
</label>
<label for="q1758-4"><input id="q1758-4" name="q1758" onclick="check_answer('q1758correct','q1758incorrect','4','1');" type="radio"/>7</label>
</label>
</label>
</label>
<label for="q1756-4"><input id="q1756-4" name="q1756" onclick="check_answer('q1756correct','q1756incorrect','4','2');" type="radio"/>Assyrians</label>
</label>
</label>
</label>
<label for="q1755-4"><input id="q1755-4" name="q1755" onclick="check_answer('q1755correct','q1755incorrect','4','2');" type="radio"/>1942</label>

Can anyone tell me a method to get all the entries?

Additional help, if possible: each tag also contains an onclick attribute that encodes the correct option, for example onclick="check_answer('q1755correct','q1755incorrect','4','2');". It would be good to fetch that too, but it is not the main priority here.

Upvotes: 1

Views: 107

Answers (1)

Andrej Kesely

Reputation: 195553

This script prints every question with its answers and marks the correct answer with <-- CORRECT:

import requests
from bs4 import BeautifulSoup


url = 'https://www.atrochatro.com/quiz_world.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Select every <blockquote> that contains a <b> tag (one per question)
for question in soup.select('blockquote:has(b)'):
    # The question text is the next text node after the <b> tag
    q = question.b.find_next_sibling(string=True).strip()
    print(q)
    # The last quoted argument of check_answer(...) is the number of the correct option
    correct = int(question.input['onclick'].split("'")[-2])
    for i, label in enumerate(question.select('label'), 1):
        print('{:<30} {}'.format(label.text.strip(), '<-- CORRECT' if i == correct else ''))
    print('-' * 80)

Prints:

The NTSC (National Television Standards Committee) is also used in the country of...?
Japan                          <-- CORRECT
France                         
Germany                        
England                        
--------------------------------------------------------------------------------
In the United States the television broadcast standard is...?
PAL                            
NTSC                           <-- CORRECT
SECAM                          
RGB                            
--------------------------------------------------------------------------------
In the UK, what type of installation requires a fireman's switch?
Neon Lighting                  <-- CORRECT
High Pressure Sodium Lighting  
Water Features                 
Hotel Rooms                    
--------------------------------------------------------------------------------
Which country's Antarctic claim covers the greatest swath of longitude?
New Zealand                    
Australia                      <-- CORRECT
Argentina                      
United Kingdom                 
--------------------------------------------------------------------------------
Which Pacific entity is farthest north?
Federated States of Micronesia 
Palau                          
Northern Mariana Islands       <-- CORRECT
Guam                           
--------------------------------------------------------------------------------
Which country follows the United States and China in total number of Internet users?
Germany                        
United Kingdom                 
France                         
Japan                          <-- CORRECT
--------------------------------------------------------------------------------
Which country has the lowest rate of newspaper circulation per capita?
Indonesia                      <-- CORRECT
Iceland                        
Italy                          
India                          
--------------------------------------------------------------------------------
Iraq borders with how many countries?
6                              <-- CORRECT
5                              
4                              
7                              
--------------------------------------------------------------------------------
In 1917-18 Iraq became independent from which Empire?
Sumerians                      
Ottoman                        <-- CORRECT
Babylonian                     
Assyrians                      
--------------------------------------------------------------------------------
In which year did the Republic of Iraq become independent?
1922                           
1932                           <-- CORRECT
1912                           
1942                           
--------------------------------------------------------------------------------
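
If only the label text and the onclick arguments are needed, a smaller sketch can work directly on the label tags from the question. The HTML below is two labels copied from the question's output and inlined so the example runs without network access. The helper name label_info is hypothetical, and reading the last two quoted arguments of check_answer as "this option's number" and "the correct option's number" is an assumption based on the data shown above.

```python
import re
from bs4 import BeautifulSoup

# Two labels copied from the question's output, inlined for a self-contained run.
html = """
<label for="q8086-1"><input id="q8086-1" name="q8086" onclick="check_answer('q8086correct','q8086incorrect','1','1');" type="radio"/>Japan
 </label>
<label for="q8086-4"><input id="q8086-4" name="q8086" onclick="check_answer('q8086correct','q8086incorrect','4','1');" type="radio"/>England</label>
"""

soup = BeautifulSoup(html, "html.parser")


def label_info(label):
    """Return (text, option_number, correct_number) for one <label> tag.

    Hypothetical helper; assumes the last two quoted arguments of
    check_answer(...) are the option number and the correct option number.
    """
    # get_text(strip=True) removes the trailing newline and spaces
    # that follow some of the answers.
    text = label.get_text(strip=True)
    # Collect every single-quoted argument from the onclick attribute.
    args = re.findall(r"'([^']*)'", label.input["onclick"])
    return text, int(args[-2]), int(args[-1])


results = [label_info(label) for label in soup.find_all("label")]
for text, option, correct in results:
    print(text, "<-- CORRECT" if option == correct else "")
```

Calling get_text(strip=True) on each label, rather than printing the tag itself, returns just the answer text regardless of whether the closing tag sits on the same line.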

Upvotes: 1
