Web Scraping of nested div elements with repeating class names

Question


    Statewise
    Cases Across India

    Category               Total          Last 24 hrs
    Active Cases (1.21%)   3,86,351       2,157
    Discharged (97.45%)    3,12,20,981    40,013
    Deaths (1.34%)         4,29,179       497
    Total Cases            3,20,36,511    38,353

I am working on a web scraping project using Python and BeautifulSoup. As a beginner, I am unable to parse the data I need (numerical COVID statistics) because the class names that hold the numbers, such as icount, per_block, and increase_block, are repeated rather than unique. I want to parse only these numbers and store them in separate variables, like this:

Total_cases = 3,20,36,511
Total_cases_in_last_24_hrs = 38,353

and likewise for all other categories (discharged, deaths, active cases).

Here is my code-

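    import requests
    from bs4 import BeautifulSoup

    URL = 'https://www.mygov.in/covid-19/'
    # The original headers dict was not shown; this value is a placeholder.
    headers = {'User-Agent': 'Mozilla/5.0'}
    page = requests.get(URL, headers=headers)
    clean_data = BeautifulSoup(page.text, 'html.parser')
    span = clean_data.find_all('span', class_='icount')
    #print(clean_data)

    # The attempt that does not return the expected value:
    total_cases = clean_data.find("div", class_="iblock t_case",
                                  attrs={'spanclass': 'icount'}).get_text()
    print(total_cases)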

I have been working on this for a long time but could not find a solution. Please help. The reference markup above is taken from the website.

Thank You.

Andrej Kesely · Accepted Answer

One possible solution is to select all the text inside the element with class="t_case" and split it:

import requests
from bs4 import BeautifulSoup

url = "https://www.mygov.in/covid-19/"

soup = BeautifulSoup(requests.get(url).content, "html.parser")
_, total_cases, new_cases = (
    soup.select_one(".t_case").get_text(strip=True, separator="|").split("|")
)
print(total_cases)
print(new_cases)

Prints:

3,20,36,511
38,353
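
To see why the split works, the same call can be tried offline on a small snippet. This is only a sketch: the markup below is a guess reconstructed from the class names mentioned in the question (iblock, t_case, icount, increase_block) and the answer (color-red), not the real page source, and the name SAMPLE_HTML is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the tag layout is assumed, only the class names
# come from the question and the answer.
SAMPLE_HTML = """
<div class="iblock t_case">
  <div>Total Cases</div>
  <span class="icount">3,20,36,511</span>
  <div class="increase_block">
    <div class="color-red">38,353</div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# strip=True drops whitespace-only strings; separator="|" joins the rest
text = soup.select_one(".t_case").get_text(strip=True, separator="|")
parts = text.split("|")

# First piece is the label, then the total, then the daily change
label, total_cases, new_cases = parts
print(parts)
```

The unpacking into three names only works if the block yields exactly three non-empty strings, so a layout change on the site would raise a ValueError rather than return wrong numbers silently.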

Or select the nested elements directly by class:

t_case = soup.select_one(".t_case")

total_cases = t_case.select_one(".icount")
new_cases = t_case.select_one(".color-red, .color-green")

print(total_cases.get_text(strip=True))
print(new_cases.get_text(strip=True))
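
The question also asks for the other categories and for values stored in variables. The class-based lookup above could be extended along these lines. This is a sketch under assumptions: the sample markup is hypothetical (only iblock, t_case, icount, color-red and color-green appear in this thread), and the helpers to_int and parse_blocks are names invented here, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the tag layout is assumed, only the class names
# iblock, t_case, icount, color-red and color-green come from the thread.
SAMPLE_HTML = """
<div class="iblock">
  <span class="icount">3,86,351</span>
  <div class="color-green">2,157</div>
  <div>Active Cases <small>(1.21%)</small></div>
</div>
<div class="iblock t_case">
  <div>Total Cases</div>
  <span class="icount">3,20,36,511</span>
  <div class="color-red">38,353</div>
</div>
"""


def to_int(text):
    """Convert a comma-grouped number such as '3,20,36,511' to an int."""
    return int(text.replace(",", "").strip())


def parse_blocks(html):
    """Map each block's label to (total, change_in_last_24_hrs)."""
    soup = BeautifulSoup(html, "html.parser")
    stats = {}
    for block in soup.select(".iblock"):
        total = block.select_one(".icount")
        change = block.select_one(".color-red, .color-green")
        if total is None or change is None:
            continue  # skip blocks that do not match the assumed layout
        # Use the first string that contains a letter as the label
        label = next(
            (s for s in block.stripped_strings if any(c.isalpha() for c in s)),
            None,
        )
        if label is not None:
            stats[label] = (to_int(total.get_text()), to_int(change.get_text()))
    return stats


stats = parse_blocks(SAMPLE_HTML)
total_cases, total_cases_last_24_hrs = stats["Total Cases"]
print(stats)
```

Keying the result by the visible label avoids guessing the class names of the other blocks, at the cost of depending on the label text staying the same.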
