How to scrape a table from a webpage and exclude specific tables nested inside the table's <td> tags

Question

I want to scrape a table from a specific webpage. The problem is that some of the table's td cells contain a nested span tag, which in turn contains another nested table.

The webpage that I want to scrape is the following: Click here.

I have included a small sample of the table that I want to scrape, with the nested table contained within a span tag with the class tooltip-icon. How can I exclude the contents of these specific span tags when scraping the whole table?


Abhanpur
53

    
        
            
                DHANENDRA SAHU
                i
                    
                        Assembly Election Result 2013
                        
                            
                                
                                    Party
                                    :
                                    Indian National Congress
                                
                                
                                    Result
                                    :
                                    WON
                                
                                
                                    Margin
                                    :
                                    8354
                                
                            
                        
                    
                
            
        
    


    
        
            
                Indian National Congress
                i
                    
                        Current Assembly Election Result
                        
                            
                                
                                    Leading In
                                    :
                                    0
                                
                                
                                    Won In
                                    :
                                    68
                                
                                
                                    Trailing In
                                    :
                                    0
                                
                            
                        
                    
                
            
        
    

CHANDRASHEKHAR SAHU - CHAMPU BHAIYYA

    
        
            
                Bharatiya Janata Party
                i
                    
                        Current Assembly Election Result
                        
                            
                                
                                    Leading In
                                    :
                                    0
                                
                                
                                    Won In
                                    :
                                    15
                                
                                
                                    Trailing In
                                    :
                                    0
                                
                            
                        
                    
                
            
        
    

23471 
Result Declared
DHANENDRA SAHU
Indian National Congress
8354

I am also including the full Python script I currently use to scrape the table. I have successfully scraped the whole table but am unable to exclude the nested span and table content.

full scraper code here

The output that I am currently getting in CSV format is as follows (just a sample row out of the whole set). In the 3rd column the span tag also gets scraped, as indicated by "iAssembly Election Result".

Abhanpur,53,DHANENDRA SAHUiAssembly Election Result 2013Party:Indian National CongressResult:WONMargin:8354,DHANENDRA SAHU,iAssembly Election Result 2013Party:Indian National CongressResult:WONMargin:8354,Party,:,Indian National Congress,Result,:,WON,Margin,:,8354,Indian National CongressiCurrent Assembly Election ResultLeading In:0Won In:68Trailing In:0,Indian National Congress,iCurrent Assembly Election ResultLeading In:0Won In:68Trailing In:0,Leading In,:,0,Won In,:,68,Trailing In,:,0,CHANDRASHEKHAR SAHU - CHAMPU BHAIYYA,Bharatiya Janata PartyiCurrent Assembly Election ResultLeading In:0Won In:15Trailing In:0,Bharatiya Janata Party,iCurrent Assembly Election ResultLeading In:0Won In:15Trailing In:0,Leading In,:,0,Won In,:,15,Trailing In,:,0,23471                                             ,Result Declared,DHANENDRA SAHU,Indian National Congress,8354,

The expected output is the table scraped without the span tags and their nested tables, for example:

Abhanpur, 53 , DHANENDRA SAHU, Indian National Congress, CHANDRASHEKHAR SAHU - CHAMPU BHAIYYA, Bharatiya Janata Party , 23471, Result Declared

Any help on this would be appreciated. Thanks.

chitown88 · Accepted Answer

This is just my preference, but whenever I see <table> tags, I utilise Pandas to do the parsing, then just manipulate the dataframe as needed. It also allows you to write to file in one line:

import pandas as pd

results_df = pd.DataFrame()
url_list = [1,2,3,4,5,6,7,8]
url = 'http://eciresults.nic.in/Statewises26.htm'

dfs = pd.read_html(url)
df = dfs[0]

idx = df[df[0] == '1\xa02\xa03\xa04\xa05\xa06\xa07\xa08\xa09\xa0Next >>'].index[0]
cols = list(df.iloc[idx-1,:])


df.columns = cols

df = df[df['Const. No.'].notnull()]
df = df.loc[df['Const. No.'].str.isdigit()].reset_index(drop=True)
df = df.dropna(axis=1,how='all')

df['Leading Candidate'] = df['Leading Candidate'].str.split('i',expand=True)[0]
df['Leading Party'] = df['Leading Party'].str.split('iCurrent',expand=True)[0]
df['Trailing Party'] = df['Trailing Party'].str.split('iCurrent',expand=True)[0]
df['Trailing Candidate'] = df['Trailing Candidate'].str.split('iAssembly',expand=True)[0]

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
results_df = pd.concat([results_df, df])

for x in url_list:
    url = 'http://eciresults.nic.in/Statewises26%s.htm' %x
    print ('Processed %s' %url)
    dfs = pd.read_html(url)
    df = dfs[0]

    df.columns = cols

    df = df[df['Const. No.'].notnull()]
    df = df.loc[df['Const. No.'].str.isdigit()].reset_index(drop=True)
    df = df.dropna(axis=1,how='all')

    df['Leading Candidate'] = df['Leading Candidate'].str.split('i',expand=True)[0]
    df['Leading Party'] = df['Leading Party'].str.split('iCurrent',expand=True)[0]
    df['Trailing Party'] = df['Trailing Party'].str.split('iCurrent',expand=True)[0]
    df['Trailing Candidate'] = df['Trailing Candidate'].str.split('iAssembly',expand=True)[0]

    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    results_df = pd.concat([results_df, df]).reset_index(drop=True)

results_df.to_csv('Chhattisgarh_cand.csv', index=False)

Output:

print (df.to_string())
  Constituency Const. No.       Leading Candidate                    Leading Party                    Trailing Candidate            Trailing Party Margin           Status          Winning Candidate             Winning Party Margin
0     Abhanpur         53          DHANENDRA SAHU         Indian National Congress  CHANDRASHEKHAR SAHU - CHAMPU BHAIYYA    Bharatiya Janata Party  23471  Result Declared             DHANENDRA SAHU  Indian National Congress   8354
1      Ahiwara         67        GURU RUDRA KUMAR         Indian National Congress            RAJMAHANT SANWLA RAM DAHRE    Bharatiya Janata Party  31687  Result Declared  RAJMAHNT SANWLA RAM DAHRE    Bharatiya Janata Party  31676
2     Akaltara         33           SAURABH SINGH           Bharatiya Janata Party                            RICHA JOGI       Bahujan Samaj Party   1854  Result Declared             CHUNNILAL SAHU  Indian National Congress  21693
3    Ambikapur         10               T.S. BABA         Indian National Congress                      ANURAG SINGH DEO    Bharatiya Janata Party  39624  Result Declared                   T.S.BABA  Indian National Congress  19558
4     Antagarh         79               ANOOP NAG         Indian National Congress                         VIKRAM USENDI    Bharatiya Janata Party  13414  Result Declared              VIKRAM USENDI    Bharatiya Janata Party   5171
5        Arang         52  DR. SHIVKUMAR DAHARIYA         Indian National Congress                         SANJAY DHIDHI    Bharatiya Janata Party  25077  Result Declared           NAVEEN MARKANDEY    Bharatiya Janata Party  13774
6  Baikunthpur          3        AMBICA SINGH DEO         Indian National Congress                     BHAIYALAL RAJWADE    Bharatiya Janata Party   5339  Result Declared          BHAIYALAL RAJWADE    Bharatiya Janata Party   1069
7  Balodabazar         45     PRAMOD KUMAR SHARMA  Janta Congress Chhattisgarh (J)                       JANAK RAM VERMA  Indian National Congress   2129  Result Declared            JANAK RAM VERMA  Indian National Congress   9977
8        Basna         40  DEVENDRA BAHADUR SINGH         Indian National Congress                        SAMPAT AGRAWAL               Independent  17508  Result Declared        RUPKUMARI CHOUDHARY    Bharatiya Janata Party   6239
9       Bastar         85       BAGHEL LAKHESHWAR         Indian National Congress                    DR. SUBHAU KASHYAP    Bharatiya Janata Party  33471  Result Declared          BAGHEL LAKHESHWAR  Indian National Congress  19168
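If you would rather drop the tooltip content while parsing instead of splitting the strings afterwards, here is a minimal sketch using only the standard library's html.parser. It ignores everything inside a span with the class tooltip-icon (the class name mentioned in the question), including any spans and tables nested in it. The class TooltipSkippingParser, the function visible_cells and the SAMPLE_ROW markup are my own illustration: the sample is a guess at the page structure based on the question, not the real page source, so adjust it to the actual HTML.

```python
from html.parser import HTMLParser


class TooltipSkippingParser(HTMLParser):
    """Collect text fragments, ignoring anything inside a skipped <span>."""

    def __init__(self, skip_class="tooltip-icon"):
        super().__init__()
        self.skip_class = skip_class
        self.span_depth = 0  # > 0 while we are inside a skipped span
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        # Start skipping at a tooltip span; once skipping, count every
        # nested span so that we know which </span> ends the tooltip.
        if self.span_depth or self.skip_class in classes:
            self.span_depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self.span_depth:
            self.span_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.span_depth:
            self.texts.append(text)


def visible_cells(html):
    """Return the non-empty text fragments outside the tooltip spans."""
    parser = TooltipSkippingParser()
    parser.feed(html)
    parser.close()
    return parser.texts


# Hypothetical markup, reconstructed from the sample in the question.
SAMPLE_ROW = """
<tr>
  <td>Abhanpur</td>
  <td>53</td>
  <td>
    <table><tr>
      <td>DHANENDRA SAHU</td>
      <td><span class="tooltip-icon">i
        <span class="tooltip-text">Assembly Election Result 2013
          <table><tr><td>Party</td><td>:</td><td>Indian National Congress</td></tr></table>
        </span>
      </span></td>
    </tr></table>
  </td>
  <td>23471</td>
  <td>Result Declared</td>
</tr>
"""

print(", ".join(visible_cells(SAMPLE_ROW)))
```

For the sample row this keeps Abhanpur, 53, DHANENDRA SAHU, 23471 and Result Declared and drops the tooltip text. The sketch only collects text fragments in document order; it does not rebuild rows and columns, so for the full table you would still need to group the fragments per outer tr.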
