Reputation: 33
I wish to scrape a table from a specific webpage. The issue is that some of the table's td contains a nested span tag containing another nested table.
The webpage that I want to scrape from is the following Click here .
I have included a small sample of the table that I want to scrape with the nested table contained within span tag with a class tooltip-icon. How can I exclude the contents inside these specific span tags when scraping the whole table
<tr style="font-size:12px;">
<td align="left">Abhanpur</td>
<td align="center">53</td>
<td align="left">
<table>
<tbody>
<tr>
<td>DHANENDRA SAHU</td>
<td style="vertical-align:top"><span class="tooltip-icon" style="display:block">i</span>
<div class="tooltip">
<h3>Assembly Election Result 2013</h3>
<table>
<tbody>
<tr>
<td>Party</td>
<td>:</td>
<td>Indian National Congress</td>
</tr>
<tr>
<td>Result</td>
<td>:</td>
<td>WON</td>
</tr>
<tr>
<td>Margin</td>
<td>:</td>
<td>8354</td>
</tr>
</tbody>
</table>
</div>
</td>
</tr>
</tbody>
</table>
</td>
<td align="left">
<table>
<tbody>
<tr>
<td>Indian National Congress</td>
<td style="vertical-align:top"><span class="tooltip-icon" style="display:block">i</span>
<div class="tooltip">
<h3>Current Assembly Election Result</h3>
<table>
<tbody>
<tr>
<td>Leading In</td>
<td>:</td>
<td>0</td>
</tr>
<tr>
<td>Won In</td>
<td>:</td>
<td>68</td>
</tr>
<tr>
<td>Trailing In</td>
<td>:</td>
<td>0</td>
</tr>
</tbody>
</table>
</div>
</td>
</tr>
</tbody>
</table>
</td>
<td align="left">CHANDRASHEKHAR SAHU - CHAMPU BHAIYYA</td>
<td align="left">
<table>
<tbody>
<tr>
<td>Bharatiya Janata Party</td>
<td style="vertical-align:top"><span class="tooltip-icon" style="display:block">i</span>
<div class="tooltip">
<h3>Current Assembly Election Result</h3>
<table>
<tbody>
<tr>
<td>Leading In</td>
<td>:</td>
<td>0</td>
</tr>
<tr>
<td>Won In</td>
<td>:</td>
<td>15</td>
</tr>
<tr>
<td>Trailing In</td>
<td>:</td>
<td>0</td>
</tr>
</tbody>
</table>
</div>
</td>
</tr>
</tbody>
</table>
</td>
<td align="right">23471 </td>
<td align="center">Result Declared</td>
<td align="center" style="background-color: lightgray;">DHANENDRA SAHU</td>
<td align="center" style="background-color: lightgray;">Indian National Congress</td>
<td align="center" style="background-color: lightgray;">8354</td>
I am also including the full python script I currently use to scrape the table. I have successfully scrape the whole table but unable to exclude the nested span and table content.
The Out put that I am currently getting in a csv format is as follows (Just the sample row out of the whole set). In the 3rd column the span tag also gets scraped as indicated by "iAssembly Election Result"
Abhanpur,53,DHANENDRA SAHUiAssembly Election Result 2013Party:Indian National CongressResult:WONMargin:8354,DHANENDRA SAHU,iAssembly Election Result 2013Party:Indian National CongressResult:WONMargin:8354,Party,:,Indian National Congress,Result,:,WON,Margin,:,8354,Indian National CongressiCurrent Assembly Election ResultLeading In:0Won In:68Trailing In:0,Indian National Congress,iCurrent Assembly Election ResultLeading In:0Won In:68Trailing In:0,Leading In,:,0,Won In,:,68,Trailing In,:,0,CHANDRASHEKHAR SAHU - CHAMPU BHAIYYA,Bharatiya Janata PartyiCurrent Assembly Election ResultLeading In:0Won In:15Trailing In:0,Bharatiya Janata Party,iCurrent Assembly Election ResultLeading In:0Won In:15Trailing In:0,Leading In,:,0,Won In,:,15,Trailing In,:,0,23471 ,Result Declared,DHANENDRA SAHU,Indian National Congress,8354,
The expected out put is to scrape the table excluding the span tags and its nested tables. for example
Abhanpur, 53 , DHANENDRA SAHU, Indian National Congress, CHANDRASHEKHAR SAHU - CHAMPU BHAIYYA, Bharatiya Janata Party , 23471, Result Declared
Any help on this would be very helpful. thanks.
Upvotes: 1
Views: 357
Reputation: 28565
This is just my preference, but whenever I see <table>
tags, I utilise Pandas to do the parsing, then just manipulate the dataframe as needed. It also allows you to write to file in one line:
import pandas as pd
results_df = pd.DataFrame()
url_list = [1,2,3,4,5,6,7,8]
url = 'http://eciresults.nic.in/Statewises26.htm'
dfs = pd.read_html(url)
df = dfs[0]
idx = df[df[0] == '1\xa02\xa03\xa04\xa05\xa06\xa07\xa08\xa09\xa0Next >>'].index[0]
cols = list(df.iloc[idx-1,:])
df.columns = cols
df = df[df['Const. No.'].notnull()]
df = df.loc[df['Const. No.'].str.isdigit()].reset_index(drop=True)
df = df.dropna(axis=1,how='all')
df['Leading Candidate'] = df['Leading Candidate'].str.split('i',expand=True)[0]
df['Leading Party'] = df['Leading Party'].str.split('iCurrent',expand=True)[0]
df['Trailing Party'] = df['Trailing Party'].str.split('iCurrent',expand=True)[0]
df['Trailing Candidate'] = df['Trailing Candidate'].str.split('iAssembly',expand=True)[0]
results_df = results_df.append(df)
for x in url_list:
url = 'http://eciresults.nic.in/Statewises26%s.htm' %x
print ('Processed %s' %url)
dfs = pd.read_html(url)
df = dfs[0]
df.columns = cols
df = df[df['Const. No.'].notnull()]
df = df.loc[df['Const. No.'].str.isdigit()].reset_index(drop=True)
df = df.dropna(axis=1,how='all')
df['Leading Candidate'] = df['Leading Candidate'].str.split('i',expand=True)[0]
df['Leading Party'] = df['Leading Party'].str.split('iCurrent',expand=True)[0]
df['Trailing Party'] = df['Trailing Party'].str.split('iCurrent',expand=True)[0]
df['Trailing Candidate'] = df['Trailing Candidate'].str.split('iAssembly',expand=True)[0]
results_df = results_df.append(df).reset_index(drop=True)
results_df.to_csv('Chhattisgarh_cand.csv', index=False)
Output:
print (df.to_string())
Constituency Const. No. Leading Candidate Leading Party Trailing Candidate Trailing Party Margin Status Winning Candidate Winning Party Margin
0 Abhanpur 53 DHANENDRA SAHU Indian National Congress CHANDRASHEKHAR SAHU - CHAMPU BHAIYYA Bharatiya Janata Party 23471 Result Declared DHANENDRA SAHU Indian National Congress 8354
1 Ahiwara 67 GURU RUDRA KUMAR Indian National Congress RAJMAHANT SANWLA RAM DAHRE Bharatiya Janata Party 31687 Result Declared RAJMAHNT SANWLA RAM DAHRE Bharatiya Janata Party 31676
2 Akaltara 33 SAURABH SINGH Bharatiya Janata Party RICHA JOGI Bahujan Samaj Party 1854 Result Declared CHUNNILAL SAHU Indian National Congress 21693
3 Ambikapur 10 T.S. BABA Indian National Congress ANURAG SINGH DEO Bharatiya Janata Party 39624 Result Declared T.S.BABA Indian National Congress 19558
4 Antagarh 79 ANOOP NAG Indian National Congress VIKRAM USENDI Bharatiya Janata Party 13414 Result Declared VIKRAM USENDI Bharatiya Janata Party 5171
5 Arang 52 DR. SHIVKUMAR DAHARIYA Indian National Congress SANJAY DHIDHI Bharatiya Janata Party 25077 Result Declared NAVEEN MARKANDEY Bharatiya Janata Party 13774
6 Baikunthpur 3 AMBICA SINGH DEO Indian National Congress BHAIYALAL RAJWADE Bharatiya Janata Party 5339 Result Declared BHAIYALAL RAJWADE Bharatiya Janata Party 1069
7 Balodabazar 45 PRAMOD KUMAR SHARMA Janta Congress Chhattisgarh (J) JANAK RAM VERMA Indian National Congress 2129 Result Declared JANAK RAM VERMA Indian National Congress 9977
8 Basna 40 DEVENDRA BAHADUR SINGH Indian National Congress SAMPAT AGRAWAL Independent 17508 Result Declared RUPKUMARI CHOUDHARY Bharatiya Janata Party 6239
9 Bastar 85 BAGHEL LAKHESHWAR Indian National Congress DR. SUBHAU KASHYAP Bharatiya Janata Party 33471 Result Declared BAGHEL LAKHESHWAR Indian National Congress 19168
Upvotes: 1
Reputation: 24930
You can do it with pandas, using this:
import pandas as pd
page = pd.read_html('http://eciresults.nic.in/Statewises26.htm')
my_table = page[5]
That gets you a pandas dataframe containing the table you're interested in, I believe. If you try:
my_table.iloc[[7]]
The output is:
7 Abhanpur 53 DHANENDRA SAHUiAssembly Election Result 2013Pa... Indian National CongressiCurrent Assembly Elec... CHANDRASHEKHAR SAHU - CHAMPU BHAIYYA Bharatiya Janata PartyiCurrent Assembly Electi... 23471 Result Declared DHANENDRA SAHU Indian National Congress 8354 NaN NaN
If that's what you are after, you can clean up your table using standard pandas methods.
Upvotes: 1