Reputation: 127
I have this code that tries to parse search results from a grant website (the URL is in the code; I can't post the link until my rep is higher). I want the "Year" and "Award Amount" values that come after the <b> tags and before the <br> tags.
Two questions:
1) Why is this only returning the 1st table?
2) Is there any way I can get the text inside the <b> tags (i.e. the "Year" and "Award Amount" strings) and the text after them (i.e. the actual values such as 2015 and $100000)?
Specifically:
<td valign="top">
<b>Year: </b>2014<br>
<b>Award Amount: </b>$84,907 </td>
Here is my script:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'http://www.ned.org/wp-content/themes/ned/search/grant-search.php?' \
      'organizationName=&region=ASIA&projectCountry=China&amount=&fromDate=&toDate=&' \
      'projectFocus%5B%5D=&search=&maxCount=25&orderBy=Year&start=1&sbmt=1'

r = requests.get(url)
html_content = r.text
soup = BeautifulSoup(html_content, "html.parser")
tables = soup.find_all('table')

data = {
    'col_names': [],
    'info': [],
    'year_amount': []
}

index = 0
for table in tables:
    rows = table.find_all('tr')[1:]
    for row in rows:
        cols = row.find_all('td')
        data['col_names'].append(cols[0].get_text())
        data['info'].append(cols[1].get_text())
        try:
            data['year_amount'].append(cols[2].get_text())
        except IndexError:
            data['year_amount'].append(None)
    grant_df = pd.DataFrame(data)
    index += 1
    filename = 'grant ' + str(index) + '.csv'
    grant_df.to_csv(filename)
Upvotes: 3
Views: 535
Reputation: 49832
I would suggest approaching the table parsing in a different manner. All of the information is available in the first row of each table. So you can parse the text of the row like:
text = '\n'.join([x.strip() for x in rows[0].get_text().split('\n')
                  if x.strip()]).replace(':\n', ': ')
data_dict = {k.strip(): v.strip() for k, v in
             [x.split(':', 1) for x in text.split('\n')]}
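This takes the text and:
1. splits it on newlines,
2. removes any blank lines,
3. strips leading and trailing whitespace from each line,
4. joins the lines back into a single string, and
5. joins any line ending in : with the next line.
Then it:
1. splits the text by newline again,
2. splits each line at the first :,
3. strips whitespace from both sides of the split, and
4. uses the two parts as the key and value of a dict.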
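To see what those two expressions do without hitting the website, here is a minimal sketch that runs them on a hard-coded string. The sample text, including the "Example Org" value, is made up for illustration and only assumes the row text looks roughly like what rows[0].get_text() returns.

```python
# Made-up stand-in for rows[0].get_text(); the real page layout may differ.
raw = """
   Organization Name:
   Example Org

   Year: 2014
   Award Amount: $84,907
"""

# Drop blank lines, trim each line, then glue every "key:" line to the
# value on the line that follows it.
text = '\n'.join([x.strip() for x in raw.split('\n')
                  if x.strip()]).replace(':\n', ': ')

# Split each line at its first colon and use the halves as key and value.
data_dict = {k.strip(): v.strip() for k, v in
             [x.split(':', 1) for x in text.split('\n')]}

print(data_dict)
# {'Organization Name': 'Example Org', 'Year': '2014', 'Award Amount': '$84,907'}
```

Note that a line with no colon at all would make the k, v unpacking raise a ValueError, so this approach relies on every remaining line being a "key: value" pair.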
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'http://www.ned.org/wp-content/themes/ned/search/grant-search.php?' \
      'organizationName=&region=ASIA&projectCountry=China&amount=&' \
      'fromDate=&toDate=&projectFocus%5B%5D=&search=&maxCount=25&' \
      'orderBy=Year&start=1&sbmt=1'

r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

data = []
for table in soup.find_all('table'):
    rows = table.find_all('tr')
    # Collapse the first row's text into "key: value" lines
    text = '\n'.join([x.strip() for x in rows[0].get_text().split('\n')
                      if x.strip()]).replace(':\n', ': ')
    # Split each line at the first colon to build one record per table
    data_dict = {k.strip(): v.strip() for k, v in
                 [x.split(':', 1) for x in text.split('\n')]}
    # Keep only tables that actually describe a grant
    if data_dict.get('Award Amount'):
        data.append(data_dict)

grant_df = pd.DataFrame(data)
print(grant_df.head())
   Award Amount                                        Description  \
0       $84,907  To strengthen the capacity of China's rights d...
1      $204,973  To provide an effective forum for free express...
2       $48,000  To promote religious freedom in China. The org...
3       $89,000  To educate and train civil society activists o...
4       $65,000  To encourage greater public discussion, transp...

            Organization Name  Project Country                Project Focus  \
0                         NaN   Mainland China                  Rule of Law
1  Princeton China Initiative   Mainland China       Freedom of Information
2                         NaN   Mainland China                  Rule of Law
3                         NaN   Mainland China  Democratic Ideas and Values
4                         NaN   Mainland China                  Rule of Law

   Project Region                                      Project Title  Year
0            Asia             Empowering the Chinese Legal Community  2014
1            Asia  Supporting Free Expression and Open Debate for...  2014
2            Asia  Religious Freedom, Rights Defense and Rule of ...  2014
3            Asia     Education on Civil Society and Democratization  2014
4            Asia        Promoting Democratic Policy Change in China  2014
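For the second question specifically, a different route (not the approach used above) is to walk the <b> tags in a cell and read the text node that follows each one through BeautifulSoup's next_sibling. The sketch below is only an illustration: it runs on the snippet from the question rather than the live page, and the name fields is hypothetical.

```python
from bs4 import BeautifulSoup

# The cell from the question, hard-coded so the example needs no network.
html = """<td valign="top">
<b>Year: </b>2014<br>
<b>Award Amount: </b>$84,907 </td>"""

soup = BeautifulSoup(html, "html.parser")
cell = soup.find("td")

fields = {}
for b in cell.find_all("b"):
    # The label is the text inside <b>, minus the trailing colon.
    label = b.get_text().strip().rstrip(":")
    # The value is the bare text node that sits right after </b>.
    value = b.next_sibling
    fields[label] = value.strip() if isinstance(value, str) else None

print(fields)
# {'Year': '2014', 'Award Amount': '$84,907'}
```

This assumes each value is a plain text node directly after its <b> tag; if another tag came first, next_sibling would return that tag and the value would be recorded as None.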
Upvotes: 1