Reputation: 11
I am scraping the list of US presidents using Beautiful Soup and urllib.request. I want to scrape both dates for each president, the start and the end of the presidency, but for some reason my code raises a "list index out of range" error. Here is the link so you can see the page I am working with: https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States
from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq
my_url = 'https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = BeautifulSoup(page_html , 'html.parser' )
containers = page_soup.find_all('table' , class_ = 'wikitable')
#print(containers[0])
#print(len(containers))
#print(soup.prettify(containers[0]))
container = containers[0]
date =container.find_all('span' , attrs = {'class': 'date'})
#print(len(date))
#print(date[0].text)
for container in containers:
    date_container = container.find_all('span', attrs={'class': 'date'})
    print(date_container[0].text)
Upvotes: 0
Views: 125
Reputation: 28565
Since the data is in <table> tags, have you considered using pandas' .read_html()? It can use BeautifulSoup under the hood, takes a lot of the work out, and puts each table straight into a dataframe for you. The only work left is any manipulation, cleanup, or filtering:
import pandas as pd
import re
my_url = 'https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States'
# Returns a list of dataframes
dfs = pd.read_html(my_url)
# Get the specific dataframe with the desired columns
df = dfs[1].iloc[:, [1, 3]].copy()  # copy so the column assignments below do not act on a slice
# Rename the columns
df.columns = ['Date','Name']
# Split the date column into start and end dates and drop the date column
df[['Start','End']] = df.Date.str.split('–', expand=True)
df = df.drop('Date',axis=1)
# Clean up the name column using regex to pull out the name
df['Name'] = [re.match(r'.+?(?=\d)', x)[0].strip().split('Born')[0] for x in df['Name']]
# Drop duplicate rows
df.drop_duplicates(inplace = True)
print (df)
Output:
print (df.to_string())
Name Start End
0 George Washington April 30, 1789[d] March 4, 1797
1 John Adams March 4, 1797 March 4, 1801
2 Thomas Jefferson March 4, 1801 March 4, 1809
3 James Madison March 4, 1809 March 4, 1817
4 James Monroe March 4, 1817 March 4, 1825
5 John Quincy Adams March 4, 1825 March 4, 1829
6 Andrew Jackson March 4, 1829 March 4, 1837
7 Martin Van Buren March 4, 1837 March 4, 1841
8 William Henry Harrison March 4, 1841 April 4, 1841(Died in office)
9 John Tyler April 4, 1841[i] March 4, 1845
10 James K. Polk March 4, 1845 March 4, 1849
11 Zachary Taylor March 4, 1849 July 9, 1850(Died in office)
12 Millard Fillmore July 9, 1850[k] March 4, 1853
13 Franklin Pierce March 4, 1853 March 4, 1857
14 James Buchanan March 4, 1857 March 4, 1861
15 Abraham Lincoln March 4, 1861 April 15, 1865(Assassinated)
16 Andrew Johnson April 15, 1865 March 4, 1869
17 Ulysses S. Grant March 4, 1869 March 4, 1877
18 Rutherford B. Hayes March 4, 1877 March 4, 1881
19 James A. Garfield March 4, 1881 September 19, 1881(Assassinated)
20 Chester A. Arthur September 19, 1881[n] March 4, 1885
21 Grover Cleveland March 4, 1885 March 4, 1889
22 Benjamin Harrison March 4, 1889 March 4, 1893
23 Grover Cleveland March 4, 1893 March 4, 1897
24 William McKinley March 4, 1897 September 14, 1901(Assassinated)
25 Theodore Roosevelt September 14, 1901 March 4, 1909
26 William Howard Taft March 4, 1909 March 4, 1913
27 Woodrow Wilson March 4, 1913 March 4, 1921
28 Warren G. Harding March 4, 1921 August 2, 1923(Died in office)
29 Calvin Coolidge August 2, 1923[o] March 4, 1929
30 Herbert Hoover March 4, 1929 March 4, 1933
31 Franklin D. Roosevelt March 4, 1933 April 12, 1945(Died in office)
32 Harry S. Truman April 12, 1945 January 20, 1953
33 Dwight D. Eisenhower January 20, 1953 January 20, 1961
34 John F. Kennedy January 20, 1961 November 22, 1963(Assassinated)
35 Lyndon B. Johnson November 22, 1963 January 20, 1969
36 Richard Nixon January 20, 1969 August 9, 1974(Resigned)
37 Gerald Ford August 9, 1974 January 20, 1977
38 Jimmy Carter January 20, 1977 January 20, 1981
39 Ronald Reagan January 20, 1981 January 20, 1989
40 George H. W. Bush January 20, 1989 January 20, 1993
41 Bill Clinton January 20, 1993 January 20, 2001
42 George W. Bush January 20, 2001 January 20, 2009
43 Barack Obama January 20, 2009 January 20, 2017
44 Donald Trump January 20, 2017 Incumbent
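Some of the Start and End values in the output above still carry footnote markers such as "[d]" or notes such as "(Died in office)". A minimal sketch of one way to strip them follows; the helper name clean_date and the sample list are hypothetical and not part of the answer's code, and the pattern assumes the notes always appear in square brackets or parentheses.

```python
import re

# Hypothetical helper: removes bracketed footnote markers like "[d]"
# and parenthetical notes like "(Died in office)" from a date string.
_NOTE = re.compile(r"\[[^\]]*\]|\([^)]*\)")

def clean_date(value):
    """Return the value with bracketed and parenthetical notes removed."""
    return _NOTE.sub("", value).strip()

# Sample values copied from the output above.
samples = [
    "April 30, 1789[d]",
    "April 4, 1841(Died in office)",
    "March 4, 1797",
    "Incumbent",
]
cleaned = [clean_date(s) for s in samples]
print(cleaned)
```

With the dataframe from the answer, something like df['Start'].map(clean_date) should apply the same cleanup to a whole column, assuming every value in the column is a string.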
Upvotes: 0
Reputation: 2806
The find_all function can return an empty list, and indexing an empty list with [0] raises the "list index out of range" error.
You can avoid the indexing by collecting every date instead:
all_dates = []
for container in containers:
    date_container = container.find_all('span', attrs={'class': 'date'})
    all_dates.extend([date.text for date in date_container])
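If you do want only the first date of each table, you can guard the [0] access instead. The sketch below is only an illustration: first_text is a hypothetical helper, and the SimpleNamespace objects are stand-ins for the tags that find_all returns, used so the snippet runs without downloading the page.

```python
from types import SimpleNamespace

def first_text(tags, default=None):
    """Return the text of the first tag, or default if the list is empty."""
    return tags[0].text if tags else default

# Stand-ins for find_all results: one table with date spans, one without.
with_dates = [
    SimpleNamespace(text="April 30, 1789"),
    SimpleNamespace(text="March 4, 1797"),
]
no_dates = []

print(first_text(with_dates))  # first date of the table
print(first_text(no_dates))    # None instead of an IndexError
```

In the original loop this would be print(first_text(date_container)), so tables with no date spans print None rather than stopping the script.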
Upvotes: 1
Reputation: 816
Your code already stores the first "wikitable" table in container, so you can collect the text of all its date spans with a list comprehension:
date = [x.text for x in container.find_all('span' , attrs = {'class': 'date'})]
print(date)
Which will print:
['April 30, 1789', 'March 4, 1797', 'March 4, 1797', 'March 4, 1801', 'March 4, 1801'...
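The printed list alternates between start and end dates, so one way to get a (start, end) pair per presidency is to zip the even and odd positions. This is a sketch under the assumption that every row contributes exactly two date spans; a row with only one (for example an incumbent with no end date) would shift the pairing, so check the result against the page. The sample list reuses the first values printed above, and the name terms is hypothetical.

```python
# First few values from the printed list above.
dates = ['April 30, 1789', 'March 4, 1797', 'March 4, 1797', 'March 4, 1801']

# Even positions are start dates, odd positions are end dates.
# zip stops at the shorter slice, so a trailing unpaired date is dropped.
terms = list(zip(dates[0::2], dates[1::2]))

for start, end in terms:
    print(f"{start} to {end}")
```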
Upvotes: 0