Reputation: 596
I am trying to collect the tables from https://en.wikipedia.org/wiki/List_of_English_monarchs as follows:
import pandas as pd
url = "https://en.wikipedia.org/wiki/List_of_English_monarchs"
spacer = lambda s: s.replace('\xa0', ' ').replace('[q]', ' ').replace('\u2009',' ')
dfs = pd.read_html(url, attrs={"class": "wikitable"},
                   converters={'Name': spacer,
                               'Birth': spacer,
                               'Marriages': spacer,
                               'Death': spacer})
It works well, except that when a cell contains text separated by <br> tags, no whitespace is inserted between the pieces. For example, the first item in the first column, "Name", is:
'Edward the Elder26 October 899–17 July 924(24 years, 266 days)'
where it should have been
'Edward the Elder 26 October 899 – 17 July 924 (24 years, 266 days)'
The end goal is to be able to extract the dates from that column.
Upvotes: 1
Views: 629
Reputation: 24928
Maybe something like this, replacing the <br /> tags with spaces before pandas parses the HTML:
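from io import StringIO
import pandas as pd
import requests
url = "https://en.wikipedia.org/wiki/List_of_English_monarchs"
kings = requests.get(url)
# replace line-break tags with spaces; StringIO avoids the literal-HTML deprecation in newer pandas
df = pd.read_html(StringIO(kings.text.replace('<br />', ' ')))
# using the first column as an example
print(df[0]['Name'])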
Output:
0 Alfred the Great (King of Wessex from 871) c. 886 – 26 October 899
1 Edward the Elder 26 October 899 – 17 July 924 (24 years, 266 days)
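Since the end goal is to pull the dates out of that column, here is a rough sketch of how the cleaned strings could be handled. It uses only the standard `re` module; the helper names `normalize_breaks` and `extract_dates` are made up for illustration, and the date pattern assumes the "day Month year" format seen in the output above. The break-tag pattern also covers `<br>` and `<br/>`, in case the page markup does not always use `<br />`.

```python
import re

# Matches <br>, <br/> and <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)

# Assumed date format: day, capitalised month name, 3- or 4-digit year,
# e.g. "26 October 899". Approximate dates such as "c. 886" are not matched.
DATE = re.compile(r"\b(\d{1,2}) ([A-Z][a-z]+) (\d{3,4})\b")


def normalize_breaks(html):
    """Replace every line-break tag with a single space (hypothetical helper)."""
    return BR_TAG.sub(" ", html)


def extract_dates(cell):
    """Return all 'day Month year' dates found in a cell (hypothetical helper)."""
    return [" ".join(parts) for parts in DATE.findall(cell)]


cell = "Edward the Elder 26 October 899 – 17 July 924 (24 years, 266 days)"
print(extract_dates(cell))
```

With pandas, the same function could probably be applied to the whole column with something like `df[0]['Name'].apply(extract_dates)`.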
Upvotes: 1