Latent

Reputation: 596

pandas read_html ignores <br> and concatenates strings

I am trying to collect the tables from https://en.wikipedia.org/wiki/List_of_English_monarchs as follows:

 import pandas as pd
 url = "https://en.wikipedia.org/wiki/List_of_English_monarchs"
 spacer = lambda s: s.replace('\xa0', ' ').replace('[q]', ' ').replace('\u2009',' ')

 dfs = pd.read_html(url,attrs={"class":'wikitable'},converters={'Name':spacer,
                                                               'Birth':spacer,
                                                               'Marriages':spacer,
                                                               'Death':spacer})

It works really well, except that when a cell contains <br> text <br>, no whitespace is inserted in place of the <br>. For example, the first item in the first column, "Name", comes out as:

'Edward the Elder26 October 899–17 July 924(24 years, 266 days)'
where it should have been
'Edward the Elder 26 October 899 – 17 July 924 (24 years, 266 days)'

The end goal is to be able to extract the dates from that column.

Upvotes: 1

Views: 629

Answers (1)

Jack Fleeting

Reputation: 24928

Maybe something like this:

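import pandas as pd
import requests
from io import StringIO
kings = requests.get(url)  # url as defined in the question
# swap each <br /> for a space so the cell text is not run together
df = pd.read_html(StringIO(kings.text.replace('<br />', ' ')))
# using the first column as an example
print(df[0]['Name'])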

Output:

0                                               Alfred the Great (King of Wessex from 871) c. 886 – 26 October 899
1                                               Edward the Elder 26 October 899 – 17 July 924 (24 years, 266 days)
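Since the end goal is to pull the dates out of that column, here is a rough sketch of the two steps on a single cell, using only the standard library so it runs without fetching the page. The helper names replace_br and extract_dates, the sample cell string, and both regular expressions are my own assumptions, not anything pandas provides. The tag pattern is meant to cover <br>, <br/> and <br /> in case the markup is not always written the same way, and the date pattern only handles the "day month year" form, so approximate entries such as "c. 886" are skipped.

```python
import re

# Assumption: line breaks appear as <br>, <br/> or <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)

# Assumption: dates are written as "26 October 899" (day, month name, year).
DATE = re.compile(
    r"\b(\d{1,2}) (January|February|March|April|May|June|July|"
    r"August|September|October|November|December) (\d{3,4})\b"
)


def replace_br(html: str) -> str:
    """Return html with every <br> variant replaced by a single space."""
    return BR_TAG.sub(" ", html)


def extract_dates(text: str) -> list:
    """Return every 'day month year' date found in text, in order."""
    return [m.group(0) for m in DATE.finditer(text)]


# Hypothetical cell content, modelled on the row shown in the question.
cell = "Edward the Elder<br>26 October 899 – 17 July 924<br/>(24 years, 266 days)"
cleaned = replace_br(cell)
print(cleaned)
print(extract_dates(cleaned))
```

With the real page you could apply replace_br to kings.text before handing it to pd.read_html, and then something like df[0]['Name'].apply(extract_dates) should give a list of dates per row.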

Upvotes: 1
