Reputation: 79
I'm trying to convert a scraped HTML table into a dataframe in Python using pandas `read_html`. The problem is that `read_html` brings in a column of my data without breaks, which makes the content of those cells hard to parse. In the original HTML, each "word" in the column is separated by a break. Is there a way to keep this formatting or otherwise keep the "words" separated when converting to a data frame?
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.who.int/en/activities/tracking-SARS-CoV-2-variants/"
html_content = requests.get(url).text
# Parse the HTML content
soup = BeautifulSoup(html_content, "lxml")
voc_html = soup.find("table")
# Convert to a dataframe
voc_df = pd.read_html(str(voc_html))[0]
# Retain the list of variants
voc_list = voc_df["Pango lineages"]
Example from `voc_list` where separate items are smushed together:
voc_list[1]
`B.1.351\xa0B.1.351.2B.1.351.3`
What I would like it to look like: B.1.351 B.1.351.2 B.1.351.3
(or have each item on its own row)
Excerpt from the original HTML, which includes the breaks:
<td style="width:13%;background-color:#69d4ef;text-align:left;vertical-align:middle;">Beta <br/></td><td style="width:12.9865%;background-color:#69d4ef;text-align:left;"><p>B.1.351 <br/>B.1.351.2<br/>B.1.351.3</p></td>
Thanks for any guidance!
Upvotes: 3
Views: 2674
Reputation: 2670
Maybe...
import pandas as pd
import requests
url = r'https://www.who.int/en/activities/tracking-SARS-CoV-2-variants/'
page = requests.get(url)
table = pd.read_html(page.text.replace('<br />',' '))
df = table[0]
print(df)
Outputs:
   WHO label               Pango lineages  GISAID clade  Nextstrain clade  \
0      Alpha                      B.1.1.7           GRY          20I (V1)
1       Beta  B.1.351 B.1.351.2 B.1.351.3    GH/501Y.V2          20H (V2)
2      Gamma              P.1 P.1.1 P.1.2    GR/501Y.V3          20J (V3)
3      Delta          B.1.617.2 AY.1 AY.2     G/478K.V1               21A

   Additional amino acid changes monitored*  Earliest documented samples  \
0                           +S:484K +S:452R     United Kingdom, Sep-2020
1                                   +S:L18F       South Africa, May-2020
2                                   +S:681H             Brazil, Nov-2020
3                                   +S:417N              India, Oct-2020

                Date of designation
0                       18-Dec-2020
1                       18-Dec-2020
2                       11-Jan-2021
3  VOI: 4-Apr-2021 VOC: 11-May-2021
Equally, you could replace the `<br />` with `\n`.
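If you also want each item on its own row, something along these lines might work. It is only a sketch: the helper names (`replace_breaks`, `one_item_per_row`) and the tiny inline table are made up for illustration, and the regex is an assumption meant to catch the `<br>`, `<br/>` and `<br />` spellings, since the exact form depends on how the page is served or re-serialized.

```python
import re
from io import StringIO

import pandas as pd

# Assumption: match <br>, <br/> and <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)


def replace_breaks(html: str, sep: str = " ") -> str:
    """Swap every <br> variant for `sep` before handing the HTML to pandas."""
    return BR_TAG.sub(sep, html)


def one_item_per_row(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Split `column` on whitespace and give each item its own row."""
    out = df.copy()
    # str.split() with no pattern splits on any whitespace run,
    # which includes the non-breaking space (\xa0) seen in the question.
    out[column] = out[column].str.split()
    return out.explode(column, ignore_index=True)


if __name__ == "__main__":
    # Hypothetical stand-in for the scraped page, based on the excerpt
    # in the question.
    sample_html = (
        "<table><tr><th>WHO label</th><th>Pango lineages</th></tr>"
        "<tr><td>Beta <br/></td>"
        "<td><p>B.1.351 <br/>B.1.351.2<br/>B.1.351.3</p></td></tr></table>"
    )
    # read_html needs an HTML parser such as lxml to be installed.
    df = pd.read_html(StringIO(replace_breaks(sample_html)))[0]
    print(one_item_per_row(df, "Pango lineages"))
```

Using a whitespace split rather than splitting on a single space means it should not matter whether the breaks were replaced with a space or a newline.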
Upvotes: 8