Reputation: 45
I am trying to pull the tables off this site. When I load the URL with pd.read_html I get back a list of data frames as expected, but the HTML tags inside the table cells are gone. Is there any way to extract the tables with pandas and keep the HTML that is in the table cells?
import pandas as pd
df = pd.read_html('http://geppopotamus.info/game/tekken7fr/asuka/data.htm#page_top')
I want the cell to be this
<span class="tooltip" title="すいけい">翠勁
<sup>ヨミ</sup></span><br>
<img src="../lp.bmp" class="c">/上
but I get this
翠勁 ヨミ /上
I have used Beautiful Soup to parse the HTML and then passed the data to pandas, but it still strips out the inner HTML.
Upvotes: 1
Views: 1418
Reputation: 84465
pandas read_html has already parsed the HTML by the time you get the data frames back, so only the cell text is left. As mentioned in the comments, consider using BeautifulSoup instead. The following extracts the HTML of every table tag. You can adjust the CSS selector as required.
import requests
from bs4 import BeautifulSoup
url = 'http://geppopotamus.info/game/tekken7fr/asuka/data.htm#page_top'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'lxml')
# Keep each <table> element as a string of raw HTML, tags included
tables = [str(table) for table in soup.select('table')]
print(tables)
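If you still want data frames rather than raw table strings, one option is to walk the rows yourself and store each cell's inner HTML. The following is only a sketch: table_to_frame and sample_html are names made up for illustration, the sample markup is adapted from the snippet in the question rather than fetched from the site, and it assumes simple tables with no rowspan/colspan or nested tables.

```python
import pandas as pd
from bs4 import BeautifulSoup


def table_to_frame(table):
    """Hypothetical helper: build a DataFrame whose cells hold inner HTML strings."""
    rows = []
    for tr in table.find_all("tr"):
        # decode_contents() returns the markup inside a tag, without the tag itself
        cells = [
            cell.decode_contents().strip()
            for cell in tr.find_all(["td", "th"], recursive=False)
        ]
        if cells:
            rows.append(cells)
    return pd.DataFrame(rows)


# Assumed sample input, adapted from the cell shown in the question
sample_html = """
<table>
  <tr><th>Move</th><th>Hit level</th></tr>
  <tr>
    <td><span class="tooltip" title="すいけい">翠勁<sup>ヨミ</sup></span><br></td>
    <td><img src="../lp.bmp" class="c">/上</td>
  </tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
frames = [table_to_frame(table) for table in soup.select("table")]
print(frames[0])
```

For the real page you would build soup from res.content as above instead of sample_html. The header row ends up as the first data row here; promote it to column names yourself if you need that.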
Upvotes: 1