Pyd

Reputation: 6159

Pandas read_html not reading text properly

I have the below text:

text = """<table class="table table-striped">\n <thead>\n <tr>\n <th data-field="placement">Placement</th>\n <th data-field="production">Production</th>\n <th data-field="application">Eng.Vol.</th>\n <th data-field="body">Body No</th>\n <th data-field="eng">Eng No</th>\n <th data-field="eng">Notes</th>\n </tr>\n <tr>\n <td data-field="placement">Front Stabilizer</td>\n <td data-field="production">Oct 16~</td>\n <td data-field="application">1.5 L</td>\n <td data-field="body">HRW18</td>\n <td data-field="eng">L15BY</td>\n <td data-field="note" class="">\n Pos:Left/Right </td>\n </tr>\n <tr>\n <td data-field="placement">Front Stabilizer</td>\n <td data-field="production">Oct 16~</td>\n <td data-field="application">1.5 L</td>\n <td data-field="body">HRW18 LHD</td>\n <td data-field="eng">L15BY</td>\n <td data-field="note" class="">\n Pos:Left/Right </td>\n </tr>\n <tr>\n <td data-field="placement">Front Stabilizer</td>\n <td data-field="production">Oct 16~</td>\n <td data-field="application">1.5 L</td>\n <td data-field="body">HRW28</td>\n <td data-field="eng">L15BY</td>\n <td data-field="note" class="">\n Pos:Left/Right </td>\n </tr>\n <tr>\n <td data-field="placement">Front Stabilizer</td>\n <td data-field="production">Oct 16~</td>\n <td data-field="application">2.0 L</td>\n <td data-field="body">HRW38 RHD</td>\n <td data-field="eng">R20A9</td>\n <td data-field="note" class="">\n Pos:Left/Right </td>\n </tr>\n </thead>\n </table>"""

This HTML is properly closed with a </table> tag and has all the required tags, yet pandas does not read it as a table.

code:

pd.read_html(text)

output:

[Empty DataFrame
 Columns: [(Placement, Front Stabilizer, Front Stabilizer, Front Stabilizer, Front Stabilizer), (Production, Oct 16~, Oct 16~, Oct 16~, Oct 16~), (Eng.Vol., 1.5 L, 1.5 L, 1.5 L, 2.0 L), (Body No, HRW18, HRW18 LHD, HRW28, HRW38 RHD), (Eng No, L15BY, L15BY, L15BY, R20A9), (Notes, Pos:Left/Right, Pos:Left/Right, Pos:Left/Right, Pos:Left/Right)]
 Index: []]


Upvotes: 1

Views: 309

Answers (1)

Quang Hoang

Reputation: 150735

All of your table rows are wrapped inside <thead></thead>, so it is understandable that pandas interprets every row as part of the column header. You can recover the values from the columns:

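# All rows sit in <thead>, so the data lands in the column MultiIndex
tmp = pd.read_html(text)[0]
# One row per original column, one column per header level
pd.DataFrame(tmp.columns.to_frame().values)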

Output:

    0           1                 2                 3                 4
--  ----------  ----------------  ----------------  ----------------  ----------------
 0  Placement   Front Stabilizer  Front Stabilizer  Front Stabilizer  Front Stabilizer
 1  Production  Oct 16~           Oct 16~           Oct 16~           Oct 16~
 2  Eng.Vol.    1.5 L             1.5 L             1.5 L             2.0 L
 3  Body No     HRW18             HRW18 LHD         HRW28             HRW38 RHD
 4  Eng No      L15BY             L15BY             L15BY             R20A9
 5  Notes       Pos:Left/Right    Pos:Left/Right    Pos:Left/Right    Pos:Left/Right
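Note that this result is transposed relative to the original table: each original column has become a row. If you want the usual layout, one possible follow-up is to transpose and promote the first row to the header. The sketch below is only an illustration: `header_rows_to_frame` is a hypothetical helper name, and `demo_columns` is a small hand-built stand-in for `tmp.columns` so the snippet runs without an HTML parser installed.

```python
import pandas as pd


def header_rows_to_frame(columns: pd.MultiIndex) -> pd.DataFrame:
    """Rebuild a regular table from a column MultiIndex whose first level
    is the real header and whose remaining levels are the data rows.

    Hypothetical helper, not part of pandas.
    """
    # Same expression as above, then transposed: one row per header level
    rows = pd.DataFrame(columns.to_frame().values).T
    # Everything after the first row is data
    result = rows.iloc[1:].reset_index(drop=True)
    # The first row holds the actual column names
    result.columns = rows.iloc[0].tolist()
    return result


# Hand-built stand-in for tmp.columns (assumption: trimmed to 3 columns, 2 data rows)
demo_columns = pd.MultiIndex.from_tuples(
    [
        ("Placement", "Front Stabilizer", "Front Stabilizer"),
        ("Body No", "HRW18", "HRW18 LHD"),
        ("Eng No", "L15BY", "L15BY"),
    ]
)

df = header_rows_to_frame(demo_columns)
print(df)
```

With the real data this would be called as `header_rows_to_frame(tmp.columns)`. Recent pandas versions also warn when a literal HTML string is passed to `read_html`, so wrapping `text` in `io.StringIO` may be needed there.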

Upvotes: 1
