Reputation: 415
Simple enough question, but I'm guessing the answer is "No":
I have a HTML table that I'm reading in with pandas.read_html just fine. But some cells (columns) have, say, images in them, or lists, or other formatting that obviously read_html throws away. I obviously don't expect pandas to parse any of that, but is there any way to get it to return the raw HTML as, say, a string in the cell of the DataFrame so I can parse it on my own?
EXAMPLE:
<table>
<th>Column 1</th>
<th>Column 2</th>
<tr>
<td>Cell1</td>
<td>Cell2 <img src="http://www.link.com/image.jpg /></td>
</tr>
<tr>
<td>Cell3</td>
<td>Cell4 <img src="http://www.website.com/picture.gif /></td>
</tr>
</table>
If Pandas was to parse this, I'd probably just get "Cell2" and "Cell4" from column2. What I'd like to do is somehow get the entire contents of the cell, including the [raw?] HTML contents of the <img>
tag. I can then parse them on my own.
Upvotes: 6
Views: 2576
Reputation: 19259
There are no options for the pd.read_html
function that do what you want. So when I tried to get a list of business names from the California Department of State website, I get everything except the name when Pandas automatically parses the HTML:
>>> bizname = 'poss'
>>> url = f'https://businesssearch.sos.ca.gov/CBS/SearchResults?filing=&SearchType=CORP&SearchCriteria={bizname}&SearchSubType=Begins'
>>> df = pd.read_html(url)[0]
>>> df
Entity Number Registration Date Status Entity Name Jurisdiction Agent for Service of Process
0 C2645412 04/02/2004 ACTIVE View details for entity number 02645412 POSSU... GEORGIA ERESIDENTAGENT, INC. (C2702827)
1 C0786330 09/22/1976 DISSOLVED View details for entity number 00786330 POSSU... CALIFORNIA I. HALPERN
2 C2334141 03/01/2001 FTB SUSPENDED View details for entity number 02334141 POSSU... CALIFORNIA CLAIR G BURRILL
3 C0658630 11/08/1972 FTB SUSPENDED View details for entity number 00658630 POSSU... CALIFORNIA NaN
4 C1713121 09/23/1992 FTB SUSPENDED View details for entity number 01713121 POSSU... CALIFORNIA LAWRENCE J. TURNER
5 C1207820 08/05/1983 DISSOLVED View details for entity number 01207820 POSSU... CALIFORNIA R L CARL
6 C3921531 06/27/2016 ACTIVE View details for entity number 03921531 POSSU... CALIFORNIA REGISTERED AGENTS INC (C3365816)
The website hides business names behind a button.
But you can use requests
to download the raw html.
Then you can use bs4
to extract the raw HTML table as well as any particular row (<tr>
) or cell (<td>
) that you want.
>>> soup = bs4.BeautifulSoup(requests.get(url).text)
>>> table = soup.find('table').findAll('tr')
>>> names = []
... for row in table:
... names.append(getattr(row.find('button'), 'contents', [''])[0].strip())
>>> names
['',
'POSSUM FILMS, INC',
'POSSUM INC.',
'POSSUM MEDIA, INC.',
'POSSUM POINT PRODUCTIONS, INC.',
'POSSUM PRODUCTIONS, INC.',
'POSSUM-BILITY EXPRESS, INCORPORATED',
]
>>> df['Entity Name'] = names[1:]
>>> df['Entity Name'] = names[1:]
>>> df
Entity Number Registration Date Status Entity Name Jurisdiction Agent for Service of Process
0 C2645412 04/02/2004 ACTIVE POSSUM FILMS, INC GEORGIA ERESIDENTAGENT, INC. (C2702827)
1 C0786330 09/22/1976 DISSOLVED POSSUM INC. CALIFORNIA I. HALPERN
2 C2334141 03/01/2001 FTB SUSPENDED POSSUM MEDIA, INC. CALIFORNIA CLAIR G BURRILL
3 C0658630 11/08/1972 FTB SUSPENDED POSSUM POINT PRODUCTIONS, INC. CALIFORNIA NaN
4 C1713121 09/23/1992 FTB SUSPENDED POSSUM PRODUCTIONS, INC. CALIFORNIA LAWRENCE J. TURNER
5 C1207820 08/05/1983 DISSOLVED POSSUM-BILITY EXPRESS, INCORPORATED CALIFORNIA R L CARL
6 C3921531 06/27/2016 ACTIVE POSSUMS WELCOME CALIFORNIA REGISTERED AGENTS INC (C33658
Doing it this way doesn't process the header correctly, so don't forget to ignore the first row, if you need to.
Upvotes: 3