Zanam

Reputation: 4807

Pandas read_html unable to read tables

I am using the following code:

import requests, pandas as pd
from bs4 import BeautifulSoup

if __name__ == '__main__':
    url = 'https://www.har.com/homedetail/6408-burgoyne-rd-157-houston-tx-77057/3380601'
    list_of_dataframes = pd.read_html(url)

However, list_of_dataframes contains no school information, even though that information is shown at the bottom of the page at the URL above.

How can I get the following information into a DataFrame?

School                         Stars  Rating
BRIARGROVE Elementary School   4      Good
TANGLEWOOD Middle School       4      Good
WISDOM High School             3      Average

TIA

Upvotes: 2

Views: 275

Answers (1)

baduker

Reputation: 20042

You can't get the school info with pd.read_html because it only extracts <table> elements, and the school section on this page is not a table. It is built from regular divs, so you have to parse the HTML yourself and then load the data into a pd.DataFrame.
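To see the limitation in isolation, here is a minimal sketch that runs offline. The markup is hypothetical (it is not copied from the page): one real table next to a div-based section. It assumes an HTML parser that pandas can use, such as lxml, is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical markup, not taken from the real page: one genuine <table>
# followed by a div-based section similar in spirit to the schools block.
html = """
<table>
  <tr><th>Beds</th><th>Baths</th></tr>
  <tr><td>2</td><td>2</td></tr>
</table>
<div id="SCHOOLS">
  <div class="border_row">BRIARGROVE Elementary School</div>
</div>
"""

# read_html returns one DataFrame per <table> element it finds;
# content inside the divs is never turned into a DataFrame.
tables = pd.read_html(StringIO(html))
print(len(tables))  # only the <table> is picked up
print(tables[0])
```

The div content is silently skipped, which is why the school rows never show up in list_of_dataframes.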

Here's how to do it:

import pandas as pd
import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    url = 'https://www.har.com/homedetail/6408-burgoyne-rd-157-houston-tx-77057/3380601'
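    # Fetch the page and narrow the search to the schools section.
    html = requests.get(url).text
    soup = BeautifulSoup(html, "lxml").find("div", {"id": "SCHOOLS"})
    # Each school is rendered as one div with class "border_row".
    schools = soup.find_all("div", class_="border_row")
    schools_data = []
    for school in schools:
        # The school name is the text of the first link in the row.
        name = school.find("a").get_text()
        # One star image per star; .get avoids a KeyError on images without src.
        stars = len([i for i in school.find_all("img") if "star" in i.get("src", "")])
        # The rating label is the second-to-last whitespace-separated token in the row.
        rating = school.get_text().split()[-2]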
        schools_data.append(
            [
                name,
                stars,
                rating,
            ]
        )
    print(pd.DataFrame(schools_data, columns=["School", "Stars", "Rating"]))

Output:

                         School  Stars   Rating
0  BRIARGROVE Elementary School      4     Good
1      TANGLEWOOD Middle School      4     Good
2            WISDOM High School      3  Average
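If the page layout changes, the code above can fail with an AttributeError because find returns None. A more defensive variant might look like the sketch below. It runs against hypothetical sample markup rather than the live page. The parse_schools helper, the SAMPLE_HTML string, and the "rating" span class are all assumptions made for illustration; only the "SCHOOLS" id and the "border_row" class come from the code above.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page may be structured differently.
SAMPLE_HTML = """
<div id="SCHOOLS">
  <div class="border_row">
    <img src="/img/logo.png">
    <a href="#">BRIARGROVE Elementary School</a>
    <img src="/img/star.png"><img src="/img/star.png">
    <img src="/img/star.png"><img src="/img/star.png">
    <span class="rating">Good</span>
  </div>
  <div class="border_row">
    <a href="#">WISDOM High School</a>
    <img src="/img/star.png"><img src="/img/star.png"><img src="/img/star.png">
    <span class="rating">Average</span>
  </div>
</div>
"""

COLUMNS = ["School", "Stars", "Rating"]


def parse_schools(html: str) -> pd.DataFrame:
    """Build a School/Stars/Rating DataFrame from div-based markup."""
    soup = BeautifulSoup(html, "html.parser")
    section = soup.find("div", id="SCHOOLS")
    if section is None:
        # Section missing: return an empty frame instead of raising.
        return pd.DataFrame(columns=COLUMNS)

    rows = []
    for row in section.find_all("div", class_="border_row"):
        link = row.find("a")
        # "rating" is an assumed class name used only in this sample markup.
        rating = row.find("span", class_="rating")
        if link is None or rating is None:
            continue  # skip rows that do not look like a school entry
        # Count only images whose src mentions "star"; other images are ignored.
        stars = sum("star" in img.get("src", "") for img in row.find_all("img"))
        rows.append([link.get_text(strip=True), stars, rating.get_text(strip=True)])
    return pd.DataFrame(rows, columns=COLUMNS)


df = parse_schools(SAMPLE_HTML)
print(df)
```

To use it on the live page, pass requests.get(url).text to parse_schools and adjust the rating lookup to whatever the real markup provides.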

Upvotes: 4
