Yacer Saoud

Reputation: 47

Extract data without HTML tags

I am trying to scrape a university world ranking website; however, I am having trouble extracting one of the keys without its HTML tags.

I get <div class="td-wrap"> <a href="/universities/massachusetts-institute-technology-mit" class="uni-link">Massachusetts Institute of Technology (MIT) </a></div>

I'd like to get: Massachusetts Institute of Technology (MIT)

Here is how I parse the data:

def parser_page(json):
    # `json` is the decoded response (a dict), not the json module.
    if json:
        items = json.get('data')
        for item in items:
            qsrank = {}
            # For values such as "=5", keep only the part after the "=".
            if "=" in item['rank_display']:
                rk_str = str(item['rank_display']).split('=')[-1]
                qsrank['rank_display'] = rk_str
            else:
                qsrank['rank_display'] = item['rank_display']
            # The title is still wrapped in HTML markup at this point.
            qsrank['title'] = item['title']
            qsrank['region'] = item['region']
            qsrank['score'] = item['score']

            yield qsrank

For more information, here is how the keys are presented:

https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/3740566.txt?1624879808?v=1625562924528

Everything is fine except for the title, as you can see above. I am trying to extract the title text without the tags around it.

Upvotes: 1

Views: 1193

Answers (1)

Andrej Kesely

Reputation: 195573

To get the text from the JSON data, you can parse the HTML fragment stored in the title field with BeautifulSoup. For example:

import json
from bs4 import BeautifulSoup

json_data = r"""{
      "core_id": "624",
      "country": "Italy",
      "city": "Trieste",
      "guide": "",
      "nid": "297237",
      "title": "<div class=\"td-wrap\"><a href=\"\/universities\/university-trieste\" class=\"uni-link\">University of Trieste<\/a><\/div>",
      "logo": "\/sites\/default\/files\/university-of-trieste_624_small.jpg",
      "score": "",
      "rank_display": "651-700",
      "region": "Europe",
      "stars": "",
      "recm": "0--"
    }"""

json_data = json.loads(json_data)
soup = BeautifulSoup(json_data["title"], "html.parser")

print(soup.get_text(strip=True))

Prints:

University of Trieste
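
The same idea can be applied inside the question's parser_page generator. The following is only a minimal sketch under a few assumptions: strip_tags and _TextExtractor are hypothetical helper names, the parameter is renamed to payload so it does not shadow the json module, and the standard library's html.parser is swapped in for BeautifulSoup so the example runs without extra packages. With bs4 installed, the body of strip_tags could instead be the soup.get_text(strip=True) call shown above.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are ignored.
        self.parts.append(data)


def strip_tags(fragment):
    """Return the text content of an HTML fragment, without surrounding whitespace."""
    extractor = _TextExtractor()
    extractor.feed(fragment)
    extractor.close()  # flush any buffered text
    return "".join(extractor.parts).strip()


def parser_page(payload):
    """Yield one record per entry in payload['data'], with the title cleaned."""
    if not payload:
        return
    for item in payload.get("data", []):
        yield {
            # For values such as "=5", keep only the part after the "=".
            "rank_display": str(item["rank_display"]).split("=")[-1],
            "title": strip_tags(item["title"]),
            "region": item["region"],
            "score": item["score"],
        }


# Sample entry taken from the answer's JSON, already decoded into a dict.
sample = {
    "data": [
        {
            "title": '<div class="td-wrap"><a href="/universities/university-trieste" '
                     'class="uni-link">University of Trieste</a></div>',
            "score": "",
            "rank_display": "651-700",
            "region": "Europe",
        }
    ]
}

for record in parser_page(sample):
    print(record["title"])  # prints: University of Trieste
```

Keeping the tag stripping in its own small function means the parsing backend can be changed later without touching the loop that builds the records.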

Upvotes: 1
