Reputation: 47
I am trying to scrape a university world ranking website; however, I am having trouble extracting one of the keys without its HTML tags.
I get <div class="td-wrap"> <a href="/universities/massachusetts-institute-technology-mit" class="uni-link">Massachusetts Institute of Technology (MIT) </a></div>
I'd like to get: Massachusetts Institute of Technology (MIT)
Here is how I parse the data:
def parser_page(json):
    if json:
        items = json.get('data')
        for i in range(len(items)):
            item = items[i]
            qsrank = {}
            if "=" in item['rank_display']:
                rk_str = str(item['rank_display']).split('=')[-1]
                qsrank['rank_display'] = rk_str
            else:
                qsrank['rank_display'] = item['rank_display']
            qsrank['title'] = item['title']
            qsrank['region'] = item['region']
            qsrank['score'] = item['score']
            yield qsrank
For more information, here is how the keys are presented:
https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/3740566.txt?1624879808?v=1625562924528
Everything is fine except the title: as you can see above, it comes wrapped in HTML, and I am trying to extract the text without the tags around it.
Upvotes: 1
Views: 1193
Reputation: 195573
To get the text out of the HTML stored in the JSON, you can use BeautifulSoup. For example:
import json
from bs4 import BeautifulSoup

# One record from the ranking data; the "title" value is an HTML fragment.
json_data = r"""{
    "core_id": "624",
    "country": "Italy",
    "city": "Trieste",
    "guide": "",
    "nid": "297237",
    "title": "<div class=\"td-wrap\"><a href=\"\/universities\/university-trieste\" class=\"uni-link\">University of Trieste<\/a><\/div>",
    "logo": "\/sites\/default\/files\/university-of-trieste_624_small.jpg",
    "score": "",
    "rank_display": "651-700",
    "region": "Europe",
    "stars": "",
    "recm": "0--"
}"""

# Decode the JSON, then parse the title's HTML and keep only its text.
json_data = json.loads(json_data)
soup = BeautifulSoup(json_data["title"], "html.parser")
print(soup.get_text(strip=True))
Prints:
University of Trieste
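The same idea can be applied inside the generator from the question. The following is only a sketch under a few assumptions: it uses the standard library's html.parser in place of BeautifulSoup so it runs without extra installs, and the names strip_tags, _TextExtractor and payload are hypothetical, not part of the original code. It also assumes rank_display is always a string, as in the sample record.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper class: collects only the text nodes of a fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are ignored.
        self.parts.append(data)


def strip_tags(html):
    """Hypothetical helper: return the text of an HTML fragment without tags."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()  # flush any text still buffered by the parser
    return "".join(extractor.parts).strip()


def parser_page(payload):
    """Variant of the question's parser that yields a tag-free title."""
    if not payload:
        return
    for item in payload.get("data", []):
        yield {
            # "=5" becomes "5"; values without "=" pass through unchanged.
            "rank_display": str(item["rank_display"]).split("=")[-1],
            "title": strip_tags(item["title"]),
            "region": item["region"],
            "score": item["score"],
        }
```

With BeautifulSoup installed, the body of strip_tags could instead be the single line from the example above, BeautifulSoup(html, "html.parser").get_text(strip=True), and the rest of the generator would stay the same.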
Upvotes: 1