Reputation: 71
I'm currently building a program that parses Wikipedia to display a country's mountains on a map.
I've been able to locate the URL of interest, but I'm having trouble following it to the new page (where all the desired data lies).
Any and all suggestions, including the use of other libraries, are greatly appreciated!
import requests
from bs4 import BeautifulSoup
from csv import writer
import urllib3
#Requests country name from user
user_input=input('Enter Country:')
fist_letter=user_input[0:1].upper()
country=fist_letter+user_input[1:] #takes the country name and capitalizes the first letter
#Request response for wikipedia parse
response=requests.get('https://en.wikipedia.org/wiki/Category:Lists_of_mountains_by_country')
bs=BeautifulSoup(response.text,'html.parser')
#country query
for content in bs.find_all(class_='mw-category')[1]:
    category_letter=content.find('h3')
    #Locates target category to find the country of interest
    if fist_letter in category_letter:
        country_lists=category_letter.find_next_sibling('ul')
        #Locates the country of interest from the lists of countries in target
        #category
        target=country_lists.find('li',text="List of mountains in "+str(country))
        #Grabs the link which will redirect to the page containing the list of
        #mountains for the country of interest.
        target_link=target.find('a')
        link=target_link.get('href')
        new_link='https://enwikipedia.org'+link
        #Attempts to redirect to the target link
        new_response=requests.get(new_link)
        BS=BeautifulSoup(new_response.text,'html.parser')
        mountain_list=content.find('tbody')
        print(mountain_list)
    else:
        pass
Upvotes: 1
Views: 243
Reputation: 14906
I like to parse HTML with Python's string methods split() and find(). Splitting with only a single cut (a maxsplit of 1) gives a left and a right piece, and you can take either one by index, e.g.: html_str.split('<a href="', 1)[1]
Anyway, once the code has split out the correct URL, it's just a matter of fetching that page and parsing it the same way. Oh, and it might be worthwhile to check for HTTP errors.
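To make the single-cut idea concrete, here is a minimal sketch on a literal string. The HTML fragment and the helper name first_href are made up for illustration; the real category page may be shaped differently.

```python
# A made-up fragment shaped like one list item on the category page (assumption).
html_str = (
    '<ul><li><a href="/wiki/List_of_mountains_in_Uganda" '
    'title="List of mountains in Uganda">Uganda</a></li></ul>'
)


def first_href(html: str) -> str:
    """Return the first href value in html, or '' if there is no link."""
    marker = '<a href="'
    if marker not in html:
        return ""
    # One cut at the marker: [0] is the left piece, [1] is the right piece.
    right = html.split(marker, 1)[1]
    # One more cut at the closing quote: keep the left piece this time.
    return right.split('"', 1)[0]


print(first_href(html_str))
```

The full script that follows applies the same kind of cuts to the live category page.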
import requests
import urllib3
#Requests country name from user
user_input = input('Enter Country:')
country = user_input.strip().lower().capitalize()
#Request response for wikipedia parse
response = requests.get('https://en.wikipedia.org/wiki/Category:Lists_of_mountains_by_country')
response_body = str( response.content, "utf-8" )
# Find the "By Country" section in the HTML result
# This section begins at the Title "Lists of mountains by country"
country_section = response_body.split( 'Pages in category "Lists of mountains by country"' )[1]
search_term = "in_" + country
if ( country_section.find( search_term ) != -1 ):
    # each country URL begins "<li><a href="/wiki/List_of_mountains_..."
    country_urls = country_section.split('<li><a href="')
    for url in country_urls:
        if ( url.find( search_term ) != -1 ):
            # The URL ends "..._in_Uganda" title="List o..."
            # Split off the Right-Side text
            found_url = "https://en.wikipedia.org" + url.split('" title=')[0]
            print( "DEBUG: URL Is [" + found_url + "]" )
            ## Now fetch the country-url
            response = requests.get( found_url )
            response_body = str( response.content, "utf-8" )
            ### TODO - process mountain list
else:
    print( "That country [" + country + "] does not have an entry" )
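The script above does not yet do the HTTP error check mentioned earlier. One possible way to add it, as a rough sketch, is to wrap the fetch in a small helper that calls raise_for_status() from requests. The helper name fetch_html, the optional session parameter, and the 10 second timeout are my own choices for illustration, not anything from the original code.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the page body as text, raising requests.HTTPError on a 4xx/5xx reply."""
    # The session is injectable (hypothetical parameter) so the helper can be
    # exercised without network access; by default a fresh Session is used.
    if session is None:
        session = requests.Session()
    response = session.get(url, timeout=timeout)
    # Does nothing on success; raises requests.HTTPError for 4xx/5xx statuses.
    response.raise_for_status()
    return response.text
```

With something like this in place, each requests.get(...) plus str(response.content, "utf-8") pair could become a single fetch_html(...) call, and a mistyped or missing page would raise an exception instead of silently returning an error page.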
Upvotes: 1
Reputation: 98941
https://enwikipedia.org: shouldn't that be https://en.wikipedia.org?
Anyway, it would be easier to add just the country name to:
https://en.wikipedia.org/wiki/Category:Lists_of_mountains_of_**COUNTRYNAME**
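A small sketch of both points, using only the standard library: resolving a scraped href against one base constant avoids retyping the hostname, and a helper can append the country name to a title prefix. The names WIKI_BASE, absolute_url and page_url are hypothetical, and the prefix passed in below is only an example taken from the URL shape seen elsewhere in this thread; I have not checked that such a page exists for every country.

```python
from urllib.parse import quote, urljoin

# Hostname from the question, defined once so it cannot be mistyped later.
WIKI_BASE = "https://en.wikipedia.org"


def absolute_url(href: str) -> str:
    """Resolve a site-relative href such as '/wiki/...' against WIKI_BASE."""
    return urljoin(WIKI_BASE, href)


def page_url(prefix: str, country: str) -> str:
    """Append a country name to a title prefix, with underscores for spaces."""
    title = prefix + country.strip().replace(" ", "_")
    # Keep ':' readable in the path; percent-encode other unsafe characters.
    return absolute_url("/wiki/" + quote(title, safe=":_"))


print(absolute_url("/wiki/List_of_mountains_in_Uganda"))
print(page_url("List_of_mountains_in_", "New Zealand"))
```

Whichever prefix you use, it is worth checking the HTTP status of the result, since a guessed title may simply not exist.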
Upvotes: 1