Addem

Reputation: 3919

Using Python to Automate Web Searches

I'd like to automate what I've been doing by going to a website and repeatedly searching. In particular I've been going to This Website, scrolling down near the bottom, clicking the "Upcoming" tab, and searching various cities.

I'm a novice at Python and I'd like to be able to just type a list of cities to enter for the search, and get an output that aggregates all of the search results. So for instance, the following functionality would be great:

cities = ['NEW YORK, NY', 'LOS ANGELES, CA']
print(getLocations(cities))

and it would print

Palm Canyon Theatre PALM SPRINGS, CA    01/22/2016  02/07/2016
...

and so on, listing all of the search results for a 100-mile radius around each of the cities entered.

I've tried looking at the documentation for the requests module (the Apache2-licensed HTTP library) and I ran

r = requests.get('http://www.tamswitmark.com/shows/anything-goes-beaumont-1987/')
r.content

And it printed all of the HTML of the webpage, so that seems like a minor victory, although I'm not sure what to do with it.

Help would be greatly appreciated, thank you.

Upvotes: 0

Views: 10129

Answers (1)

FriC

Reputation: 816

You have two questions rolled into one, so here is a partial answer to start you off. The first task concerns HTML parsing, so let's use the Python libraries requests and beautifulsoup4 (pip install beautifulsoup4 in case you haven't already).

import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.tamswitmark.com/shows/anything-goes-beaumont-1987/')
soup = BeautifulSoup(r.content, 'html.parser')
rows = soup.find_all('tr', {"class": "upcoming_performance"})

soup is a navigable data structure of the page content. We use the find_all method (spelled findAll in older BeautifulSoup versions) on soup to extract the 'tr' elements with class 'upcoming_performance'. A single element of rows looks like:

print(rows[0])  # debug statement to examine the content
"""
<tr class="upcoming_performance" data-lat="47.6007" data-lng="-120.655" data-zip="98826">
<td class="table-margin"></td>
<td class="performance_organization">Leavenworth Summer Theater</td>
<td class="performance_city-state">LEAVENWORTH, WA</td>
<td class="performance_date-from">07/15/2015</td>
<td class="performance_date_to">08/28/2015</td>
<td class="table-margin"></td>
</tr>
"""

Now, let's extract the data from these rows into our own data structure. For each row, we will create a dictionary for that performance.

The data-* attributes of each tr element are available through dictionary key lookup.

The 'td' elements inside each tr element can be accessed using the .children (or .contents) attribute.

performances = []  # list of dicts, one per performance
for tr in rows:
    # extract the data-* using dictionary key lookup on tr 
    p = dict(
        lat=float(tr['data-lat']),
        lng=float(tr['data-lng']),
        zipcode=tr['data-zip']
    )
    # extract the td children into a list called tds
    tds = [child for child in tr.children if child != "\n"]
    # the class of each td indicates what type of content it holds
    for td in tds:
        key = td['class'][0]  # get the first element of the class list
        p[key] = td.string  # get the string inside the td tag
    # add to our list of performances
    performances.append(p)

At this point, we have a list of dictionaries in performances. The keys in each dict are:

lat : float

lng: float

zipcode: str

performance_city-state: str

performance_organization: str

etc
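Once performances is populated, working with it is ordinary list-of-dicts Python. A quick sketch of filtering by state (the sample data here is made up from the row shown above plus your expected output, just to keep the snippet self-contained):

```python
# Hypothetical sample of what `performances` might contain after parsing.
performances = [
    {'lat': 47.6007, 'lng': -120.655, 'zipcode': '98826',
     'performance_organization': 'Leavenworth Summer Theater',
     'performance_city-state': 'LEAVENWORTH, WA'},
    {'lat': 33.8303, 'lng': -116.545, 'zipcode': '92262',
     'performance_organization': 'Palm Canyon Theatre',
     'performance_city-state': 'PALM SPRINGS, CA'},
]

# Keep only California performances by checking the city-state suffix.
ca_shows = [p for p in performances
            if p['performance_city-state'].endswith(', CA')]
print(ca_shows[0]['performance_organization'])  # Palm Canyon Theatre
```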

HTML extraction is done. Your next step is to use a mapping service that compares the distance from your desired location to the lat/lng values in performances. For example, you could use the Google Maps Geocoding API to turn each city name into coordinates. There are plenty of answered questions on SO to guide you.
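Since each row already carries lat/lng, you can also skip the mapping API for the distance part and compute great-circle distances yourself with the haversine formula. A sketch (city_lat/city_lng stand in for coordinates you would look up for each search city; they are not provided by the page):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lng1, lat2, lng2):
    """Great-circle distance in miles between two (lat, lng) points."""
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2)
    return 2 * 3959 * asin(sqrt(a))  # 3959 mi is roughly Earth's radius

# Hypothetical usage against the parsed rows, once you have the
# search city's coordinates:
# nearby = [p for p in performances
#           if haversine_miles(city_lat, city_lng, p['lat'], p['lng']) <= 100]
```

This keeps the 100-mile-radius filter entirely local; a geocoding API is then only needed to translate city names into coordinates.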

Upvotes: 1
