Daniel

Reputation: 691

Nested, Same-Level For Loop, Output to List

I am having trouble appending data into a list as I iterate through the following:

import urllib
import urllib.request
from bs4 import BeautifulSoup
import pandas

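def make_soup(url):
    # Send the User-Agent with the request itself; setting addheaders on the
    # response object after urlopen() has no effect on what was sent
    request = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    thepage = urllib.request.urlopen(request)
    soupdata = BeautifulSoup(thepage, 'html.parser')
    return soupdata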

soup = make_soup('https://www.wellstar.org/locations/pages/default.aspx')

locationdata = []

for table in soup.findAll('table', class_ = 's4-wpTopTable'):
   for name in table.findAll('div', 'PurpleBackgroundHeading'):
       name = name.get_text(strip = True)
   for loc_type in table.findAll('h3', class_ = 'WebFont SpotBodyGreen'):
       loc_type = loc_type.get_text()
   for address in table.findAll('div', class_ = ['WS_Location_Address', 'WS_Location_Adddress']):
       address = address.get_text(strip = True, separator = ' ')
       locationdata.append([name, loc_type, address])

df = pandas.DataFrame(columns = ['name', 'loc_type', 'address'], data = locationdata)
print(df)

The resulting DataFrame contains every unique address, but the name column always holds the last name found in that table.

For example, even though 'WellStar Windy Hill Hospital' is the last hospital within the hospital category/type, it appears as the name for all hospitals. If possible, I prefer a list.append solution as I have several more, similar steps to go through to finalize this project.

Upvotes: 0

Views: 458

Answers (1)

mechanical_meat

Reputation: 169484

This happens because the name and loc_type loops run to completion before the address loop starts, so by the time you append to locationdata, name and loc_type hold only their last values.
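
A minimal sketch of the same effect, using made-up sample lists in place of the scraped page (the names and addresses are hypothetical stand-ins): once a for loop finishes, its variable still holds the last item, so every row appended afterwards reuses it.

```python
# Hypothetical stand-ins for what one table might yield
names = ['Hospital A', 'Hospital B', 'Hospital C']
addresses = ['1 First St', '2 Second St', '3 Third St']

rows = []
for name in names:
    pass  # the loop runs to completion; name is left as the last item
for address in addresses:
    rows.append([name, address])  # name never changes inside this loop

print(rows)
```

Every row ends up paired with the last name, which matches the symptom described in the question.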

You can instead do:

import itertools as it
from pprint import pprint as pp

for table in soup.findAll('table', class_='s4-wpTopTable'):
  # Collect each column for this table first
  names = [name.get_text(strip=True) for
           name in table.findAll('div', 'PurpleBackgroundHeading')]
  loc_types = [loc_type.get_text() for
               loc_type in table.findAll('h3', class_='WebFont SpotBodyGreen')]
  addresses = [address.get_text(strip=True, separator=' ') for
               address in table.findAll('div', class_=['WS_Location_Address',
                                                       'WS_Location_Adddress'])]

  # Pair the columns row by row inside the table loop so every table is kept.
  # zip_longest (izip_longest in Python 2) pads the shorter lists with None.
  for name, loc_type, address in it.zip_longest(names, loc_types, addresses):
    locationdata.append([name, loc_type, address])

Result (output captured under Python 2, hence the u'' prefixes; the rows shown are the urgent care locations):

>>> pp(locationdata)
[[u'WellStar Urgent Care in Acworth',
  u'WellStar Urgent Care Centers',
  u'4550 Cobb Parkway NW Suite 101 Acworth, GA 30101 770-917-8140'],
 [u'WellStar Urgent Care in Kennesaw',
  None,
  u'3805 Cherokee Street Kennesaw, GA 30144 770-426-5665'],
 [u'WellStar Urgent Care in Marietta - Delk Road',
  None,
  u'2890 Delk Road Marietta, GA 30067 770-955-8620'],
 [u'WellStar Urgent Care in Marietta - East Cobb',
  None,
  u'3747 Roswell Road Ne Suite 107 Marietta, GA 30062 470-956-0150'],
 [u'WellStar Urgent Care in Marietta - Kennestone',
  None,
  u'818 Church Street Suite 100 Marietta, GA 30060 770-590-4190'],
 [u'WellStar Urgent Care in Marietta - Sandy Plains Road',
  None,
  u'3600 Sandy Plains Road Marietta, GA 30066 770-977-4547'],
 [u'WellStar Urgent Care in Smyrna',
  None,
  u'4480 North Cooper Lake Road SE Suite 100 Smryna, GA 30082 770-333-1300'],
 [u'WellStar Urgent Care in Woodstock',
  None,
  u'120 Stonebridge Parkway Suite 310 Woodstock, GA 30189 678-494-2500']]
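
The None entries come from the shorter loc_types list being padded, since the table shown appears to have a single heading. If that heading is meant to apply to every row of its table, one possible variation is to reuse it for each row. This is only a sketch on hypothetical sample lists (not the live page), written for Python 3, and it assumes one heading per table:

```python
from itertools import zip_longest

# Hypothetical stand-ins for the lists built from one table
names = ['Clinic A', 'Clinic B', 'Clinic C']
loc_types = ['Urgent Care Centers']
addresses = ['1 First St', '2 Second St', '3 Third St']

# Assumption: each table has at most one heading that covers all of its rows
heading = loc_types[0] if loc_types else None

locationdata = []
for name, address in zip_longest(names, addresses):
    locationdata.append([name, heading, address])

print(locationdata)
```

Names and addresses are still paired with zip_longest, so a missing address would show up as None rather than silently dropping the row.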

Upvotes: 1
