FakeHelicopterPilot

Reputation: 5

How do I deal with empty list items while scraping web data?

I'm trying to scrape data into a CSV file from a website that lists contact information for people in my industry. My code works well until I get to a page where one of the entries doesn't have a specific item.

For example, I'm trying to collect:

Name, Phone, Profile URL

If there isn't a phone number listed, there isn't even a tag for that field on the page, and my code errors out with

"IndexError: list index out of range"

I'm pretty new to this, but what I've managed to cobble together so far from various YouTube tutorials and this site has saved me a ton of time on tasks that would otherwise take me days. I'd appreciate any help that anyone is willing to offer.

I've tried various if/else statements that set the variable to "Empty" when it is null.

Edit:

I updated the code. I switched to CSS selectors for more specificity and readability. I also added a try/except to at least bypass the IndexError, but that doesn't solve the problem of incorrect data being stored when the fields have uneven numbers of entries. The site I'm trying to scrape is now in the code as well.

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
driver = webdriver.Firefox()


MAX_PAGE_NUM = 5
MAX_PAGE_DIG = 2

with open('results.csv', 'w') as f:
    f.write("Name, Number, URL \n")

#Run Through Pages
for i in range(1, MAX_PAGE_NUM + 1):
    page_num = (MAX_PAGE_DIG - len(str(i))) * "0" + str(i)
    website = "https://www.realtor.com/realestateagents/lansing_mi/pg-" + page_num
    driver.get(website)

    Name = driver.find_elements_by_css_selector('div.agent-list-card-title-text.clearfix > div.agent-name.text-bold > a')
    Number = driver.find_elements_by_css_selector('div.agent-list-card-title-text.clearfix > div.agent-phone.hidden-xs.hidden-xxs')
    URL = driver.find_elements_by_css_selector('div.agent-list-card-title-text.clearfix > div.agent-name.text-bold > a')

#Collect Data From Each Page
    num_page_items = len(Name)
    with open('results.csv', 'a') as f:
        for i in range(num_page_items):
            try:
                f.write(Name[i].text.replace(",", ".") + "," + Number[i].text + "," + URL[i].get_attribute('href') + "\n")
                print(Name[i].text.replace(",", ".") + "," + Number[i].text + "," + URL[i].get_attribute('href') + "\n")
            except IndexError:
                f.write("Skip, Skip, Skip \n")
                print("Number Missing")
                continue


driver.close()

If any of the fields I'm trying to collect don't exist on individual listings, I just want the empty field to be filled in as "Empty" on the spreadsheet.

Upvotes: 0

Views: 2166

Answers (1)

chitown88

Reputation: 28650

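You could use try/except to take care of that, wrapping each field lookup separately and searching inside one agent card at a time, so a missing phone number only blanks that field instead of shifting the rows that follow. I also opted to use pandas and BeautifulSoup, as I'm more familiar with those.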

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

# Adjust the path to your chromedriver; newer Selenium releases may expect
# a Service object (or no argument at all) instead of a positional path.
driver = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')

MAX_PAGE_NUM = 5
MAX_PAGE_DIG = 2

# Collect one [Name, Number, URL] row per agent and build the DataFrame once
# at the end (DataFrame.append was removed in pandas 2.0).
rows = []

# Run through pages
for i in range(1, MAX_PAGE_NUM + 1):
    page_num = str(i).zfill(MAX_PAGE_DIG)  # zero-pad, e.g. 1 -> "01"
    website = "https://www.realtor.com/realestateagents/lansing_mi/pg-" + page_num
    driver.get(website)

    soup = BeautifulSoup(driver.page_source, 'html.parser')
    agent_cards = soup.find_all('div', {'class': 'agent-list-card clearfix'})

    for agent in agent_cards:
        # find() returns None when the tag is absent, so .text raises
        # AttributeError and the field is recorded as None.
        try:
            Name = agent.find('div', {'itemprop': 'name'}).text.strip().split('\n')[0]
        except AttributeError:
            Name = None

        try:
            Number = agent.find('div', {'itemprop': 'telephone'}).text.strip()
        except AttributeError:
            Number = None

        # Subscripting None raises TypeError when the card has no link.
        # The hrefs already start with '/', hence the '//' in the output.
        try:
            URL = 'https://www.realtor.com/' + agent.find('a', href=True)['href']
        except TypeError:
            URL = None

        rows.append([Name, Number, URL])
    print('Processed page: %s' % i)

driver.close()

results = pd.DataFrame(rows, columns=['Name', 'Number', 'URL'])
results.to_csv('results.csv', index=False)

Output:

print (results)
                                   Name  ...                                                URL
0                            Nicole Enz  ...  https://www.realtor.com//realestateagents/nico...
1                  Jennifer Worthington  ...  https://www.realtor.com//realestateagents/jenn...
2                      Katherine Keener  ...  https://www.realtor.com//realestateagents/kath...
3                            Erica Cook  ...  https://www.realtor.com//realestateagents/eric...
4   Jeff Thornton, Broker, Assoc Broker  ...  https://www.realtor.com//realestateagents/jeff...
5                   Neal Sanford, Agent  ...  https://www.realtor.com//realestateagents/neal...
6                           Sherree Zea  ...  https://www.realtor.com//realestateagents/sher...
7                       Jennifer Cooper  ...  https://www.realtor.com//realestateagents/jenn...
8                      Charlyn Cosgrove  ...  https://www.realtor.com//realestateagents/char...
9          Kathy Birchen & Chad Dutcher  ...  https://www.realtor.com//realestateagents/kath...
10                        Nancy Petroff  ...  https://www.realtor.com//realestateagents/nanc...
11              The Angela Averill Team  ...  https://www.realtor.com//realestateagents/the-...
12                  Christina Tamburino  ...  https://www.realtor.com//realestateagents/chri...
13                      Rayce O'Connell  ...  https://www.realtor.com//realestateagents/rayc...
14                      Stephanie Morey  ...  https://www.realtor.com//realestateagents/step...
15                         Sean Gardner  ...  https://www.realtor.com//realestateagents/sean...
16                            John Burg  ...  https://www.realtor.com//realestateagents/john...
17                Linda Ellsworth-Moore  ...  https://www.realtor.com//realestateagents/lind...
18                         David Bueche  ...  https://www.realtor.com//realestateagents/davi...
19                       David Ledebuhr  ...  https://www.realtor.com//realestateagents/davi...
20                            Aaron Fox  ...  https://www.realtor.com//realestateagents/aaro...
21                       Kristy Seibold  ...  https://www.realtor.com//realestateagents/kris...
22                        Genia Beckman  ...  https://www.realtor.com//realestateagents/geni...
23                         Angela Bolan  ...  https://www.realtor.com//realestateagents/ange...
24                      Constance Benca  ...  https://www.realtor.com//realestateagents/cons...
25                            Lisa Fata  ...  https://www.realtor.com//realestateagents/lisa...
26                          Mike Dedman  ...  https://www.realtor.com//realestateagents/mike...
27                        Jamie Masarik  ...  https://www.realtor.com//realestateagents/jami...
28                           Amy Yaroch  ...  https://www.realtor.com//realestateagents/amy-...
29                      Debbie McCarthy  ...  https://www.realtor.com//realestateagents/debb...
..                                  ...  ...                                                ...
70                      Vickie Blattner  ...  https://www.realtor.com//realestateagents/vick...
71                      Faith F Steller  ...  https://www.realtor.com//realestateagents/fait...
72                      A.  Jason Titus  ...  https://www.realtor.com//realestateagents/a.--...
73                            Matt Bunn  ...  https://www.realtor.com//realestateagents/matt...
74                           Joe Vitale  ...  https://www.realtor.com//realestateagents/joe-...
75                   Reozom Real Estate  ...  https://www.realtor.com//realestateagents/reoz...
76                        Shane Broyles  ...  https://www.realtor.com//realestateagents/shan...
77                   Megan Doyle-Busque  ...  https://www.realtor.com//realestateagents/mega...
78                         Linda Holmes  ...  https://www.realtor.com//realestateagents/lind...
79                           Jeff Burke  ...  https://www.realtor.com//realestateagents/jeff...
80                        Jim Convissor  ...  https://www.realtor.com//realestateagents/jim-...
81                  Concetta D'Agostino  ...  https://www.realtor.com//realestateagents/conc...
82                     Melanie McNamara  ...  https://www.realtor.com//realestateagents/mela...
83                          Julie Adams  ...  https://www.realtor.com//realestateagents/juli...
84                          Liz Horford  ...  https://www.realtor.com//realestateagents/liz-...
85                         Miriam Olsen  ...  https://www.realtor.com//realestateagents/miri...
86                       Wanda Williams  ...  https://www.realtor.com//realestateagents/wand...
87                         Troy Seyfert  ...  https://www.realtor.com//realestateagents/troy...
88                        Maggie Gerich  ...  https://www.realtor.com//realestateagents/magg...
89                 Laura Farhat Bramson  ...  https://www.realtor.com//realestateagents/laur...
90                      Peter MacIntyre  ...  https://www.realtor.com//realestateagents/pete...
91                        Mark Jacobsen  ...  https://www.realtor.com//realestateagents/mark...
92                             Deb Good  ...  https://www.realtor.com//realestateagents/deb-...
93                 Mary Jane Vanderstow  ...  https://www.realtor.com//realestateagents/mary...
94                           Ben Magsig  ...  https://www.realtor.com//realestateagents/ben-...
95                   Brenna Chamberlain  ...  https://www.realtor.com//realestateagents/bren...
96                  Deborah Cooper, CNS  ...  https://www.realtor.com//realestateagents/debo...
97            Huggler, Bashore & Brooks  ...  https://www.realtor.com//realestateagents/hugg...
98             Jodey Shepardson Custack  ...  https://www.realtor.com//realestateagents/jode...
99              Madaline Alspaugh-Young  ...  https://www.realtor.com//realestateagents/mada...

[100 rows x 3 columns]
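
The code above stores None for a missing field, while the question asked for the literal text "Empty" and worked around commas in names by replacing them with periods. Below is a minimal standard-library sketch of both points, assuming the scraped rows are available as plain dictionaries. The records, the example.com URLs, and the helper names fill_missing and write_rows are made up for illustration and are not part of the original answer.

```python
import csv
import io

FIELDS = ["Name", "Number", "URL"]


def fill_missing(record, placeholder="Empty"):
    """Return a copy of record where absent, None, or blank fields become placeholder."""
    return {field: record.get(field) or placeholder for field in FIELDS}


def write_rows(records, out):
    """Write records as CSV; the csv module quotes any value that contains a comma."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for record in records:
        writer.writerow(fill_missing(record))


# Hypothetical records standing in for scraped agent cards.
records = [
    {"Name": "Jane Doe, Broker", "Number": "(555) 010-0000", "URL": "https://example.com/jane"},
    {"Name": "John Roe", "Number": None, "URL": "https://example.com/john"},
]

# Written to an in-memory buffer here; a real run would pass an open file instead.
buffer = io.StringIO()
write_rows(records, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Because the name containing a comma is quoted rather than altered, it reads back unchanged. If you stay with pandas, passing na_rep='Empty' to results.to_csv should give the same placeholder without a separate helper.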

Upvotes: 1
