JGNT

Reputation: 33

Problem with Web Scraping to CSV [AttributeError: 'str' object has no attribute 'text']

I am trying to build an automated web scraper, and I have spent hours watching YouTube videos and reading posts here. I'm new to programming (I started one month ago) and new to this community.

Using VS Code as my IDE, I followed the format of this code (Python and Selenium), which actually worked as a web scraper:


from selenium import webdriver
import time
from selenium.webdriver.support.select import Select

with open('job_scraping_multipe_pages.csv', 'w') as file:
    file.write("Job_title, Location, Salary, Contract_type, Job_description \n")
    
driver= webdriver.Chrome()
driver.get('https://www.jobsite.co.uk/')

driver.maximize_window()
time.sleep(1)

cookie= driver.find_element_by_xpath('//button[@class="accept-button-new"]')
try:
    cookie.click()
except:
    pass 

job_title=driver.find_element_by_id('keywords')
job_title.click()
job_title.send_keys('Software Engineer')
time.sleep(1)

location=driver.find_element_by_id('location')
location.click()
location.send_keys('Manchester')
time.sleep(1)

dropdown=driver.find_element_by_id('Radius')
radius=Select(dropdown)
radius.select_by_visible_text('30 miles')
time.sleep(1)

search=driver.find_element_by_xpath('//input[@value="Search"]')
search.click()
time.sleep(2)

for k in range(3):
    titles=driver.find_elements_by_xpath('//div[@class="job-title"]/a/h2')
    location=driver.find_elements_by_xpath('//li[@class="location"]/span')
    salary=driver.find_elements_by_xpath('//li[@title="salary"]')
    contract_type=driver.find_elements_by_xpath('//li[@class="job-type"]/span')
    job_details=driver.find_elements_by_xpath('//div[@title="job details"]/p')

    with open('job_scraping_multipe_pages.csv', 'a') as file:
        for i in range(len(titles)):
            file.write(titles[i].text + "," + location[i].text + "," + salary[i].text + "," + contract_type[i].text + ","+
                      job_details[i].text + "\n")

        
        next=driver.find_element_by_xpath('//a[@aria-label="Next"]')
        next.click()
    file.close()
driver.close()

It worked. I then tried to replicate the results for another website. Instead of clicking the 'next' button, I found a way to make the number at the end of the URL increase by 1. But my problem comes from the last part of the code, which gives me AttributeError: 'str' object has no attribute 'text'. Here is the code for the website I was targeting (https://angelmatch.io/pitch_decks/5285), in Python and Selenium:


from selenium import webdriver
import time
from selenium.webdriver.support.select import Select

driver = webdriver.Chrome()


with open('pitchDeckResults2.csv', 'w' ) as file:
    file.write("Startup_Name, Startup_Description, Link_Deck_URL, Startup_Website, Pitch_Deck_PDF, Industries, Amount_Raised, Funding_Round, Year /n")




    for k in range(5285, 5287, 1):
        
        linkDeck = "https://angelmatch.io/pitch_decks/" + str(k)        

        driver.get(linkDeck)
        driver.maximize_window
        time.sleep(2)

        startupName = driver.find_elements_by_xpath('/html/body/div[1]/div[2]/div[2]/div/div/div[1]')
        startupDescription = driver.find_elements_by_xpath('/html/body/div[1]/div[2]/div[2]/div/div/div[3]/p[2]')
        startupWebsite = driver.find_elements_by_xpath('/html/body/div[1]/div[2]/div[3]/div[1]/div/p[3]/a')
        pitchDeckPDF = driver.find_elements_by_xpath('/html/body/div[1]/div[2]/div[3]/div[1]/div/button/a')
        industries = driver.find_elements_by_xpath('/html/body/div[1]/div[2]/div[3]/div[1]/div/a[2]')
        amountRaised = driver.find_elements_by_xpath('/html/body/div[1]/div[2]/div[3]/div[1]/div/p[1]/b')
        fundingRound = driver.find_elements_by_xpath('/html/body/div[1]/div[2]/div[3]/div[1]/div/a[1]')
        year = driver.find_elements_by_xpath('/html/body/div[1]/div[2]/div[3]/div[1]/div/p[2]/b')

        

        with open('pitchDeckResults2.csv', 'a') as file:
            for i in range(len(startupName)):
                file.write(startupName[i].text + "," + startupDescription[i].text + "," + linkDeck[i].text + "," + startupWebsite[i].text + "," + pitchDeckPDF[i].text + "," + industries[i].text + "," + amountRaised[i].text + "," + fundingRound[i].text + "," + year[i].text +"\n")

            time.sleep(1)

        file.close()

driver.close()

I'll appreciate any help! I am trying to get the data into CSV using this technique!

Upvotes: 3

Views: 363

Answers (1)

Vova

Reputation: 3541

You're doing great, honestly. The error appears because you're trying to read a .text attribute from a string value: linkDeck is a str, and Python's str type has no text attribute. Moreover, you're indexing it with [i], which can also raise an "index out of range" exception. What did you want to write in place of linkDeck[i].text? Maybe the page title, or something else?
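
To make the error concrete, here is a minimal sketch that runs without Selenium. FakeElement is a hypothetical stand-in for a WebElement (it is not part of Selenium), and the startup name is made up; only the linkDeck string comes from your code:

```python
class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement; it only has .text."""

    def __init__(self, text):
        self.text = text


link_deck = "https://angelmatch.io/pitch_decks/" + str(5285)  # a plain str
startup_name = [FakeElement("Some startup")]  # a list of elements, like find_elements_* gives you

# Elements have a .text attribute, so this works.
name = startup_name[0].text

# link_deck[0] is just the one-character string "h", and str has no .text.
try:
    link_deck[0].text
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(name)
print(error_message)
```

So the URL can go into the row directly as linkDeck, with no [i] and no .text.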

By the way, you don't need to close the file when you use a with open() statement. It's a context manager, which closes the file for you as soon as you leave the with block.
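
A small sketch of that behaviour, assuming a throwaway file in a temporary directory (the name demo.csv is only for illustration) so it doesn't touch your real CSV:

```python
import os
import tempfile

# Hypothetical throwaway path, just for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")

with open(path, "w") as f:
    f.write("Startup_Name;Year\n")
    inside = f.closed  # still open while we are inside the block

after = f.closed  # the with statement has closed the file on exit

print(inside, after)
```

An extra file.close() after the block is therefore redundant.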

I added the parentheses to maximize_window(), removed the second file opening, and wrote the link itself instead of linkDeck[i].text:

import time

from selenium import webdriver

driver = webdriver.Chrome()
delimiter = ';'
# A single with block: the file is closed automatically when the block ends.
with open('pitchDeckResults2.csv', 'w+') as _file:
    _l = ['Startup_Name', 'Startup_Description', 'Link_Deck_URL', 'Startup_Website', 'Pitch_Deck_PDF', 'Industries',
          'Amount_Raised', 'Funding_Round', 'Year\n']
    _file.write(delimiter.join(_l))
    for k in range(5285, 5287, 1):
        linkDeck = "https://angelmatch.io/pitch_decks/" + str(k)

        driver.get(linkDeck)
        driver.maximize_window()
        time.sleep(1)

        # find_element_* (singular) returns one element, so no [i] indexing is needed.
        startupName = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div[2]/div/div/div[1]')
        startupDescription = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div[2]/div/div/div[3]/p[2]')
        startupWebsite = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div[3]/div[1]/div/p[3]/a')
        pitchDeckPDF = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div[3]/div[1]/div/button/a')
        industries = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div[3]/div[1]/div/a[2]')
        amountRaised = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div[3]/div[1]/div/p[1]/b')
        fundingRound = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div[3]/div[1]/div/a[1]')
        year = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div[3]/div[1]/div/p[2]/b')

        # linkDeck is already a str, so it is written as-is, without .text.
        all_elements = [startupName.text, startupDescription.text, linkDeck, startupWebsite.text, pitchDeckPDF.text,
                        industries.text, amountRaised.text, fundingRound.text, f"{year.text}\n"]
        _str = delimiter.join(all_elements)
        _file.write(_str)

driver.close()
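
One optional extra, and a swap from the code above rather than something you have to do: if a field can contain the delimiter or a line break (descriptions often do), joining strings by hand gives broken rows. The standard library csv module quotes such fields for you. A minimal sketch, with made-up row values standing in for the scraped .text strings and an in-memory buffer standing in for the real file:

```python
import csv
import io

header = ['Startup_Name', 'Startup_Description', 'Link_Deck_URL']
# Made-up values standing in for the scraped .text strings.
row = ['Example Startup', 'Payments; invoicing, and more',
       'https://angelmatch.io/pitch_decks/5285']

# Stands in for open('pitchDeckResults2.csv', 'w', newline='').
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=';')
writer.writerow(header)
writer.writerow(row)  # the description is quoted because it contains ';'

# Reading it back yields the original fields, delimiter included.
buffer.seek(0)
rows = list(csv.reader(buffer, delimiter=';'))
print(rows[1])
```

With a real file, open it with newline='' as the csv documentation recommends, and pass the list of .text values to writer.writerow().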

I might have missed something; let me know.

Upvotes: 1
