IvelinI

Reputation: 21

Trying to parse a website with Python: I want to save to CSV and run on multiple pages

I am new to Python and web scraping. I tried to do it myself but I got stuck.

I would like to scrape efinancialcareers.com for job offers. I wrote the code to get to the elements of the HTML, and I can print them to the console, but I need help saving the data to CSV and running the script on all result pages. Here is the code:

import requests
from bs4 import BeautifulSoup
import csv
import datetime
print datetime.datetime.now()
url = "http://www.efinancialcareers.com/search?page=1&sortBy=POSTED_DESC&searchMode=DEFAULT_SEARCH&jobSearchId=RUJFMEZDNjA2RTJEREJEMDcyMzlBQ0YyMEFDQjc1MjUuMTQ4NTE5MDY3NTI0Ni4tMTQ1Mjc4ODU3NQ%3D%3D&updateEmitter=SORT_BY&filterGroupForm.includeRefreshed=true&filterGroupForm.datePosted=OTHER"
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html,'lxml')

f = open ('EFINCAR.txt', 'w')
f.write('Job name;')
f.write('Salary;')
f.write('Location;')
f.write('Position;')
f.write('Company')
f.write('Date')
f.write('\n')


# Job name
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for details in container.find_all('li',{'class':'jobPreview well'}):
        for h3 in details.find_all('h3'):
            job=h3.find('a')
        print(job.text)

# Salary
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            salary=details.find('li',{'class':'salary'})
            print(salary.text)

# Location
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            location=details.find('li',{'class':'location'})
            print(location.text)

# Position
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            position=details.find('li',{'class':'position'})
            print(position.text)

# Company
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            company=details.find('li',{'class':'company'})
            print(company.text)

# Date
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            datetext=details.find('li',{'class':'updated'})
            print(datetext.text)

#       Attributes assignment section

#       Job Name
job_name = job.get_text()
f.write(job_name.encode('utf-8'))
f.write(';')

#       Salary

salary_name = salary.get_text()
f.write(salary_name.encode('utf-8'))
f.write(';')

#       location
location_name = location.get_text()
location_name = location_name.strip()
f.write(location_name.encode('utf-8'))
f.write(';')

#       position
position_name = position.get_text()
position_name = position_name.strip()
f.write(position_name.encode('utf-8'))
f.write(';')

#       company
company_name = company.get_text()
company_name = company_name.strip()
f.write(company_name.encode('utf-8'))
f.write(';')

#       Datetext
datetext_name = datetext.get_text()
datetext_name = datetext_name.strip()
f.write(datetext_name.encode('utf-8'))
f.write(';')
f.write('\n')

f.close()
print('Finished!')

Upvotes: 0

Views: 152

Answers (1)

bpavlov

Reputation: 1090

Welcome to StackOverflow!

Let's have a look at your code.

You have six three-level-nested for loops (18 for loops in total). As you can see, they are almost identical and all start with:

for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):

So instead of writing the same code six times, you could write it only once and do everything inside it. For example, this:

# Salary
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            salary=details.find('li',{'class':'salary'})
            print(salary.text)

# Location
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            location=details.find('li',{'class':'location'})
            print(location.text)

Could be written as:

# Salary & Location
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            location=details.find('li',{'class':'location'})
            salary=details.find('li',{'class':'salary'})
            print(salary.text)
            print(location.text)

It is considered good practice to write DRY (don't repeat yourself) code.

The reason you are seeing the parsed HTML data in the console is that you have print(XXXXX) calls inside your for loops: each element is printed to the console as it is parsed.

You are NOT seeing data in your text file (EFINCAR.txt) because your f.write(xxxx) calls are OUTSIDE your for loops. You should move them next to the print(xxxx) calls.

For example:

# Salary
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            salary=details.find('li',{'class':'salary'})
            print(salary.text)
            salary_name = salary.get_text()
            f.write(salary_name.encode('utf-8'))
            f.write(';')

When you do that you will notice that there is something wrong with parsing the html.

HINT: Be careful with tabs, new lines and whitespaces.

To save the data to CSV properly, you should strip them out while parsing. You could skip that, but the result may look ugly.
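Putting those pieces together, here is one way (a sketch in Python 3 syntax, not tested against the live site) to collect all six fields in a single pass and write them with Python's built-in csv module, stripping stray whitespace first. The jobs list below is made-up stand-in data for what the details.find(...).get_text() calls would return:

```python
import csv
import io

# Stand-in rows; in the real script each value would come from
# details.find('li', {'class': ...}).get_text() inside ONE loop pass.
jobs = [
    {'job': '  Quant Analyst\n', 'salary': 'Competitive ', 'location': ' London',
     'position': 'Permanent', 'company': ' BigBank ', 'updated': ' 25 Jan 17\n'},
]

fields = ['job', 'salary', 'location', 'position', 'company', 'updated']

buf = io.StringIO()                      # use open('EFINCAR.csv', 'w', newline='') for a real file
writer = csv.writer(buf, delimiter=';')
writer.writerow(fields)                  # header row
for row in jobs:
    # strip() removes the tabs/newlines mentioned in the hint above
    writer.writerow(row[f].strip() for f in fields)

print(buf.getvalue())
```

Letting csv.writer handle the delimiter also saves you from writing the ';' separators by hand (and from bugs like a missing separator between columns).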

Finally, if you want to run your script for a couple of pages, or for all of them, you should check how the page number appears in your request URL. For example, in your case for page 1 you have:

http://www.efinancialcareers.com/search?page=1XXXXXXXXXXXXXXX

for page 2 you have:

http://www.efinancialcareers.com/search?page=2XXXXXXXXXXXXXXX

That means that you should run your code with URL = http://www.efinancialcareers.com/search?page={NUMBER_OF_PAGE}XXXXXXXXXXXXXXX

where NUMBER_OF_PAGE runs from 1 to LAST_PAGE. So instead of hard-coding the URL, you can simply loop over the page numbers and generate each URL as described above.
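A minimal sketch of that loop (the query string is shortened here, just like in the URLs above; LAST_PAGE is a placeholder whose real value you would read from the site's pagination):

```python
# Build one URL per results page; '{page}' is substituted for each number.
# Only 'sortBy' from the original query string is kept in this shortened example.
BASE = 'http://www.efinancialcareers.com/search?page={page}&sortBy=POSTED_DESC'

LAST_PAGE = 3  # placeholder; determine the real last page from the site

urls = [BASE.format(page=n) for n in range(1, LAST_PAGE + 1)]
for url in urls:
    print(url)
    # response = requests.get(url)   # then parse response.content as above
```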

Upvotes: 1
