pyrish

Reputation: 65

Pandas Dataframe - Issue when writing Header

After doing some scraping, I get all my data, store it in a pandas df but I'm having an issue when writing the header. Since I'm scraping many pages of a job site, I had to create a loop that iterates through the pages and gets a different df per page, and when it's done, I save the df to a CSV file.

The problem is that the header gets written once per iteration, but I would like it to be written only once.

I have tried all the solutions presented in this previous question here, but I'm still not able to solve the problem. I apologize if this is a silly question, but I'm still learning and loving the journey. Any help, tip, or advice would be very welcome.
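
To make the problem concrete, here is a minimal sketch (made-up data, not my real scraper) of what I think is happening: to_csv() writes the header on every call when mode='a' and header=True.

import pandas as pd

# Minimal sketch with made-up data: every append writes the header row again
for i in range(2):
    df = pd.DataFrame({'Date': ['Posted today'], 'Company': ['ACME']})
    df.to_csv('example.csv', mode='a', header=True, index=False)

# example.csv now contains (assuming it didn't exist before):
# Date,Company
# Posted today,ACME
# Date,Company
# Posted today,ACME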

Here's my code:

def find_data(soup):
    l = []
    for div in soup.find_all('div', class_ = 'js_result_container'):
        d = {}
        try:
            d["Company"] = div.find('div', class_= 'company').find('a').find('span').get_text()
            d["Date"] = div.find('div', {'class':['job-specs-date', 'job-specs-date']}).find('p').find('time').get_text()
            pholder = div.find('div', class_= 'jobTitle').find('h2').find('a')
            d["URL"] = pholder['href']
            d["Role"] = pholder.get_text().strip()
            l.append(d)
        except:
            pass
    df = pd.DataFrame(l)
    df = df[['Date', 'Company', 'Role', 'URL']]
    df = df.dropna()
    df = df.sort_values(by=['Date'], ascending=False)
    df.to_csv("csv_files/pandas_data.csv", mode='a', header=True, index=False)

if __name__ == '__main__':

    f = open("csv_files/pandas_data.csv", "w")
    f.truncate()
    f.close()

    query = input('Enter role to search: ')
    max_pages = int(input('Enter number of pages to search: '))

    for i in range(max_pages):
        page = 'https://www.monster.ie/jobs/search/?q='+query+'&where=Dublin__2C-Dublin&sort=dt.rv.di&page=' + str(i+1)
        soup = getPageSource(page)
        print("Scraping Page number: " + str(i+1))
        find_data(soup)

Output:

Date,Company,Role,URL
Posted today,Solas IT,QA Engineer,https://job-openings.monster.ie/QA-Engineer-Dublin-Dublin-Ireland-Solas-IT/11/195166152
Posted today,Hays Ireland,Resident Engineer,https://job-openings.monster.ie/Resident-Engineer-Dublin-Dublin-Ireland-Hays-Ireland/11/195162741
Posted today,IT Alliance Group,Presales Consultant,https://job-openings.monster.ie/Presales-Consultant-Dublin-Dublin-IE-IT-Alliance-Group/11/192391675
Posted today,Allen Recruitment Consulting,Automation Test Engineer,https://job-openings.monster.ie/Automation-Test-Engineer-Dublin-West-Dublin-IE-Allen-Recruitment-Consulting/11/191229801
Posted today,Accenture,Privacy Analyst,https://job-openings.monster.ie/Privacy-Analyst-Dublin-Dublin-IE-Accenture/11/195164219
Date,Company,Role,URL
Posted today,Solas IT,Automation Engineer,https://job-openings.monster.ie/Automation-Engineer-Dublin-Dublin-Ireland-Solas-IT/11/195159636
Posted today,PROTENTIAL RESOURCES,Desktop Support Engineer,https://job-openings.monster.ie/Desktop-Support-Engineer-Santry-Dublin-Ireland-PROTENTIAL-RESOURCES/11/195159322
Posted today,IT Alliance Group,Service Desk Team Lead,https://job-openings.monster.ie/Service-Desk-Team-Lead-Dublin-Dublin-IE-IT-Alliance-Group/11/193234050
Posted today,Osborne,IT Internal Audit Specialist – Dublin City Centre,https://job-openings.monster.ie/IT-Internal-Audit-Specialist-–-Dublin-City-Centre-Dublin-City-Centre-Dublin-IE-Osborne/11/192169909
Posted today,Brightwater Recruitment Specialists,Corporate Tax Partner Designate,https://job-openings.monster.ie/Corporate-Tax-Partner-Designate-Dublin-2-Dublin-IE-Brightwater-Recruitment-Specialists/11/183837695

Upvotes: 0

Views: 101

Answers (1)

Colin Ricardo

Reputation: 17249

Because you're calling find_data(soup) max_pages times, you're also doing the following multiple times:

 df = pd.DataFrame(l)
 df = df[['Date', 'Company', 'Role', 'URL']]
 df = df.dropna()
 df = df.sort_values(by=['Date'], ascending=False)
 df.to_csv("csv_files/pandas_data.csv", mode='a', header=True, index=False)

Try changing the find_data() function to take in a list, fill it, and return it. Then, after the loop has finished, you can build the DataFrame once, add the header, and write everything to the file with a single to_csv() call.

For example:

def find_data(soup, l):
    for div in soup.find_all('div', class_ = 'js_result_container'):
        d = {}
        try:
            d["Company"] = div.find('div', class_= 'company').find('a').find('span').get_text()
            d["Date"] = div.find('div', {'class':['job-specs-date', 'job-specs-date']}).find('p').find('time').get_text()
            pholder = div.find('div', class_= 'jobTitle').find('h2').find('a')
            d["URL"] = pholder['href']
            d["Role"] = pholder.get_text().strip()
            l.append(d)
        except:
            pass
    return l

if __name__ == '__main__':

    f = open("csv_files/pandas_data.csv", "w")
    f.truncate()
    f.close()

    query = input('Enter role to search: ')
    max_pages = int(input('Enter number of pages to search: '))
    l = []
    for i in range(max_pages):
        page = 'https://www.monster.ie/jobs/search/?q='+query+'&where=Dublin__2C-Dublin&sort=dt.rv.di&page=' + str(i+1)
        soup = getPageSource(page)
        print("Scraping Page number: " + str(i+1))
        l = find_data(soup, l)  # pass the running list in so results accumulate across pages

    df = pd.DataFrame(l)
    df = df[['Date', 'Company', 'Role', 'URL']]
    df = df.dropna()
    df = df.sort_values(by=['Date'], ascending=False)
    df.to_csv("csv_files/pandas_data.csv", mode='a', header=True, index=False)
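
Since the DataFrame is now written exactly once, you could also use mode='w' here and drop the truncate step at the top; the header will then be written exactly once automatically.

As a side note, if you'd rather keep appending one page at a time, another common pattern is to write the header only while the file is still empty. A minimal sketch (append_page is a hypothetical helper, not taken from the code above):

import os

def append_page(df, path="csv_files/pandas_data.csv"):
    # Hypothetical helper: emit the header only for the first (empty-file) write
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    df.to_csv(path, mode='a', header=write_header, index=False)

The trade-off is that each page is then sorted and written independently, so you lose the single global sort that collecting everything first gives you.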

Upvotes: 1
