D. Shaat

Reputation: 25

How to save results of web scraping Python

I am trying to scrape LexisNexis. I would like to retrieve the headline, source, and date of each news story. Below is the code I wrote to run after using Selenium to perform the search for me. I am having trouble saving the data into a CSV file: I keep getting encoding errors, and when I am not getting encoding errors, the data comes through with MANY spaces and stray characters like \t\t\t\t and \n.

Here is an example of what I retrieve:

["\n \t\t\t\tNetworks Continue Hammering Indiana for Sparking a 'Firestorm' Over Religious Freedom Law\n \t\t\t", "\n \t\t\t\tAll Three Networks Pile on Indiana's 'Controversial' Law\n \t\t\t", "\n \t\t\t\tABC Continues Obsessively Bashing 'Controversial' 'Religious Freedom' Bill\n \t\t\t", "\n \t\t\t\tABC, NBC Rush to Paint Trump as a 'Moderate,' 'Trump 2.0'\n \t\t\t", '\n \t\t\t\tCBS Hits the Panic Button, Rails Against Religious Freedom Bills in Georgia, North Carolina\n \t\t\t', '\n \t\t\t\tJihad Report - October 7, 2016\n \t\t\t', '\n \t\t\t\tEducation News Roundup: May 2, 2016\n \t\t\t', '\n \t\t\t\tNBC CBS Keep Up Attack on Religious Freedom Laws\n \t\t\t', '\n \t\t\t\tNBC Slams Indiana Religious Freedom Law...Then Starts Week-Long Series on Faith\n \t\t\t', "\n \t\t\t\tNetworks Again Bash Indiana for Causing 'National Outcry' and 'Uproar' Over Religious Freedom Law\n \t\t\t"]

This is the case for headlines, dates, and sources. I am not sure what I am doing wrong here.

scd = browser.page_source
soup = BeautifulSoup(scd, "lxml")


headlines = []
for headline in soup.findAll('a', attrs={"data-action": "title"}):
    head_line = headline.get_text()
    #head_line.strip('a>, <a data-action="title" href="#">')
    #head_line.encode('utf-8')
    Headlines = head_line.encode()
    headlines.append(head_line)

sources = []
for sources in soup.findAll('a', attrs={"class": "rightpanefiltercontent notranslate", "href": "#"}):
    source_only = sources.get_text()
    source_only.encode('utf-8')
    sources.append(source_only)
Sources = sources.encode()

dates = []
for dates in soup.findAll('a', attrs={"class": "rightpanefiltercontent"}):
    date_only = dates.get_text()
    date_only.strip('<a class="rightpanefiltercontent" href="#">')
    date_only.encode()
    dates.append(date_only)
Dates = dates.encode()

news = [Headlines, Sources, Dates]


result = "/Users/danashaat/Desktop/Tornadoes/IV Search News Results/data.csv"
with open(result, 'w') as result:
    newswriter = csv.writer(result, dialect='excel')
    newswriter.writerow(News)
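For comparison, here is a minimal sketch (with placeholder data, not the scraped variables) of the row-wise pattern this code seems to be aiming for: one writerow per record, zipping the three parallel lists, with an explicit UTF-8 encoding to sidestep the encoding errors:

```python
import csv

# Hypothetical sketch with placeholder data: write one CSV row per record
# by zipping the three parallel lists, instead of one giant row per list.
headlines = ["Headline A", "Headline B"]
sources = ["Source A", "Source B"]
dates = ["2015-03-30", "2016-10-07"]

# newline="" avoids blank lines on Windows; encoding="utf-8" avoids encode errors.
with open("data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, dialect="excel")
    writer.writerow(["headline", "source", "date"])   # header row
    writer.writerows(zip(headlines, sources, dates))  # one row per record
```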

Also, here is the result when I find the headlines:

[<a data-action="title" href="#"> Networks Continue Hammering Indiana for Sparking a 'Firestorm' Over Religious Freedom Law </a>, <a data-action="title" href="#"> All Three Networks Pile on Indiana's 'Controversial' Law </a>, <a data-action="title" href="#"> ABC Continues Obsessively Bashing 'Controversial' 'Religious Freedom' Bill </a>, <a data-action="title" href="#"> ABC, NBC Rush to Paint Trump as a 'Moderate,' 'Trump 2.0' </a>, <a data-action="title" href="#"> CBS Hits the Panic Button, Rails Against Religious Freedom Bills in Georgia, North Carolina </a>, <a data-action="title" href="#"> Jihad Report - October 7, 2016 </a>, <a data-action="title" href="#"> Education News Roundup: May 2, 2016 </a>, <a data-action="title" href="#"> NBC CBS Keep Up Attack on Religious Freedom Laws </a>, <a data-action="title" href="#"> NBC Slams Indiana Religious Freedom Law...Then Starts Week-Long Series on Faith </a>, <a data-action="title" href="#"> Networks Again Bash Indiana for Causing 'National Outcry' and 'Uproar' Over Religious Freedom Law </a>]

I've been trying to figure this out for HOURS, so any help will be much appreciated.

Upvotes: 1

Views: 376

Answers (1)

Ajax1234

Reputation: 71471

You can anchor your element search to the divs with class "item":

from selenium import webdriver
from bs4 import BeautifulSoup as soup
import csv
import re

d = webdriver.Chrome()
d.get('https://www.lexisnexis.com/en-us/home.page')
results = [[(lambda x: x['href'] if i == 'a' else getattr(x, 'text', None))(c.find(i))
            for i in ['a', 'time', 'h5', 'p']]
           for c in soup(d.page_source, 'html.parser').find_all('div', {'class': 'item'})]
with open('lexisNexis.csv', 'w') as f:
  write = csv.writer(f)
  write.writerows([['source', 'timestamp', 'tags', 'headline'],
                   *[re.findall(r'(?<=//www\.)\w+(?=\.com)', a) + b
                     for a, *b in results if all([a, *b])]])

Output:

source,timestamp,tags,headline
law360,04 Sep 2018,Labor & Employment Law,11th Circ. Revives Claim In Ex-Aaron's Worker FMLA Suit
law360,04 Sep 2018,Workers' Compensation,Back To School: Widener's Rod Smolla Talks Free Speech
law360,04 Sep 2018,Tax Law,Ex-Sen. Kyl Chosen To Take Over McCain's Senate Seat
law360,04 Sep 2018,Energy,Mass. Top Court Says Emission Caps Apply To Electric Cos.
lexisnexis,04 Sep 2018,Immigration Law,Suspension of Premium Processing: Another Attack On the H-1B Program (Cyrus Mehta)
law360,04 Sep 2018,Real Estate Law,Privilege Waived For Some Emails In NJ Real Estate Row
law360,04 Sep 2018,Banking & Finance,Cos. Caught Between Iran Sanctions And EU Blocking Statute
law360,04 Sep 2018,Mergers & Acquisitions,Former Paper Co. Tax VP Sues For Severance Pay After Merger
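The source column comes from the lookaround regex applied to each href; a quick self-contained check against a sample URL:

```python
import re

# Check of the lookaround pattern used above: grab the word between
# "//www." and ".com" in a sample href.
href = "https://www.law360.com/articles/12345"
print(re.findall(r"(?<=//www\.)\w+(?=\.com)", href))  # ['law360']
```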

Upvotes: 1
