AJ Ze

Reputation: 57

Is there a way to erase or separate web scraping data in Python?

Hello, I'm scraping the latest news from the ABC News website. The HTML I'm scraping looks like this:

 <a href="/Politics/huckabee-draws-cheers-fundraiser-west-bank-settlement/story?id=35615831" name="lpos=widget[A_3_freeformlite_4380645_homepage]&amp;lid=link[Headline_2]">Huckabee Draws Cheers at Fundraiser for West Bank Settlement<span class="metaH_timeDay">41 minutes ago</span></a>

As you can see, there is a span tag inside the a tag, so when I scrape this with BeautifulSoup I get the text like this:

Huckabee Draws Cheers at Fundraiser for West Bank Settlement41 minutes ago

The time ends up directly after the headline, with no space. I would like the "41 minutes ago" part to be separated, so that it looks like this:

Huckabee Draws Cheers at Fundraiser for West Bank Settlement 41 minutes ago

or at least remove it.

My code looks like this:

import requests
from bs4 import BeautifulSoup

url = "http://abcnews.go.com/"

r = requests.get(url)

soup = BeautifulSoup(r.content, "lxml")

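# Headline links are named ...link[Headline_1] through ...link[Headline_9]
for x in range(1, 10):
    name = "lpos=widget[A_3_freeformlite_4380645_homepage]&lid=link[Headline_" + str(x) + "]"
    for link in soup.find_all("a", {"name": name}):
        # link.text includes the text of the nested span (the time)
        print(link.text)
        print(link.find_all("span", {"class": "metaH_timeDay"})[0].text)
        print("")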

Can someone help me?

Upvotes: 3

Views: 340

Answers (2)

Learner

Reputation: 5292

You can also use the decompose() function: run a while loop to remove every span tag from each link inside that div.

import requests
from bs4 import BeautifulSoup

url = "http://abcnews.go.com/"

r = requests.get(url)

soup = BeautifulSoup(r.content, "html.parser")

# Remove every <span> inside each headline link, then print what is left
for link in soup.select("div.h a"):
    while link.span:
        link.span.decompose()
    print(link.text)

Output:

 Obama Seeks to Remove Fear From ISIS Fight
Kerry off to Paris Again for Climate Conference
Huckabee Draws Cheers at Fundraiser for West Bank Settlement
Sanders Unveils Plan to Address Climate Change
 FBI Looking Into Blatter's Role in Bribery Case
Armed Bank Robbery Suspect Shot in Miami Had Escaped From Half-Way House
13 Injured in Attack on Government Office in Western China
Police Arrest Mother of Newborn Baby Who Was Buried Alive
Shooting Suspect's Neighbor Says He Became 'More Withdrawn'
 Justice Department to Investigate Chicago Police
Hillary Clinton Corrects Flub, Thanks to Justice Breyer
 Dashcam Must Be Working
Clinton Laughs Off Trump’s Claims That She Lacks ‘Stamina’
 Man Killed in Wisconsin Standoff Was a Hostage
 2 New York College Students Abducted, Held Hostage
Transgender Actress, Warhol Muse Holly Woodlawn Dies at 69
 Mood Dour Among Venezuelan Ruling Party Backers
Hillary Clinton Says ‘We’re Not Winning’ Fight Against ISIS
Jimmy Carter Says Latest Brain Scan Shows No Cancer
One Direction Leads the Way on Twitter's List of 2015 Tweets
Promises of Grocery Stores in Needy Areas Mostly Unfulfilled
McNabb Scores Tiebreaking Goal, Kings Beat Lightning 3-1
Grocery Chains Leave Food Deserts Barren, AP Analysis Finds
Medical Examiner Shortage: Facts About Death Investigations
Roethlisberger Throws 4 TD Passes, Steelers Roll Colts 45-10
Grocery Chains Leave Food Deserts Barren, AP Analysis Finds
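
The same idea can be tried offline on the single link from the question. This is only a minimal sketch: the HTML string is copied (shortened) from the question, headline_without_spans is a hypothetical helper name, and the live page may be structured differently.

```python
from bs4 import BeautifulSoup

# Sample markup taken from the question; no network request is made.
html = (
    '<a href="/Politics/huckabee-draws-cheers-fundraiser-west-bank-settlement/story?id=35615831">'
    'Huckabee Draws Cheers at Fundraiser for West Bank Settlement'
    '<span class="metaH_timeDay">41 minutes ago</span></a>'
)


def headline_without_spans(link):
    """Hypothetical helper: drop every <span> inside `link`, return the remaining text."""
    # link.span is None once no <span> is left, which ends the loop.
    while link.span:
        link.span.decompose()
    return link.get_text().strip()


soup = BeautifulSoup(html, "html.parser")
headline = headline_without_spans(soup.a)
print(headline)
```

Note that decompose() modifies the parsed tree in place, so the time is gone from soup afterwards. If you still need it, read it before removing the span.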

Upvotes: 1

Remi Guan

Reputation: 22282

Let's extract it via extract():

>>> link.span.extract()     # remove the first `span` tag that we don't need
>>> time = link.span.extract()
>>> time
<span class="metaH_timeDay">2 hours, 45 minutes ago</span>
>>> link.text
' Obama Seeks to Remove Fear From ISIS Fight'
>>> time.text
'2 hours, 45 minutes ago'
>>> 
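
Put together as a script, the session above might look like the following. Treat it as a sketch under assumptions: the HTML is the single link from the question (which has only one span, so there is no first span to discard), and split_headline_and_time is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Sample markup taken from the question; no network request is made.
html = (
    '<a href="/Politics/huckabee-draws-cheers-fundraiser-west-bank-settlement/story?id=35615831">'
    'Huckabee Draws Cheers at Fundraiser for West Bank Settlement'
    '<span class="metaH_timeDay">41 minutes ago</span></a>'
)


def split_headline_and_time(link):
    """Hypothetical helper: return (headline, time) for one headline link."""
    time_tag = link.find("span", class_="metaH_timeDay")
    time_text = ""
    if time_tag is not None:
        # extract() removes the tag from the tree and returns it,
        # so its text is still available afterwards.
        time_text = time_tag.extract().get_text().strip()
    return link.get_text().strip(), time_text


soup = BeautifulSoup(html, "html.parser")
headline, when = split_headline_and_time(soup.a)
print(headline + " " + when)
```

If only the missing space matters and nothing has to be removed, calling link.get_text(" ") on the untouched link should also work, since it joins the pieces of text with the given separator.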

Upvotes: 1
