Reputation: 57
Hello, I'm scraping the latest news from the ABC News website. The HTML I'm scraping looks like this:
<a href="/Politics/huckabee-draws-cheers-fundraiser-west-bank-settlement/story?id=35615831" name="lpos=widget[A_3_freeformlite_4380645_homepage]&lid=link[Headline_2]">Huckabee Draws Cheers at Fundraiser for West Bank Settlement<span class="metaH_timeDay">41 minutes ago</span></a>
As you can see, there is a span tag inside the a tag, so when I scrape this with BeautifulSoup I get the text like this:
Huckabee Draws Cheers at Fundraiser for West Bank Settlement41 minutes ago
The time ends up glued to the headline. I would like the time (41 minutes ago) to be separated from it, so that it looks like this:
Huckabee Draws Cheers at Fundraiser for West Bank Settlement 41 minutes ago
Or at least remove it.
My code looks like this:
import requests
from bs4 import BeautifulSoup

url = "http://abcnews.go.com/"
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")

for x in range(1, 10):
    # Each headline link has a name attribute ending in Headline_<x>
    name = "lpos=widget[A_3_freeformlite_4380645_homepage]&lid=link[Headline_" + str(x) + "]"
    for link in soup.find_all("a", {"name": name}):
        print(link.text)  # headline with the time glued to its end
        print(link.find_all("", {"class": "metaH_timeDay"})[0].text)  # time only
        print("")
Can someone help me?
Upvotes: 3
Views: 340
Reputation: 5292
You can also use the decompose() function: run a while loop to remove every span tag from each link inside that div.
import requests
from bs4 import BeautifulSoup

url = "http://abcnews.go.com/"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

for j in soup.select("div.h a"):
    # Re-parse a copy of the link so the original soup is left untouched
    f = BeautifulSoup(str(j), "html.parser")
    # Remove every <span> (including the time stamp) from the copy
    while f.span:
        f.span.decompose()
    print(f.text)
Output:
Obama Seeks to Remove Fear From ISIS Fight
Kerry off to Paris Again for Climate Conference
Huckabee Draws Cheers at Fundraiser for West Bank Settlement
Sanders Unveils Plan to Address Climate Change
FBI Looking Into Blatter's Role in Bribery Case
Armed Bank Robbery Suspect Shot in Miami Had Escaped From Half-Way House
13 Injured in Attack on Government Office in Western China
Police Arrest Mother of Newborn Baby Who Was Buried Alive
Shooting Suspect's Neighbor Says He Became 'More Withdrawn'
Justice Department to Investigate Chicago Police
Hillary Clinton Corrects Flub, Thanks to Justice Breyer
Dashcam Must Be Working
Clinton Laughs Off Trump’s Claims That She Lacks ‘Stamina’
Man Killed in Wisconsin Standoff Was a Hostage
2 New York College Students Abducted, Held Hostage
Transgender Actress, Warhol Muse Holly Woodlawn Dies at 69
Mood Dour Among Venezuelan Ruling Party Backers
Hillary Clinton Says ‘We’re Not Winning’ Fight Against ISIS
Jimmy Carter Says Latest Brain Scan Shows No Cancer
One Direction Leads the Way on Twitter's List of 2015 Tweets
Promises of Grocery Stores in Needy Areas Mostly Unfulfilled
McNabb Scores Tiebreaking Goal, Kings Beat Lightning 3-1
Grocery Chains Leave Food Deserts Barren, AP Analysis Finds
Medical Examiner Shortage: Facts About Death Investigations
Roethlisberger Throws 4 TD Passes, Steelers Roll Colts 45-10
Grocery Chains Leave Food Deserts Barren, AP Analysis Finds
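The decompose() loop above can also be tried offline, without hitting the site. The following is only a minimal Python 3 sketch: the markup is the link from the question with its href and name attributes simplified, and `headline_without_spans` is a hypothetical helper name, not something from the original code.

```python
from bs4 import BeautifulSoup

# Sample link from the question; href/name attributes simplified (assumption)
html = (
    '<a href="/story">'
    'Huckabee Draws Cheers at Fundraiser for West Bank Settlement'
    '<span class="metaH_timeDay">41 minutes ago</span></a>'
)


def headline_without_spans(markup):
    """Return the link text after destroying every <span> inside it."""
    soup = BeautifulSoup(markup, "html.parser")
    link = soup.a
    # decompose() removes the tag and its contents from the tree for good
    while link.span:
        link.span.decompose()
    return link.get_text(strip=True)


print(headline_without_spans(html))
```

Note that decompose() throws the time stamp away, so this fits the "at least remove it" case rather than the "keep it but separated" one.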
Upvotes: 1
Reputation: 22282
Let's extract it via extract():
>>> link.span.extract() # remove the first `span` tag that we don't need
>>> time = link.span.extract()
>>> time
<span class="metaH_timeDay">2 hours, 45 minutes ago</span>
>>> link.text
' Obama Seeks to Remove Fear From ISIS Fight'
>>> time.text
'2 hours, 45 minutes ago'
>>>
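As a self-contained variant of the session above, here is a rough Python 3 sketch that keeps both pieces and joins them with a space. It assumes the simplified markup from the question (a single span, attributes shortened); `split_headline` and `joined_text` are hypothetical helper names. The second helper relies on get_text() accepting a separator, which gives the spaced form without modifying the tree.

```python
from bs4 import BeautifulSoup

# Sample link from the question; href/name attributes simplified (assumption)
html = (
    '<a href="/story">'
    'Huckabee Draws Cheers at Fundraiser for West Bank Settlement'
    '<span class="metaH_timeDay">41 minutes ago</span></a>'
)


def split_headline(markup):
    """Return (title, time) by pulling the time span out of the link."""
    link = BeautifulSoup(markup, "html.parser").a
    time_tag = link.find("span", class_="metaH_timeDay")
    # extract() detaches the tag from the tree and returns it
    when = time_tag.extract().get_text(strip=True) if time_tag else ""
    title = link.get_text(strip=True)
    return title, when


def joined_text(markup):
    """Return all text in the link, with a space between the pieces."""
    link = BeautifulSoup(markup, "html.parser").a
    return link.get_text(" ", strip=True)


title, when = split_headline(html)
print(title + " " + when)
print(joined_text(html))
```

Unlike decompose(), extract() hands the removed tag back, so the time can still be printed or stored separately.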
Upvotes: 1