Reputation: 13
I'm a bit new to Python. I've been trying to scrape a page using Beautiful Soup and output the results in JSON format with SimpleJson.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import json as simplejson
webpages = (
    "page1.html",
    "page2.html",
    "page3.html"
)
my_dict = {}
for webpage in webpages:
    soup = BeautifulSoup(open(webpage))
    title = soup.title.string
    body = soup.find(id="bodyText")
    my_dict['title'] = title
    my_dict['body'] = str(body)
print simplejson.dumps(my_dict, indent=4)
I'm only getting the results of the last page. Can someone tell me where I'm going wrong?
Upvotes: 1
Views: 1854
Reputation: 6387
results = []  # you need a list to collect all dictionaries
for webpage in webpages:
    soup = BeautifulSoup(open(webpage))
    this_dict = {}
    this_dict['title'] = soup.title.string
    # str() is needed here: a bs4 Tag object is not JSON serializable
    this_dict['body'] = str(soup.find(id="bodyText"))
    results.append(this_dict)
print simplejson.dumps(results, indent=4)
I have a feeling, however, that what you want is a dictionary whose keys are the page titles and whose values are the bodies:
results = {}
for webpage in webpages:
    soup = BeautifulSoup(open(webpage))
    # str() converts the Tag so that it can be dumped as JSON
    results[soup.title.string] = str(soup.find(id='bodyText'))
print simplejson.dumps(results, indent=4)
Or using comprehensions:
soups = (BeautifulSoup(open(webpage)) for webpage in webpages)
results = {soup.title.string: str(soup.find(id='bodyText')) for soup in soups}
print simplejson.dumps(results, indent=4)
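To compare the two result shapes without touching any files, here is a minimal sketch in Python 3 syntax. The `parsed_pages` list is a hypothetical stand-in for the (title, body) pairs that Beautiful Soup would extract; it is not data from the question.

```python
import json

# Hypothetical stand-in for (title, body) pairs extracted from each page.
parsed_pages = [
    ("Page one", "<div id='bodyText'>first</div>"),
    ("Page two", "<div id='bodyText'>second</div>"),
    ("Page one", "<div id='bodyText'>third</div>"),  # duplicate title on purpose
]

# Shape 1: a list of dictionaries keeps every page, duplicates included.
as_list = [{"title": title, "body": body} for title, body in parsed_pages]

# Shape 2: a dictionary keyed by title keeps only the last body per title.
as_dict = {title: body for title, body in parsed_pages}

print(json.dumps(as_list, indent=4))
print(json.dumps(as_dict, indent=4))
```

The title-keyed dictionary is only a safe choice when every page has a unique title; otherwise the list of dictionaries loses nothing.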
PS. Please forgive my mistakes, if there are any; I am writing from a phone...
Upvotes: 1
Reputation: 8709
Since you are overwriting title and body in each iteration, there are two ways of handling it:
Create a list of all dictionaries as:
all_dict = []
for webpage in webpages:
    soup = BeautifulSoup(open(webpage))
    title = soup.title.string
    body = soup.find(id="bodyText")
    my_dict = {}  # a new dictionary for each page
    my_dict['title'] = title
    my_dict['body'] = str(body)
    all_dict.append(my_dict)
for my_dict in all_dict:
    print simplejson.dumps(my_dict, indent=4)
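The list approach depends on building a new dictionary in each iteration. If one dictionary is reused, the list ends up holding several references to the same object, and every entry shows the last page. A small sketch in Python 3 syntax, with a hypothetical `titles` list in place of the parsed pages:

```python
import json

titles = ["Page one", "Page two", "Page three"]  # hypothetical sample data

# Pitfall: one dictionary reused for every page.
shared = {}
aliased = []
for title in titles:
    shared["title"] = title
    aliased.append(shared)  # appends another reference to the same object

# Fix: a new dictionary per iteration.
separate = []
for title in titles:
    page = {"title": title}
    separate.append(page)

print(json.dumps(aliased))
print(json.dumps(separate))
```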
Use the iteration number from enumerate() to create distinct key names like title1, body1, title2, body2, etc. This way each title and body is preserved in the same dictionary:
for i, webpage in enumerate(webpages, 1):  # count from 1: title1, body1, ...
    soup = BeautifulSoup(open(webpage))
    title = soup.title.string
    body = soup.find(id="bodyText")
    my_dict['title' + str(i)] = title
    my_dict['body' + str(i)] = str(body)
print simplejson.dumps(my_dict, indent=4)
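Note that enumerate() counts from 0 unless it is given a start value, so the suffixes only begin at 1 when `start=1` is passed. A minimal sketch in Python 3 syntax, where `parsed_pages` is a hypothetical stand-in for the parsed titles and bodies:

```python
import json

# Hypothetical (title, body) pairs standing in for the parsed pages.
parsed_pages = [("Page one", "first body"), ("Page two", "second body")]

numbered = {}
# start=1 makes the suffixes run 1, 2, ... instead of 0, 1, ...
for i, (title, body) in enumerate(parsed_pages, start=1):
    numbered["title%d" % i] = title
    numbered["body%d" % i] = body

print(json.dumps(numbered, indent=4, sort_keys=True))
```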
Upvotes: 0
Reputation: 22954
Indentation can work wonders in Python: only the last line needed to be indented so that it sits inside the for loop.
from bs4 import BeautifulSoup
import json as simplejson
webpages = (
    "page1.html",
    "page2.html",
    "page3.html"
)
my_dict = {}
for webpage in webpages:
    soup = BeautifulSoup(open(webpage))
    title = soup.title.string
    body = soup.find(id="bodyText")
    my_dict['title'] = title
    my_dict['body'] = str(body)
    print simplejson.dumps(my_dict, indent=4)
Or, if you really want all the data in one dictionary, you could try this inside the loop:
my_dict['title'] = my_dict.get("title", "") + "," + title
my_dict['body'] = my_dict.get("body", "") + "," + str(body)  # str() because body is a Tag
So the code may look like:
from bs4 import BeautifulSoup
import json as simplejson
webpages = (
    "page1.html",
    "page2.html",
    "page3.html"
)
my_dict = {}
for webpage in webpages:
    soup = BeautifulSoup(open(webpage))
    title = soup.title.string
    body = soup.find(id="bodyText")
    # setdefault() returns the stored list; append() itself returns None,
    # so its result must not be assigned back into the dictionary
    my_dict.setdefault("title", []).append(title)
    my_dict.setdefault("body", []).append(str(body))
print simplejson.dumps(my_dict, indent=4)
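When collecting values in lists like this, it helps to remember that list.append() changes the list in place and returns None, so assigning its result to a key stores None. One way around it is dict.setdefault(), sketched below in Python 3 syntax with hypothetical `parsed_pages` data standing in for the scraped titles and bodies:

```python
import json

# Hypothetical (title, body) pairs standing in for the parsed pages.
parsed_pages = [("Page one", "first body"), ("Page two", "second body")]

# list.append() mutates the list and returns None.
append_result = [].append("x")

combined = {}
for title, body in parsed_pages:
    # setdefault() returns the list stored under the key,
    # creating an empty one the first time the key is seen.
    combined.setdefault("title", []).append(title)
    combined.setdefault("body", []).append(body)

print(append_result)
print(json.dumps(combined, indent=4))
```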
Upvotes: -2
Reputation: 6561
You are overwriting your dictionary each time through the loop. Tab the print statement over so it is included in the for loop:
for webpage in webpages:
    soup = BeautifulSoup(open(webpage))
    title = soup.title.string
    body = soup.find(id="bodyText")
    my_dict['title'] = title
    my_dict['body'] = str(body)
    print simplejson.dumps(my_dict, indent=4)
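The effect of the print position can be seen without any HTML files. In this rough sketch (Python 3 syntax, with a hypothetical `titles` list in place of the scraped pages), the dumps taken inside the loop capture every page, while the one taken after the loop only sees the last value written to the key:

```python
import json

titles = ["Page one", "Page two", "Page three"]  # hypothetical sample data

my_dict = {}
printed_inside = []
for title in titles:
    my_dict["title"] = title  # same key, so the previous value is replaced
    printed_inside.append(json.dumps(my_dict))  # what an in-loop print shows

printed_after = json.dumps(my_dict)  # what a print after the loop shows

print(printed_inside)
print(printed_after)
```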
Upvotes: 3