Reputation: 9338
(Environment: Python 2.7 + BeautifulSoup 4.3.2)
I am using Python and BeautifulSoup to collect the news titles from this webpage and its subsequent pages. I don't know how to make it follow the next pages automatically, so I put all the URLs in a text file, web list.txt:
http://www.legaldaily.com.cn/locality/node_32245.htm
http://www.legaldaily.com.cn/locality/node_32245_2.htm
http://www.legaldaily.com.cn/locality/node_32245_3.htm
. . .
Here’s what I worked out so far:
from bs4 import BeautifulSoup
import re
import urllib2
import urllib
list_open = open("web list.txt")
read_list = list_open.read()
line_in_list = read_list.split("\n")
i = 0
while i < len(line_in_list):
    soup = BeautifulSoup(urllib2.urlopen(url).read(), 'html')
    news_list = soup.find_all(attrs={'class': "f14 blue001"})
    for news in news_list:
        print news.getText()
    i + = 1
Running it produces an error message saying there is invalid syntax.
What went wrong?
Upvotes: 1
Views: 1534
Reputation: 76184
i + = 1
This is invalid syntax.
If you want to use the augmented assignment operator +=, you can't have a space between the plus and the equals.
i += 1
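As a quick way to confirm this without running the whole script, here is a minimal sketch that asks Python whether a snippet parses. The helper name compiles is made up for this illustration; it only wraps the built-in compile() and is not part of the original script.

```python
def compiles(source):
    """Return True if the source text parses as valid Python, else False."""
    try:
        # compile() only parses the text; it does not execute it.
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


print(compiles("i = 0\ni + = 1\n"))  # space inside the operator: prints False
print(compiles("i = 0\ni += 1\n"))   # operator written as one token: prints True
```

The parser treats + and = as two separate tokens when a space sits between them, so it never sees the += operator at all.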
The next error you'll get is:
NameError: name 'url' is not defined
That's because you never define url before you use it in the soup = line. You can fix this by iterating directly over the URL list instead of incrementing i at all:
for url in line_in_list:
    soup = BeautifulSoup(urllib2.urlopen(url).read(), 'html')
    news_list = soup.find_all(attrs={'class': "f14 blue001"})
    for news in news_list:
        print news.getText()
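One thing to watch for with the loop above: if web list.txt ends with a newline or contains blank lines, split("\n") also yields empty strings, and those are not URLs that urlopen can open. The sketch below is one possible way to build the list while skipping such lines. The function name read_urls is hypothetical, not something from the original script, and it assumes the file holds one URL per line.

```python
def read_urls(path):
    """Return the non-empty, stripped lines of a text file as a list of URLs."""
    urls = []
    # The with-statement closes the file even if reading fails partway.
    with open(path) as list_file:
        for line in list_file:
            url = line.strip()  # drop the trailing newline and stray spaces
            if url:             # skip blank lines, such as one at the end of the file
                urls.append(url)
    return urls
```

With that in place, the loop header could become for url in read_urls("web list.txt"): and the body would stay the same.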
Upvotes: 1