Mark K

Reputation: 9338

Text extraction from multiple webpages (URLs in a text file)

(Environment: Python 2.7 + BeautifulSoup 4.3.2)

I am using Python and BeautifulSoup to extract the news titles from this webpage and its subsequent pages. I don’t know how to make it follow the next pages automatically, so I put all the URLs in a text file, web list.txt.

http://www.legaldaily.com.cn/locality/node_32245.htm
http://www.legaldaily.com.cn/locality/node_32245_2.htm
http://www.legaldaily.com.cn/locality/node_32245_3.htm

. . .

Here’s what I worked out so far:

from bs4 import BeautifulSoup
import re
import urllib2
import urllib


list_open = open("web list.txt")
read_list = list_open.read()
line_in_list = read_list.split("\n")


i = 0
while i < len(line_in_list):
    soup = BeautifulSoup(urllib2.urlopen(url).read(), 'html')
    news_list = soup.find_all(attrs={'class': "f14 blue001"})
    for news in news_list:
        print news.getText()
i + = 1

Running it produces an error message saying there is invalid syntax.

What went wrong?

Upvotes: 1

Views: 1534

Answers (1)

Kevin

Reputation: 76184

i + = 1

This is invalid syntax.

If you want to use the augmented assignment operator +=, you can't have a space between the plus and the equals.

i += 1
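
Because this is a syntax error, Python rejects the whole file before running any of it, so none of the earlier lines execute either. A small sketch that checks this with the built-in compile(); is_valid_python is a hypothetical helper used only for illustration:

```python
def is_valid_python(source):
    """Return True if `source` compiles, False if it has a syntax error."""
    try:
        # compile() only parses the code; it does not run it.
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


spaced = "i = 0\ni + = 1"   # the form used in the question
joined = "i = 0\ni += 1"    # the corrected form

print(is_valid_python(spaced))
print(is_valid_python(joined))
```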

The next error you'll get is:

NameError: name 'url' is not defined

This happens because url is never defined before it is used in the soup = line. You can fix this by iterating directly over the URL list instead of incrementing i at all.

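# Loop over the URLs themselves; no counter is needed.
for url in line_in_list:
    soup = BeautifulSoup(urllib2.urlopen(url).read(), 'html')
    # Each news title is in an element with class "f14 blue001".
    news_list = soup.find_all(attrs={'class': "f14 blue001"})
    for news in news_list:
        print news.getText()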
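
One more thing to watch for: if web list.txt ends with a newline or contains blank lines, split("\n") leaves empty strings in line_in_list, and urlopen cannot open an empty URL. A minimal sketch of reading the file that skips blank lines; read_urls is a hypothetical helper name, not part of the original code:

```python
def read_urls(path):
    """Return the non-blank lines of the file at `path`, without surrounding whitespace."""
    with open(path) as list_file:
        # Iterating over the file yields one line at a time; blank lines are dropped.
        return [line.strip() for line in list_file if line.strip()]
```

With this in place, the loop can start with for url in read_urls("web list.txt"): instead of iterating over line_in_list.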

Upvotes: 1
