ItsJustMe

Reputation: 19

Web scraping with BeautifulSoup only scrapes the first page

I am trying to scrape some data from the webmd messageboard. Initially I constructed a loop to get the page numbers for each category and stored them in a dataframe. When I run the loop I do get the proper number of posts for each subcategory, but only for the first page. Any ideas what might be going wrong?

import urllib.request
import bs4 as bs
import pandas as pd

# page_links and headers are defined earlier in the script
lists2 = []
df1 = pd.DataFrame(columns=['page'], data=page_links)
for j in range(len(df1)):
    pages = (df1.page.iloc[j])
    print(pages)
    req1 = urllib.request.Request(pages, headers=headers)
    resp1 = urllib.request.urlopen(req1)
    soup1 = bs.BeautifulSoup(resp1, 'lxml')
    # collect the link of every thread listed on this page
    for body_links in soup1.find_all('div', class_="thread-detail"):
        body = body_links.a.get('href')
        lists2.append(body)

The print function shows the proper page URL, but the loop then seems to collect only the first page's post links on every iteration. Also, when I copy and paste the link for any page besides the first one into a browser, it momentarily loads the first page and then goes to the proper page number. I tried adding time.sleep(1), but it does not work. Another thing I tried was adding 'Cookie': 'PHPSESSID=notimportant' to the headers.

Upvotes: 0

Views: 307

Answers (2)

furas

Reputation: 142641

If page_links is a list with URLs like

page_links = ["http://...", "http://...", "http://...", ]

then you could use it directly:

for url in page_links:
    req1 = urllib.request.Request(url, headers=headers)

If you need it in a DataFrame, then:

for url in df1['page']:
    req1 = urllib.request.Request(url, headers=headers)

But if your current code displays all the URLs and you still get results only for the first page, then the problem is not in the DataFrame but in the HTML and in find_all.

It seems only the first page has <div class="thread-detail">, so find_all can't match it on the other pages and nothing gets appended to the list. You should check this again; for the other pages you may need different arguments in find_all. But without the URLs to these pages we can't check it and we can't help more.
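
A quick way to check this is to print how many matches find_all returns on every page; a minimal sketch, assuming the same page_links and headers as in the question:

import urllib.request
import bs4 as bs

for url in page_links:
    req = urllib.request.Request(url, headers=headers)
    soup = bs.BeautifulSoup(urllib.request.urlopen(req), 'lxml')
    # zero matches on a page means the selector doesn't fit that page's HTML
    print(url, len(soup.find_all('div', class_="thread-detail")))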

It can also be another common problem: the page may use JavaScript to add these elements, but BeautifulSoup can't run JavaScript, and then you would need [Selenium](https://selenium-python.readthedocs.io/) to control a web browser which can run JavaScript. You could turn off JavaScript in the browser and open the URLs to check whether you can still see the elements on the page and in the HTML in DevTools in Chrome/Firefox.
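
If JavaScript turns out to be the problem, a minimal Selenium sketch (assuming chromedriver is installed and page_links is the same list as before):

import bs4 as bs
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on PATH

lists2 = []
for url in page_links:
    driver.get(url)  # the browser executes the page's JavaScript
    soup = bs.BeautifulSoup(driver.page_source, 'lxml')
    for body_links in soup.find_all('div', class_="thread-detail"):
        lists2.append(body_links.a.get('href'))

driver.quit()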


As for PHPSESSID: with requests you could use a Session to get fresh cookies with PHPSESSID from the server and add them automatically to the other requests:

import requests

s = requests.Session()

# request any page once to get fresh cookies from the server
r = s.get('http://your-domain/main-page.html')

# the session now sends those cookies automatically
for url in page_links:
    r = s.get(url)
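
From there you can parse r.text with BeautifulSoup the same way as before, e.g. bs.BeautifulSoup(r.text, 'lxml').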

Upvotes: 0

Nicolas Gervais

Reputation: 36624

Replace this line:

pages = (df1.page.iloc[j])

With this:

pages = df1.iloc[j, 0]

You will now iterate through the values of your DataFrame by position (row j, column 0, which is 'page').
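
For example, a minimal sketch with hypothetical placeholder URLs:

import pandas as pd

# hypothetical stand-ins for the real page_links
page_links = ["http://example.com/board?page=1", "http://example.com/board?page=2"]
df1 = pd.DataFrame(columns=['page'], data=page_links)

for j in range(len(df1)):
    pages = df1.iloc[j, 0]  # row j, column 0 ('page')
    print(pages)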

Upvotes: 1
