Reputation: 19
I am trying to scrape some data from the webmd messageboard. Initially I constructed a loop to get the page numbers for each category and stored the in a dataframe. When I try to run the loop I do get the proper amount of post for each subcategory but only for the first page. Any ideas what might be going wrong?
lists2=[]
df1= pd.DataFrame (columns=['page'],data=page_links)
for j in range(len(df1)):
pages = (df1.page.iloc[j])
print(pages)
req1 = urllib.request.Request(pages, headers=headers)
resp1 = urllib.request.urlopen(req1)
soup1 = bs.BeautifulSoup(resp1,'lxml')
for body_links in soup1.find_all('div',class_="thread-detail"):
body= body_links.a.get('href')
lists2.append(body)
I am getting the proper page in the print function but then it seem to iterate only in the first page and getting the links of the posts. Also when I copy and paste the link for any page besides the first one it seems to momentarily load the first page and then goes to the proper number page. I tried to add time.sleep(1)
but does not work. Another thing I tried was to add {headers='Cookie': 'PHPSESSID=notimportant'}
Upvotes: 0
Views: 307
Reputation: 142641
If page_links
is list with urls like
page_links = ["http://...", "http://...", "http://...", ]
then you could use it directly
for url in page_links:
req1 = urllib.request.Request(url headers=headers)
If you need it in DataFrame then
for url in df1['page']:
req1 = urllib.request.Request(url headers=headers)
But if your current code displays all urls but you get result only for one page then problem is not in DataFrame
but in HTML
and find_all
.
It seems only first page has <div class_="thread-detail">
so it can't find it on other pages and it can't add it to list. You should check it again. For other pages you may need different arguments in find_all
. But without urls to these pages we can't check it and we can't help more.
It can be other common problem - page may use JavaScript
to add these elements but BeautifulSoup
can't run JavaScript - and then you woould need [Selenium](https://selenium-python.readthedocs.io/) to control web browser which can run JavaScript. You could turn off JavaScript in browser and open urls to check if you can see elements on page and in HTML in
DevTools` in Chrome/Firefox.
As for PHPSESSID
with requests
you could use Session
to get from server fresh cookies with PHPSESSID
and automatically add them to other reuqests
import requests
s = reqeusts.Session()
# get any page to get fresh cookies from server
r = s.get('http://your-domain/main-page.html')
# use it automatically with cookies
for url in page_links:
r = s.get(url)
Upvotes: 0
Reputation: 36624
Replace this line:
pages = (df1.page.iloc[j])
With this:
pages = (df1.page.iloc[j, 0])
You will now iterate through the values of your DataFrame
Upvotes: 1