Reputation: 81
import re
import time

import requests
from bs4 import BeautifulSoup

URL = "https://bitcointalk.org/index.php?board=1.0"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
numberOfPages = 0
currentPage = 0
counter = 1
# The second-to-last "navPages" link holds the number of the last page
for blabla in soup.find_all("a", attrs={"class": "navPages"})[-2]:
    numberOfPages = int(blabla.string)
print("Pages count: " + str(numberOfPages))
for i in range(0, numberOfPages):
    URLX = "https://bitcointalk.org/index.php?board=1." + str(currentPage)
    print(URLX)
    print("------------------------------------------------- Page count is: " + str(counter))
    counter += 1
    currentPage += 20
    page1 = requests.get(URLX)
    soup1 = BeautifulSoup(page1.content, 'html.parser')
    time.sleep(1.0)
    # Topic titles are links inside <span id="msg..."> elements
    for random in soup1.find_all("span", attrs={"id": re.compile("^msg")}):
        for b in random.find_all('a', href=True):
            print(b.string)
I'm trying to go through all the pages on the "Bitcoin discussion board" and print the topic names from each page. It works, but for some reason it keeps printing the topic names twice while going through the different pages. For example:
URL (first page): https://bitcointalk.org/index.php?board=1.0
would print its actual content:
ABC123
anotherTopic
Then... even when the URL changes to the second page, it would still print the same topics.
And then the same thing happens for all the other pages. Each page gets printed twice (even though the URL is changing).
Any thoughts? This is my first experience with Python and BeautifulSoup.
Upvotes: 0
Views: 124
Reputation: 9572
The links for the different pages are as follows, i.e. the offset after board=1. goes up in increments of 40:
https://bitcointalk.org/index.php?board=1.0
https://bitcointalk.org/index.php?board=1.40
https://bitcointalk.org/index.php?board=1.80
https://bitcointalk.org/index.php?board=1.120
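So it should be currentPage += 40 instead of the current currentPage += 20. With a step of 20, every second request (1.20, 1.60, ...) points into the middle of a page, and the forum appears to answer with the page that offset belongs to, which would explain why each page is printed twice.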
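To make the arithmetic concrete, here is a minimal sketch of the offset logic. The helper names page_urls and served_offset are hypothetical, and the rounding in served_offset is only an assumption about how the forum treats an offset that is not a multiple of 40, inferred from the duplicated output described in the question.

```python
BASE_URL = "https://bitcointalk.org/index.php?board=1."
TOPICS_PER_PAGE = 40  # step taken from the page URLs listed above


def page_urls(number_of_pages, step=TOPICS_PER_PAGE):
    """Return one URL per board page, advancing the offset by `step`."""
    return [BASE_URL + str(i * step) for i in range(number_of_pages)]


def served_offset(requested, per_page=TOPICS_PER_PAGE):
    """ASSUMPTION: the forum rounds an offset down to the start of its page."""
    return requested // per_page * per_page


# Stepping by 20 lands on every page twice under that assumption.
print([served_offset(i * 20) for i in range(6)])  # [0, 0, 40, 40, 80, 80]

# Stepping by 40 visits each page exactly once.
for url in page_urls(3):
    print(url)
```

In the loop from the question, iterating with for URLX in page_urls(numberOfPages): would replace the manual currentPage counter, so the step size lives in one place.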
Upvotes: 1