Yun Tae Hwang

Reputation: 1471

How to count how many pages on a web page using python

I am trying to write a program (for practice) that counts the chapters and verses in each book of the Bible. For example, if I ask about book 1, it should give me the total number of chapters or verses in that book. If I only want the number of verses in chapter 4 of book 2, it should give me the count for that chapter alone. The same goes for chapters.

My idea was to look for the font class tk4l (a font class used only for the body text) on this web site:

http://www.holybible.or.kr/B_NIV/cgi/bibleftxt.php?VR=NIV&VL=1&CN=1&CV=99

If the font class is found, add 1 to my chapter count; if it is not found, move on to the next book (book += 1) and do the same thing.

I was going to use:

import requests
from bs4 import BeautifulSoup
import operator



def read_chapters(max_books, max_chapters):
    book=1
    chapter=1
    while chapter <= max_chapters:
         url = 'http://www.holybible.or.kr/B_NIV/cgi/bibleftxt.php?VR=NIV&VL={}&CN={}&CV=99'.format(book, chapter)
         source_code = requests.get(url).text
         soup = BeautifulSoup(source_code, "html.parser")
         for bible_text in soup.findAll('font', {'class': 'tk4l'}):

and so on...

My questions are:

1) How can I print the chapter count? 2) I have no idea how to count the number of verses.

I have just started studying Python. Please help me with this. T.T

Upvotes: 1

Views: 4515

Answers (1)

Tristan

Reputation: 1576

First you need to get the HTML content of that page. I recommend using the package requests.

import requests
page = requests.get("http://www.holybible.or.kr/B_NIV/cgi/bibleftxt.php?VR=NIV&VL=1&CN=1&CV=99")

To expand on your idea of counting occurrences of the tk4l font class, you can search for this substring in the web page's content:

# page.text is the decoded HTML; str(page.content) would give the repr of a bytes object
verses = page.text.count("font class=tk4l")
print(verses)
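
The substring count only works if the tag is written exactly as `font class=tk4l`. As a minimal sketch, assuming the attribute might also appear with quotes, a regular expression can tolerate both spellings. The name `count_verses` and the sample HTML are made up for illustration and are not taken from the site.

```python
import re

# Matches <font class=tk4l>, <font class="tk4l"> and <font class='tk4l'>.
VERSE_TAG = re.compile(r"<font\s+class=[\"']?tk4l[\"']?[\s>]", re.IGNORECASE)


def count_verses(html: str) -> int:
    """Count the verse markers in the HTML of one chapter page."""
    return len(VERSE_TAG.findall(html))


# Hypothetical fragment, invented for this example (not copied from the site).
sample = (
    "<ol><li><font class=tk4l>In the beginning ...</font>"
    '<li><font class="tk4l">Now the earth ...</font></ol>'
)
print(count_verses(sample))
```

With the `page` object from above, the call would be `count_verses(page.text)`.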

To get the number of chapters, you could proceed in a similar manner with string operations, provided you can identify a unique attribute in the way the chapters are listed.

EDIT: To expand on the number of chapters: this is a little tricky, since the only attribute I immediately notice is that the chapters appear in the pagination. Without using any further packages, you could use string operations to walk through the pagination and find the maximum. The approach is somewhat clumsy, but it should identify the maximum chapter number on the page you mentioned.

import requests

page = requests.get("http://www.holybible.or.kr/B_NIV/cgi/bibleftxt.php?VR=NIV&VL=1&CN=1&CV=99")
# Keep only the pagination strip between the left-arrow and right-arrow images.
pagination = page.text.split("http://www.holybible.or.kr/images/l_arrow.gif")[1].split("http://www.holybible.or.kr/images/arrow.gif")[0]
currmax = 0
for i in range(len(pagination)):
    if pagination[i] == ">":
        # One-digit chapter number, e.g. ">7</a>&"
        if pagination[i+2:i+7] == "</a>&" and pagination[i+1].isdigit():
            currmax = max(currmax, int(pagination[i+1]))
        # Two-digit chapter number, e.g. ">12</a>&"
        if pagination[i+3:i+8] == "</a>&" and pagination[i+1:i+3].isdigit():
            currmax = max(currmax, int(pagination[i+1:i+3]))
print(currmax)

EDIT 2: With regular expressions, the same task can be accomplished in a more compact manner:

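import re
import requests

page = requests.get("http://www.holybible.or.kr/B_NIV/cgi/bibleftxt.php?VR=NIV&VL=1&CN=1&CV=99")
contents = page.text
# Page numbers in the pagination look like ">12</a>&nbsp;" or ">12</b>&nbsp;".
# default=0 avoids a ValueError when nothing matches.
x = max((int(i) for i in re.findall(r'>(\d+)</[ab]>&nbsp;', contents)), default=0)
print(x)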
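
To answer both parts of the question at once, the two counts can be combined into a per-book summary. The following is only a sketch under stated assumptions: the pagination on a book's first chapter page lists every chapter of that book, and each verse is marked by `font class=tk4l`. The names `chapter_count`, `verse_count`, `book_summary`, `fake_fetch` and the `PAGES` data are hypothetical, introduced here for illustration. The page loader is passed in as `fetch` so the logic can be tried without network access.

```python
import re
from typing import Callable, Dict

URL = "http://www.holybible.or.kr/B_NIV/cgi/bibleftxt.php?VR=NIV&VL={}&CN={}&CV=99"


def chapter_count(html: str) -> int:
    """Largest page number found in the pagination (same regex as above)."""
    return max((int(n) for n in re.findall(r">(\d+)</[ab]>&nbsp;", html)), default=0)


def verse_count(html: str) -> int:
    """Number of tk4l font tags, as in the substring count above."""
    return html.count("font class=tk4l")


def book_summary(book: int, fetch: Callable[[str], str]) -> Dict[int, int]:
    """Return {chapter: number of verses} for one book.

    fetch is any function that maps a URL to the page's HTML text.
    """
    first = fetch(URL.format(book, 1))
    summary = {1: verse_count(first)}
    # Assumption: the first page's pagination shows the last chapter number.
    for chapter in range(2, chapter_count(first) + 1):
        summary[chapter] = verse_count(fetch(URL.format(book, chapter)))
    return summary


# Made-up pages standing in for the real site: a book with two chapters.
PAGES = {
    1: "<b>1</b>&nbsp;<a href=#>2</a>&nbsp;"
       "<font class=tk4l>a</font><font class=tk4l>b</font>",
    2: "<a href=#>1</a>&nbsp;<b>2</b>&nbsp;<font class=tk4l>c</font>",
}


def fake_fetch(url: str) -> str:
    """Look up the made-up page for the chapter number in the URL."""
    chapter = int(re.search(r"CN=(\d+)", url).group(1))
    return PAGES[chapter]


summary = book_summary(1, fake_fetch)
print(summary)                # verses per chapter
print(len(summary))           # total chapters in the book
print(sum(summary.values()))  # total verses in the book
```

Against the real site, the loader could be `lambda url: requests.get(url).text`. The number of verses in chapter 4 of book 2 would then be `book_summary(2, loader)[4]`, where `loader` is that lambda.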

Upvotes: 2
