python how to extract text after br?

Question

I am using Python 2.7.8 and was a bit surprised: I get all of the text except the text that comes after the last <br>. For example, my HTML page contains:




Here is a listing of C interview questions on “Variable Names” along with answers, explanations and/or solutions:


Which of the following is not a valid C variable name?

a) int number;

b) float rate;

c) int variable_count;

d) int $main;
   

 more 

Which of the following is true for variable names in C?

a) They can contain alphanumeric characters as well as special characters

b) It is not an error to declare a variable to be one of the keywords(like goto, static)

c) Variable names cannot start with a digit

d) Variable can be of any length

and my code:

import urllib2
from bs4 import BeautifulSoup, NavigableString, Tag

url = "http://www.sanfoundry.com/c-programming-questions-answers-variable-names-1/"
#url="http://www.sanfoundry.com/c-programming-questions-answers-variable-names-2/"
req = urllib2.Request(url)
resp = urllib2.urlopen(req)
htmls = resp.read()
soup = BeautifulSoup(htmls)
for br in soup.findAll('br'):
    # the node right after the <br> must be a text node
    next = br.nextSibling
    if not (next and isinstance(next,NavigableString)):
        continue
    # the text is printed only if it is followed by another <br>
    next2 = next.nextSibling
    if next2 and isinstance(next2,Tag) and next2.name == 'br':
        text = str(next).strip()
        if text:
            print "Found:", next.encode('utf-8')

Output:

Found: 
a) int number;
Found: 
b) float rate;
Found: 
c) int variable_count;

Found: 
a) They can contain alphanumeric characters as well as special characters
Found: 
b) It is not an error to declare a variable to be one of the keywords(like goto, static)
Found: 
c) Variable names cannot start with a digit

However, I am not getting the last piece of text in each question, for example:

 d) int $main
    and 
 d) Variable can be of any length

which comes after the last <br>.

And the output I am trying to get:

Found: 
a) int number;
Found: 
b) float rate;
Found: 
c) int variable_count;
Found:
d) int $main

Found: 
a) They can contain alphanumeric characters as well as special characters
Found: 
b) It is not an error to declare a variable to be one of the keywords(like goto, static)
Found: 
c) Variable names cannot start with a digit
d) Variable can be of any length

Valkyrie · Accepted Answer

You could use Requests instead of urllib2 and parse the HTML with lxml's html module.

from lxml import html
import requests

# request the page
page = requests.get("http://www.sanfoundry.com/c-programming-questions-answers-variable-names-1/")

# parse the response body into an HTML element tree
page_content = html.fromstring(page.content)

# collect all text nodes that are direct children of <p> elements
items = page_content.xpath('//p/text()')

The code above returns a list of all text nodes that sit directly inside <p> elements in the document.
With that, you can simply index into the list to print what you want.
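
For reference, the loop in the question prints a text node only when its next sibling is another <br>, so the last option in each block (which is followed by something else, or by nothing) is skipped. Below is a minimal, dependency-free sketch of one way around that, written for Python 3 with the standard library's html.parser instead of BeautifulSoup or lxml. The SAMPLE_HTML string is an assumed, simplified stand-in for the real page, and the names AfterBrText and text_after_br are made up for this illustration.

```python
from html.parser import HTMLParser

# Assumed, simplified markup; the real page may be structured differently.
SAMPLE_HTML = """
<p>Which of the following is not a valid C variable name?<br>
a) int number;<br>
b) float rate;<br>
c) int variable_count;<br>
d) int $main;</p>
"""


class AfterBrText(HTMLParser):
    """Collect every non-empty text node that directly follows a <br>."""

    def __init__(self):
        super().__init__()
        self.after_br = False  # True while the last tag seen was a <br>
        self.found = []

    def handle_starttag(self, tag, attrs):
        # A <br> arms the collector; any other opening tag disarms it.
        self.after_br = (tag == "br")

    def handle_endtag(self, tag):
        # <br/> is reported as a start tag plus an end tag, so the
        # end tag of a br must not disarm the collector.
        if tag != "br":
            self.after_br = False

    def handle_data(self, data):
        text = data.strip()
        if self.after_br and text:
            # Keep the text no matter what comes after it.
            self.found.append(text)
            self.after_br = False


def text_after_br(markup):
    """Return the stripped text that follows each <br> in markup."""
    parser = AfterBrText()
    parser.feed(markup)
    parser.close()
    return parser.found


for line in text_after_br(SAMPLE_HTML):
    print("Found:", line)
```

Because the check looks only at what precedes the text, not at what follows it, the final "d)" option is kept along with the others.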
