Extracting text between tags using BeautifulSoup

Question

I am trying to use BeautifulSoup to extract text from a series of webpages that all follow a similar format. The HTML for the text I wish to extract is below. The actual link is here: http://www.p2016.org/ads1/bushad120215.html.

[Music] TEXT: The Medal of Honor is the highest award for valor in action against an enemy force

Col. Jay Vargas: We were completely surrounded, 116 Marines locking heads with 15,000 North Vietnamese. Forty hours with no sleep, fighting hand to hand.

I'd like to iterate through all the HTML files in my folder and extract the text between all the markers. Here are the relevant sections of my code:

import codecs
from bs4 import BeautifulSoup

text = []

for page in pages:
    html_doc = codecs.open(page, 'r')
    soup = BeautifulSoup(html_doc, 'html.parser')
    for t in soup.find_all(''):
        t = t.get_text()
        text.append(t.encode('utf-8'))
        print t

However, nothing is coming up. Apologies for the noob question and thanks in advance for your help.


Answers (1)
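
One possible approach, as a minimal sketch rather than a definitive fix. As the call appears in the question, find_all('') is given an empty tag name, which most likely matches no elements, so the loop body never runs. Passing the real tag name should return the elements. The sketch below is written for Python 3 and assumes the text sits in <p> tags; that tag name, the TAG_NAME constant, and the two helper functions are illustrative assumptions, not taken from the page source.

```python
from pathlib import Path

from bs4 import BeautifulSoup

# Assumption: the text of interest sits in <p> tags. Replace "p" with the
# tag name that actually wraps the text in the page source.
TAG_NAME = "p"


def extract_tag_text(html, tag_name=TAG_NAME):
    """Return the stripped text of every <tag_name> element in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for tag in soup.find_all(tag_name):
        content = tag.get_text().strip()
        if content:  # skip elements that contain only whitespace
            texts.append(content)
    return texts


def extract_from_folder(folder, tag_name=TAG_NAME):
    """Collect the text from every .html file directly inside folder."""
    texts = []
    for path in sorted(Path(folder).glob("*.html")):
        # errors="replace" keeps going if a file is not valid UTF-8
        html = path.read_text(encoding="utf-8", errors="replace")
        texts.extend(extract_tag_text(html, tag_name))
    return texts
```

Calling extract_from_folder("ads") would then return one list of strings for a hypothetical folder named "ads". In Python 3 the strings can be printed or written out directly, so the encode('utf-8') step from the question is not needed.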
