Extracting text headers with BeautifulSoup

Question

I am trying to extract the text headers listed in this Army Field Manual. I first converted it to an HTML file with Adobe Acrobat:

http://usacac.army.mil/sites/default/files/misc/doctrine/CDG/cdg_resources/manuals/fm/fm7_15.pdf

from bs4 import BeautifulSoup
import pandas as pd

url = 'C:/Users/.../fm7_15.html'

with open(url, "r") as ur:
    html = ur.read()

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, "html.parser")

headers_30 = soup.find_all("p", attrs={"class": "s30"})
headers_33 = soup.find_all("p", attrs={"class": "s33"})
headers_20 = soup.find_all("p", attrs={"class": "s20"})

df30 = pd.DataFrame(headers_30,columns=["column"])
df30.to_csv('headers_30.csv', index=False)

df33 = pd.DataFrame(headers_33,columns=["column"])
df33.to_csv('headers_33.csv', index=False)

df20 = pd.DataFrame(headers_20,columns=["column"])
df20.to_csv('headers_20.csv', index=False)

Three classes make up the different headers (s30, s33, s20). I have managed to save them as CSV files, but the output also contains all the associated HTML tags. What is the best way to extract just the header text?

Dan-Dev · Accepted Answer

find_all returns tag objects, so the DataFrame ends up holding the full markup. Use a list comprehension to take the text of each element:

headers_30 = [i.text for i in soup.find_all("p", {"class":"s30"})]
headers_33 = [i.text for i in soup.find_all("p", {"class":"s33"})]
headers_20 = [i.text for i in soup.find_all("p", {"class":"s20"})]

Instead of:

headers_30 = soup.find_all("p", attrs={"class":"s30"})
headers_33 = soup.find_all("p", attrs={"class":"s33"})
headers_20 = soup.find_all("p", attrs={"class":"s20"})
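
If the extracted text still carries stray whitespace or text split across nested tags, get_text with strip=True may give cleaner results than .text. The following is a minimal sketch of that idea, not a definitive implementation: SAMPLE_HTML is a made-up stand-in for the converted manual, and extract_headers is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the Acrobat-exported fm7_15.html.
SAMPLE_HTML = """
<html><body>
<p class="s30">  Chapter 1 <b>Introduction</b> </p>
<p class="s33">Section I</p>
<p class="s20">Task 1.1</p>
<p class="s1">Body text that is not a header.</p>
<p class="s30">Chapter 2</p>
</body></html>
"""


def extract_headers(html, classes=("s30", "s33", "s20")):
    """Return {class_name: [header text, ...]} with the tags removed."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        # get_text(" ", strip=True) strips each text fragment and joins
        # the non-empty ones with a single space.
        cls: [p.get_text(" ", strip=True) for p in soup.find_all("p", class_=cls)]
        for cls in classes
    }


headers = extract_headers(SAMPLE_HTML)
print(headers["s30"])
```

Each list in the result holds plain strings, so it can be passed to pd.DataFrame(..., columns=["column"]) and written with to_csv exactly as in the question.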
