adarvishian

Reputation: 175

Extracting text headers with BeautifulSoup

I am trying to extract the text headers from the Army Field Manual linked below. I first converted the PDF to an HTML file with Adobe Acrobat:

http://usacac.army.mil/sites/default/files/misc/doctrine/CDG/cdg_resources/manuals/fm/fm7_15.pdf

from requests import get
from bs4 import BeautifulSoup
import pandas as pd

url = 'C:/Users/.../fm7_15.html'

with open(url, "r") as ur:
    html = ur.read()

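# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, "html.parser")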

headers_30 = soup.find_all("p", attrs={"class": "s30"})
headers_33 = soup.find_all("p", attrs={"class": "s33"})
headers_20 = soup.find_all("p", attrs={"class": "s20"})

df30 = pd.DataFrame(headers_30,columns=["column"])
df30.to_csv('headers_30.csv', index=False)

df33 = pd.DataFrame(headers_33,columns=["column"])
df33.to_csv('headers_33.csv', index=False)

df20 = pd.DataFrame(headers_20,columns=["column"])
df20.to_csv('headers_20.csv', index=False)

Three classes make up the different headers (s30, s33, s20). I have managed to save them as CSV files, but the output also contains all the associated HTML tags. What is the best way to extract just the header text?

Upvotes: 0

Views: 2001

Answers (1)

Dan-Dev

Reputation: 9430

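find_all returns Tag objects, which is why the markup ends up in your CSV files. You can use list comprehensions to extract just the text from each element through its .text attribute: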

headers_30 = [i.text for i in soup.find_all("p", {"class":"s30"})]
headers_33 = [i.text for i in soup.find_all("p", {"class":"s33"})]
headers_20 = [i.text for i in soup.find_all("p", {"class":"s20"})]

Instead of:

headers_30 = soup.find_all("p", attrs={"class":"s30"})
headers_33 = soup.find_all("p", attrs={"class":"s33"})
headers_20 = soup.find_all("p", attrs={"class":"s20"})
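
If the extracted text still carries stray whitespace, or you would rather keep all three header levels in document order, something along these lines may help. It is only a sketch: SAMPLE_HTML, HEADER_CLASSES and extract_headers are made-up names for illustration, and the sample markup is a stand-in for the converted file, not its real contents. It relies on find_all accepting a list for class_ (matching any of the listed classes) and on get_text(" ", strip=True) to trim and join the text pieces.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the converted HTML file
SAMPLE_HTML = """
<html><body>
<p class="s20">First header</p>
<p class="s30"> Second <b>nested</b> header </p>
<p class="s1">Body text that is not a header.</p>
<p class="s33">Third header</p>
</body></html>
"""

# The three header classes named in the question
HEADER_CLASSES = ["s30", "s33", "s20"]


def extract_headers(html, classes=HEADER_CLASSES):
    """Return one dict per header <p>, in document order, with tags stripped."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # A list passed to class_ matches a tag carrying any of those classes
    for tag in soup.find_all("p", class_=classes):
        # Strip each text fragment and join the fragments with single spaces
        text = tag.get_text(" ", strip=True)
        rows.append({"class": tag["class"][0], "text": text})
    return rows


headers = extract_headers(SAMPLE_HTML)
for row in headers:
    print(row["class"], row["text"])
```

The resulting list of dicts can be passed straight to pd.DataFrame(headers) and written out with to_csv, which gives a single CSV with a class column instead of three separate files.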

Upvotes: 2
