Reputation: 175
I am trying to extract the text headers from this Army Field Manual, which I first converted to an HTML file with Adobe Acrobat:
http://usacac.army.mil/sites/default/files/misc/doctrine/CDG/cdg_resources/manuals/fm/fm7_15.pdf
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
url = 'C:/Users/.../fm7_15.html'
with open(url, "r") as ur:
    html = ur.read()
soup = BeautifulSoup(html)
headers_30 = soup.find_all("p", attrs={"class": "s30"})
headers_33 = soup.find_all("p", attrs={"class": "s33"})
headers_20 = soup.find_all("p", attrs={"class": "s20"})
df30 = pd.DataFrame(headers_30,columns=["column"])
df30.to_csv('headers_30.csv', index=False)
df33 = pd.DataFrame(headers_33,columns=["column"])
df33.to_csv('headers_33.csv', index=False)
df20 = pd.DataFrame(headers_20,columns=["column"])
df20.to_csv('headers_20.csv', index=False)
Three classes make up the different headers (s30, s33, s20). I have managed to save them as CSV files, but the output also contains all the associated HTML tags. What is the best way to extract just the header text?
Upvotes: 0
Views: 2001
Reputation: 9430
find_all returns Tag objects, so writing them straight to a CSV keeps their markup. You can use list comprehensions to pull just the text out of each element:
headers_30 = [i.text for i in soup.find_all("p", {"class":"s30"})]
headers_33 = [i.text for i in soup.find_all("p", {"class":"s33"})]
headers_20 = [i.text for i in soup.find_all("p", {"class":"s20"})]
Instead of:
headers_30 = soup.find_all("p", attrs={"class":"s30"})
headers_33 = soup.find_all("p", attrs={"class":"s33"})
headers_20 = soup.find_all("p", attrs={"class":"s20"})
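For completeness, here is a minimal end-to-end sketch of the same idea, from parsing through to CSV text. The `sample_html` string is a made-up stand-in for the Acrobat export (the real file will look different), and the `class`/`header` column names are only illustrative. It also assumes you want all three header classes in one table in document order, which is why it passes a list to `class_` rather than calling `find_all` three times.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for the Acrobat-exported HTML (not the real fm7_15.html).
sample_html = """
<html><body>
<p class="s20"><b>Chapter 1</b></p>
<p class="s30">  Section <i>I</i>  </p>
<p class="s1">Body text that is not a header.</p>
<p class="s33">Task 1.1</p>
</body></html>
"""

# Naming the parser explicitly avoids the "no parser specified" warning.
soup = BeautifulSoup(sample_html, "html.parser")

rows = []
# A list passed to class_ matches any of the classes, in document order.
for p in soup.find_all("p", class_=["s20", "s30", "s33"]):
    # get_text with strip=True trims each text fragment and drops empty ones;
    # the " " separator keeps words from nested tags apart.
    text = p.get_text(" ", strip=True)
    rows.append({"class": p["class"][0], "header": text})

df = pd.DataFrame(rows, columns=["class", "header"])

# With no path argument, to_csv returns the CSV as a string.
csv_text = df.to_csv(index=False)
print(csv_text)
```

To write a file instead, pass a path such as `df.to_csv("headers.csv", index=False)`. If you still want one CSV per class, filtering `df` on the `class` column should do it.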
Upvotes: 2