mazouu rahim

Reputation: 133

Get a specific section from an HTML document

Hello, I would like to get a specific section from an HTML document. The section is tied to a div and is enclosed in a span tag; it is normally at the beginning of the document.

self.contents = BeautifulSoup(convert_pdf_to_html(self.path), 'html.parser')
self.keywords = self.contents.find('span',text=re.compile("(.*keywords.*|.*key-words.*)",re.IGNORECASE)).parent

The problem is that the text always ends with a newline character, which prevents me from retrieving the related div, for example:

<span style="font-family: EICMDB+AdvTrebu-B; font-size:8px">keywords
<br/></span>

Even with a regular expression it doesn't work, and there is no option to strip the text.

Upvotes: 1

Views: 186

Answers (1)

Yannis P.

Reputation: 2765

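First, a note on your regex: you can write the hyphen as \-, although outside a character class the escape is optional, and key\-?words covers both spellings without the surrounding .* parts.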
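To look at the pattern on its own, here is a small sketch that uses only the standard re module. The sample strings are made up for illustration; the first one mimics the span text from the question, trailing newline included. It suggests that the newline does not stop the regex from matching, and that the escaped and unescaped hyphen behave the same here.

```python
import re

# Two spellings of the same pattern: hyphen unescaped and escaped.
plain = re.compile(r"key-?words", re.IGNORECASE)
escaped = re.compile(r"key\-?words", re.IGNORECASE)

# Hypothetical sample strings; the first mimics the span text
# from the question, including the trailing newline.
samples = ["keywords\n", "Key-Words: parsing", "KEYWORDS", "no match here"]

# search() looks for the pattern anywhere in the string,
# so no leading or trailing .* is needed.
results = [bool(plain.search(s)) for s in samples]
results_escaped = [bool(escaped.search(s)) for s in samples]

print(results)          # [True, True, True, False]
print(results_escaped)  # [True, True, True, False]
```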

Anyway, something similar worked for me, although lately I have also had trouble combining regexes with find.

import re
from bs4 import BeautifulSoup as bs
contents = bs(open(path), 'html.parser')
keywords = contents.find(text=re.compile(r"key\-?words", re.I | re.U)).parent
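A likely reason the tag-plus-regex search in the question comes back empty: when a tag name is given, Beautiful Soup compares the pattern against the tag's .string, and .string is None for a span that holds both a text node and a <br/>. Searching for the text node itself avoids that. Below is a minimal sketch under that assumption; the HTML snippet, the id "kw", and the second span are invented for illustration and are not from the converted PDF.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the snippet in the question:
# the span contains the text "keywords\n" followed by a <br/>.
html = '''<div id="kw"><span style="font-size:8px">keywords
<br/></span><span>parsing, pdf</span></div>'''

soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"key-?words", re.IGNORECASE)

# Tag search with a string filter: the pattern is checked against
# span.string, which is None here because the span has two children.
by_tag = soup.find("span", string=pattern)

# String search: returns the matching text node directly.
text_node = soup.find(string=pattern)
span = text_node.parent   # the enclosing <span>
div = span.parent         # the related <div>

print(by_tag)             # None
print(span.name)          # span
print(div.get("id"))      # kw
```

From the text node, .parent gives the span and one more .parent gives the div, which is the step the question was missing. If the newline itself is unwanted, text_node.strip() returns the text without it.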

Upvotes: 1
