Reputation: 133
I want to identify certain sections in an HTML file. Each section is enclosed in a div, and the section title is usually enclosed in a span tag.
So I have tried these two solutions:
1)
doc_html = BeautifulSoup(doc_html, 'html.parser')
my_file['div'] = doc_html.find_all('div')
for div in my_file['div']:
    for span in div.find_all('span'):
        if span.text == 'ABSTRACT':
            my_file['Abstract'] = div
        if span.text == 'Keywords':
            my_file['Keywords'] = div
        if span.text == 'REFERENCES':
            my_file['References'] = div
2)
for span in doc_html.find_all('span'):
    if span.string == 'ABSTRACT':
        my_file['Abstract'] = span.parent
    if span.string == 'Keywords':
        my_file['Keywords'] = span.parent
    if span.string == 'REFERENCES':
        my_file['References'] = span.parent
Both solutions work well for the 'abstract' and 'keywords' sections, but they don't work for 'references', and I don't understand why, because this word is also enclosed in a span tag:
<span style="font-family: Times New Roman,Bold; font-size:10px">REFERENCES
<br/></span>
Finally, I would like to know whether there is a way to optimize this code, for instance by writing it in one line.
Upvotes: 2
Views: 380
Reputation: 474041
I think it's just that there is a newline character after "REFERENCES", so the span's text never equals 'REFERENCES' exactly. Strip it:
text = span.get_text(strip=True)
if text == 'ABSTRACT':
    my_file['Abstract'] = div
if text == 'Keywords':
    my_file['Keywords'] = div
if text == 'REFERENCES':
    my_file['References'] = div
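To see why the stripping matters, here is a small sketch using the span from the question. The surrounding div and the variable names are assumptions added for illustration. It compares what `.text`, `.string`, and `get_text(strip=True)` return for that span:

```python
from bs4 import BeautifulSoup

# Markup modeled on the span from the question; the wrapping div is assumed.
html = '''<div><span style="font-family: Times New Roman,Bold; font-size:10px">REFERENCES
<br/></span></div>'''

span = BeautifulSoup(html, 'html.parser').find('span')

raw_text = span.text                  # keeps the newline that follows the word
single_string = span.string           # None, because the span has two children (text and <br/>)
stripped = span.get_text(strip=True)  # surrounding whitespace removed

print(repr(raw_text), single_string, repr(stripped))
```

This would explain why both attempts in the question miss this span: the first compares against text that still carries the newline, and the second compares against `None`.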
Note that you can simplify the code and make it more Pythonic with a mapping between the span texts and the output dictionary keys:
mapping = {'ABSTRACT': 'Abstract', 'Keywords': 'Keywords', 'REFERENCES': 'References'}
for div in my_file['div']:
    for span in div.find_all('span'):
        text = span.get_text(strip=True)
        if text in mapping:
            my_file[mapping[text]] = div
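For reference, the mapping approach can be put together as a self-contained script. This is only a sketch: the `find_sections` helper and the sample HTML are made up for illustration, since the real document isn't shown in the question.

```python
from bs4 import BeautifulSoup

# Span text -> output dictionary key, as in the mapping above.
MAPPING = {'ABSTRACT': 'Abstract', 'Keywords': 'Keywords', 'REFERENCES': 'References'}

def find_sections(html):
    """Return a dict of section name -> div containing the matching title span."""
    soup = BeautifulSoup(html, 'html.parser')
    sections = {}
    for div in soup.find_all('div'):
        for span in div.find_all('span'):
            text = span.get_text(strip=True)  # ignore newlines and padding
            if text in MAPPING:
                sections[MAPPING[text]] = div
    return sections

# Hypothetical sample input; the real file's structure may differ.
sample = '''
<div><span>ABSTRACT</span><span>Some abstract text.</span></div>
<div><span>Keywords</span><span>parsing, html</span></div>
<div><span>REFERENCES
<br/></span><span>[1] A reference.</span></div>
'''

sections = find_sections(sample)
print(sorted(sections))
```

If the divs are nested in the real file, note that a later (inner) div overwrites an earlier (outer) one for the same key, because `find_all` walks the document in order.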
We can also simplify the "element locating" part of the code, but, without knowing at least the context of the problem and the desired output, it's difficult to help here.
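As one possible direction, and only under the assumption that the nearest enclosing div of each title span is what you want, the lookup could be collapsed into a single pass over the spans with `find_parent`. The `locate_sections` name and the sample markup are hypothetical:

```python
from bs4 import BeautifulSoup

MAPPING = {'ABSTRACT': 'Abstract', 'Keywords': 'Keywords', 'REFERENCES': 'References'}

def locate_sections(html):
    """Map each known section title to the nearest div enclosing its span."""
    soup = BeautifulSoup(html, 'html.parser')
    # Pair every span with its stripped text, then keep only the known titles.
    labelled = ((span.get_text(strip=True), span) for span in soup.find_all('span'))
    return {
        MAPPING[text]: span.find_parent('div')
        for text, span in labelled
        if text in MAPPING
    }

# Hypothetical sample: the title span sits inside a <p> within the section div.
sample = '''
<div id="abs"><span>ABSTRACT</span></div>
<div id="refs"><p><span>REFERENCES
<br/></span></p></div>
'''

result = locate_sections(sample)
print({name: div['id'] for name, div in result.items()})
```

Unlike `span.parent`, `find_parent('div')` skips over intermediate tags such as the `<p>` above, which may or may not match the structure of the actual file.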
Upvotes: 1