mazouu rahim

Reputation: 133

How to get the parent of a <span> tag containing specific text

I want to identify some sections of an HTML file. Each section is enclosed in a div, and the section title is usually enclosed in a span tag.

So I have tried these two solutions:

1)

from bs4 import BeautifulSoup

doc_html = BeautifulSoup(doc_html, 'html.parser')
my_file['div'] = doc_html.find_all('div')
for div in my_file['div']:
    for span in div.find_all('span'):
        if span.text == 'ABSTRACT':
            my_file['Abstract'] = div
        if span.text == 'Keywords':
            my_file['Keywords'] = div
        if span.text == 'REFERENCES':
            my_file['References'] = div

2)

for span in doc_html.find_all('span'):
    if span.string == 'ABSTRACT':
        my_file['Abstract'] = span.parent
    if span.string == 'Keywords':
        my_file['Keywords'] = span.parent
    if span.string == 'REFERENCES':
        my_file['References'] = span.parent

Both solutions work well for the 'ABSTRACT' and 'Keywords' sections, but not for 'REFERENCES', and I don't understand why, because this word is also enclosed in a span tag:

<span style="font-family: Times New Roman,Bold; font-size:10px">REFERENCES 
<br/></span>

Finally, I would like to know whether there is a way to optimize this code, for instance by putting it on one line.

Upvotes: 2

Views: 380

Answers (1)

alecxe

Reputation: 474041

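I think it's just that there is trailing whitespace (a space and a newline) after "REFERENCES", so span.text is not exactly equal to 'REFERENCES'. In the second attempt, span.string is None because the span also contains a <br/> child. Strip the text before comparing: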

text = span.get_text(strip=True)
if text == 'ABSTRACT':
    my_file['Abstract'] = div
if text == 'Keywords':
    my_file['Keywords'] = div
if text == 'REFERENCES':
    my_file['References'] = div
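
To see the difference, here is a minimal sketch that parses a shortened version of the span from the question (the style attribute is dropped, and the variable names are only illustrative):

```python
from bs4 import BeautifulSoup

# Shortened version of the span from the question: the text is followed by
# a space, a newline and a <br/> tag.
html = '<div><span>REFERENCES \n<br/></span></div>'
span = BeautifulSoup(html, 'html.parser').find('span')

raw_text = span.text                     # keeps the trailing space and newline
only_string = span.string                # None: the span has two children (text and <br/>)
clean_text = span.get_text(strip=True)   # whitespace removed

print(repr(raw_text), only_string, repr(clean_text))
```

Only the stripped text compares equal to 'REFERENCES', which is why the first two comparisons in the question never match for this section.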

Note that you can simplify the code and make it more Pythonic by using a mapping between the span texts and the output dictionary keys:

mapping = {'ABSTRACT': 'Abstract', 'Keywords': 'Keywords', 'REFERENCES': 'References'}
for div in my_file['div']:
    for span in div.find_all('span'):
        text = span.get_text(strip=True)
        if text in mapping:
            my_file[mapping[text]] = div

We can also simplify the "element locating" part of the code, but, without knowing at least the context of the problem and the desired output, it's difficult to help here.
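
That said, one possible simplification, as a sketch only: loop over the spans once and take the nearest enclosing div with find_parent. This assumes each title span sits inside the div of its own section. The function name find_sections and the sample HTML are hypothetical, made up for illustration.

```python
from bs4 import BeautifulSoup


def find_sections(html, mapping):
    """Return {output key: nearest enclosing <div>} for every span whose
    stripped text is a key of `mapping`."""
    soup = BeautifulSoup(html, 'html.parser')
    sections = {}
    for span in soup.find_all('span'):
        key = mapping.get(span.get_text(strip=True))
        if key is not None:
            # Assumption: the section is the closest <div> ancestor of the span.
            sections[key] = span.find_parent('div')
    return sections


# Hypothetical sample document, only for demonstration.
sample = (
    '<div><span>ABSTRACT</span><p>Some abstract.</p></div>'
    '<div><span>Keywords</span><p>parsing, html</p></div>'
    '<div><span>REFERENCES \n<br/></span><p>[1] A reference.</p></div>'
)
mapping = {'ABSTRACT': 'Abstract', 'Keywords': 'Keywords', 'REFERENCES': 'References'}
sections = find_sections(sample, mapping)
print(sorted(sections))
```

If the same title appears in several spans, the last one wins, as in the loops above.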

Upvotes: 1
