Reputation: 4521
I have some text from an XML document, and I am trying to extract the tags whose text contains certain words.
For example:
search('adverse')
should return all the tags containing the word 'adverse':
Out:
[
"<item>The most common adverse reactions reported in subjects receiving coadministered dutasteride and tamsulosin were impotence, decreased libido, breast disorders (including breast enlargement and tenderness), ejaculation disorders, and dizziness.</item>"
]
and search('clinical')
should return two results, since two tags contain that word:
Out:
[
"<title>6.1 Clinical Trials Experience</title>",
"<paragraph id="ID41">The clinical efficacy and safety of coadministered dutasteride and tamsulosin, which are individual components of dutasteride and tamsulosin hydrochloride capsules, have been evaluated in a multicenter, randomized, double-blind, parallel group trial (the Combination with Alpha-Blocker Therapy, or CombAT, trial) </paragraph>"
]
What tools should I use for this? RegEx? BS4? Any suggestions are greatly appreciated.
Example Text:
</highlight>
</excerpt>
<component>
<section id="ID40">
<id root="fbc21d1a-2fb2-47b1-ac53-f84ed1428bb4"></id>
<title>6.1 Clinical Trials Experience</title>
<text>
<paragraph id="ID41">The clinical efficacy and safety of coadministered dutasteride and tamsulosin, which are individual components of dutasteride and tamsulosin hydrochloride capsules, have been evaluated in a multicenter, randomized, double-blind, parallel group trial (the Combination with Alpha-Blocker Therapy, or CombAT, trial) </paragraph>
<list id="ID42" listtype="unordered" stylecode="Disc">
<item>The most common adverse reactions reported in subjects receiving coadministered dutasteride and tamsulosin were impotence, decreased libido, breast disorders (including breast enlargement and tenderness), ejaculation disorders, and dizziness.</item>
Upvotes: 0
Views: 55
Reputation: 6653
You could either hardcode it with a regex, or parse your XML file with a library like lxml.
With a regex that would be:
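import re

your_text = "(...)"  # the XML text to search

def search(instr):
    # Match one <tag>text</tag> whose text contains the word, ignoring case.
    # re.escape keeps special characters in the word from being read as regex syntax.
    pattern = r"<[^/>][^>]*>[^<]*{}[^<]*</[^>]+>".format(re.escape(instr))
    return re.findall(pattern, your_text, re.IGNORECASE)

print(search("safety"))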
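For the parser route, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml (lxml.etree offers a largely compatible API), and it assumes the input is a complete, well-formed document, which the excerpt in the question is not. SAMPLE and search_tree are hypothetical names made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened, well-formed sample based on the question's excerpt.
SAMPLE = """<section id="ID40">
  <title>6.1 Clinical Trials Experience</title>
  <text>
    <paragraph id="ID41">The clinical efficacy and safety of coadministered dutasteride and tamsulosin have been evaluated.</paragraph>
    <list id="ID42">
      <item>The most common adverse reactions were impotence and dizziness.</item>
    </list>
  </text>
</section>"""


def search_tree(word, xml_text):
    """Return serialized elements whose own text contains word (case-insensitive)."""
    root = ET.fromstring(xml_text)
    needle = word.lower()
    hits = []
    for elem in root.iter():
        # elem.text is the text directly inside the element, before any child tag.
        if elem.text and needle in elem.text.lower():
            # tostring() also emits the whitespace that follows the element, so strip it.
            hits.append(ET.tostring(elem, encoding="unicode").strip())
    return hits


print(search_tree("clinical", SAMPLE))
```

Because only each element's own text is checked, container tags such as section or text are not returned just because one of their children matches. A real document would still need its missing closing tags (and any namespaces) handled before parsing.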
Upvotes: 1