Reputation: 5938
I have a problem extracting information from messy HTML data. Basically, I want to extract only the words that are actually displayed from a given piece of HTML code. Here is an example of the raw HTML data I am given:
<p>I have an app which send mail to my defined mail address "[email protected]". For this i create my own Custom Email View Which contains check boxes message body and other options. Now i want that when send button is pressed my app should not go to gmail view or other email client view it directly submit the data</p>
<p>String recepientEmail = "[email protected]"; </p>
<p>// either set to destination email or leave empty</p>
<pre><code> Intent intent = new Intent(Intent.ACTION_SENDTO);
intent.setData(Uri.parse("mailto:" + recepientEmail));
startActivity(intent);
</code></pre>
<p>but on submit it opens gmail or chooser email client view but i dont want to show gmail view</p>
and I want to transform it into this
I have an app which send mail to my defined mail address "[email protected]". For this i create my own Custom Email View Which contains check boxes message body and other options. Now i want that when send button is pressed my app should not go to gmail view or other email client view it directly submit the data String recepientEmail = "[email protected]"; // either set to destination email or leave empty but on submit it opens gmail or chooser email client view but i dont want to show gmail view
So basically, I just want to retrieve everything within each of the <p> tags and concatenate the pieces together. I am using Python, so I think BeautifulSoup is probably the best way to do this, but I can't figure out how. I also want to repeat this over many such examples (actually millions); each example should have at least one <p> tag.
Upvotes: 0
Views: 3672
Reputation: 23
I recently started playing around with Beautiful Soup and found a line of code that was extremely helpful. I will include my entire example to show you.
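import requests
from bs4 import BeautifulSoup

# Fetch the page; replace the placeholder with a real URL.
r = requests.get("your url")
html_text = r.text
# Name the parser explicitly to avoid bs4's "no parser specified" warning.
soup = BeautifulSoup(html_text, "html.parser")
# Join every text node in the document (not only those inside <p> tags).
clean_html = ''.join(soup.find_all(string=True))
print(clean_html)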
Hopefully this works for you/answers your question
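Note that the snippet above joins every text node in the document, so for the question's sample it would also pull in the contents of the <pre><code> block. A minimal sketch that keeps only the <p> text might look like the following; the function name paragraph_text and the sample string are made up for illustration, and the built-in "html.parser" backend is an assumption (lxml may be faster if it is installed).

```python
from bs4 import BeautifulSoup


def paragraph_text(raw_html: str) -> str:
    """Join the visible text of every <p> element with single spaces."""
    soup = BeautifulSoup(raw_html, "html.parser")
    parts = []
    for p in soup.find_all("p"):
        # get_text() also collects text from tags nested inside the paragraph;
        # split()/join() collapses runs of whitespace and trims the ends.
        text = " ".join(p.get_text().split())
        if text:
            parts.append(text)
    return " ".join(parts)


# Hypothetical sample: the <pre><code> block should be left out.
sample = (
    "<p>First <b>bold</b> paragraph.</p>"
    "<pre><code>ignored = True\n</code></pre>"
    "<p>Second paragraph.</p>"
)
print(paragraph_text(sample))  # First bold paragraph. Second paragraph.
```

Since the raw HTML is already available as strings, the requests call is not needed here; the function can be applied to each record in turn.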
Upvotes: 1
Reputation: 6156
html2text is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text. Better yet, that ASCII also happens to be valid Markdown (a text-to-HTML format).
<span id="midArticle_1"></span><p>Here is the First Paragraph.</p><span id="midArticle_2"></span><p>Here is the second Paragraph.</p><span id="midArticle_3"></span><p>Paragraph Three."</p>
# Assumes lxml is installed and that `url` points at the HTML shown above.
from lxml import html
print(html.parse(url).xpath('//p/text()'))
OUTPUT
['Here is the First Paragraph.', 'Here is the second Paragraph.', 'Paragraph Three."']
Upvotes: 3
Reputation: 36262
One way is to use the BeautifulSoup module to extract all text from <p> tags.
Content of script.py:
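from bs4 import BeautifulSoup
import sys

# Read the HTML file named on the command line.
with open(sys.argv[1], 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')
# get_text() is used because .string is None for a <p> that contains nested tags.
print(' '.join(p.get_text().strip() for p in soup.find_all('p')))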
Run it like:
python3 script.py infile
That yields:
I have an app which send mail to my defined mail address "[email protected]". For this i create my own Custom Email View Which contains check boxes message body and other options. Now i want that when send button is pressed my app should not go to gmail view or other email client view it directly submit the data String recepientEmail = "[email protected]"; // either set to destination email or leave empty but on submit it opens gmail or chooser email client view but i dont want to show gmail view
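For the millions of records mentioned in the question, it may also be worth trying a version with no third-party dependency. The following is only a sketch using the standard library's html.parser, which is a different technique from the BeautifulSoup answers above; the names ParagraphExtractor and extract_paragraph_text are invented for illustration, it assumes <p> tags are not nested, and it has not been benchmarked against BeautifulSoup or lxml.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> elements and ignore everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._inside = False   # True while the parser is inside a <p>
        self._current = []     # text chunks of the paragraph being read
        self.paragraphs = []   # finished paragraphs, whitespace-normalized

    def flush(self):
        # Join the chunks of the current paragraph and collapse whitespace.
        text = " ".join("".join(self._current).split())
        if text:
            self.paragraphs.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush()       # a new <p> implicitly ends an unclosed one
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self._current.append(data)


def extract_paragraph_text(raw_html):
    parser = ParagraphExtractor()
    parser.feed(raw_html)
    parser.close()   # forces any buffered text through handle_data
    parser.flush()   # picks up a trailing <p> that was never closed
    return " ".join(parser.paragraphs)


# Hypothetical sample modeled on the question's data.
sample = (
    "<p>String x = &quot;a&quot;; </p>"
    "<pre><code>skip this\n</code></pre>"
    "<p>but on submit</p>"
)
print(extract_paragraph_text(sample))  # String x = "a"; but on submit
```

A fresh parser is created per record, so the function can be mapped over a list or generator of HTML strings without state leaking between records.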
Upvotes: 2