Reputation: 221
I want to extract the text of this Quora post, or more generally of any post that contains code.
Example: http://qr.ae/Rkplrt
Using Selenium from Python, I can get the HTML inside the post:
import html2text

h = html2text.HTML2Text()
# `ans` is the Selenium element of the answer, located earlier
content = ans.find_element_by_class_name('inline_editor_value')
html_string = content.get_attribute('innerHTML')
text = h.handle(html_string)
print(text)
I would like the result to be a single chunk of text. But for the tables that contain code, html2text inserts many \n characters and does not handle the row indices correctly.
So I can see this:
https://imageshack.com/i/paEKbzT4p (This is the principal div that contains the table with code.)
https://imageshack.com/i/hlIxFayop (The text that html2text extracts)
https://imageshack.com/i/hlHFBXvQp (Instead, this is the final printed text, with problems in the row indices and extra \n characters.)
I have already tried different settings, such as bypass_tables, described in this guide on GitHub (https://github.com/Alir3z4/html2text/blob/master/docs/usage.md#available-options), but without success.
Could someone tell me how to use html2text in this case?
Upvotes: 1
Views: 705
Reputation: 658
You could use BeautifulSoup to do some simple parsing of the HTML. (I'm using urllib to communicate with the website because I'm not familiar with Selenium, but I'm sure it can be made to work with it.)
import urllib.request  # a plain `import urllib` does not load the request submodule
from bs4 import BeautifulSoup
# urllib opener
opener = urllib.request.build_opener(
    urllib.request.HTTPRedirectHandler(),
    urllib.request.HTTPHandler(debuglevel=0),
    urllib.request.HTTPSHandler(debuglevel=0))
# Get page
html = opener.open("http://qr.ae/Rkplrt").read()
# Create BeautifulSoup object
soup = BeautifulSoup(html, "lxml")
# Find the HTML element you want
answer = soup.find('div', { 'class' : 'ExpandedQText ExpandedAnswer' })
# Remove the stuff you don't want
answer.find('td', { 'class' : 'linenos' }).extract()
answer.find('div', { 'class' : 'ContentFooter AnswerFooter' }).extract()
# Print
print("\n".join(answer.stripped_strings))
I'm not entirely sure what you want to extract. The above gives you just the answer, including the code, without the line numbers:
This is:
#include <stdio.h>
int v,i,j,k,l,s,a[99];
main()
{
for(scanf("%d", &s);*a-s;v=a[j*=v]-a[i],k=i<s,j+=(v=j<s&&(!k&&!!printf(2+"\n\n%c"-(!l<<!j)," #Q"[l^v?(l^j)&1:2])&&++l||a[i]<s&&v&&v-i+j&&v+i-j))&&!(l%=s),v||(i==j?a[i+=k]=0:++a[i])>=s*k&&++a[--i]);
}
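If you want to try the line-number removal without fetching the page or installing anything, here is a minimal sketch that uses only the standard library's html.parser instead of BeautifulSoup. The class name CodeTextExtractor and the sample markup are made up for illustration; only the linenos class is taken from the script above.

```python
from html.parser import HTMLParser


class CodeTextExtractor(HTMLParser):
    """Collect text, skipping any <td class="linenos"> cell (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a line-number cell

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            # Track nested <td> tags so the matching </td> ends the skip.
            if tag == "td":
                self.skip_depth += 1
        elif tag == "td" and "linenos" in (dict(attrs).get("class") or "").split():
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if self.skip_depth and tag == "td":
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside a line-number cell.
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())


# Made-up markup that mimics a highlighted code table.
sample = (
    '<table><tr>'
    '<td class="linenos"><pre>1\n2</pre></td>'
    '<td class="code"><pre>int main()\n{ return 0; }</pre></td>'
    '</tr></table>'
)

parser = CodeTextExtractor()
parser.feed(sample)
parser.close()
result = "\n".join(parser.parts)
print(result)
```

The newlines inside the code cell are kept, so the code stays readable; only the line-number column is dropped.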
Update: OP asked for <a> and <img> tags to be replaced by their href and src values. The version of my script below should take care of this. It also handles multiple answers.
import urllib.request  # a plain `import urllib` does not load the request submodule
from bs4 import BeautifulSoup
# urllib opener
opener = urllib.request.build_opener(
    urllib.request.HTTPRedirectHandler(),
    urllib.request.HTTPHandler(debuglevel=0),
    urllib.request.HTTPSHandler(debuglevel=0))
# Get page
html = opener.open("https://www.quora.com/Is-it-too-late-for-an-X-year-old-to-learn-how-to-program").read()
# Create BeautifulSoup object
soup = BeautifulSoup(html, "lxml")
# Place to store the final output
output = ''
# Find the HTML elements you want
answers = soup.find_all('div', { 'class' : 'ExpandedQText ExpandedAnswer' })
for answer in answers:
    # Remove the stuff you don't want (either element may be missing)
    linenos = answer.find('td', { 'class' : 'linenos' })
    if linenos is not None:
        linenos.extract()
    footer = answer.find('div', { 'class' : 'ContentFooter AnswerFooter' })
    if footer is not None:
        footer.extract()
    # Replace <a> with its url (anchors without an href are left alone)
    for link in answer.select('a[href]'):
        url = link['href']
        link.insert_after(url)
        link.extract()
    # Replace <img> with its src
    for img in answer.select('img[src]'):
        url = img['src']
        img.insert_after(url)
        img.extract()
    # Attach to output
    output += "\n".join(answer.stripped_strings) + '\n\n'
# Print
print(output)
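To see the link and image replacement in isolation, here is a small offline sketch of the same idea. It swaps in the standard library's html.parser for BeautifulSoup so it runs without extra packages; the class name UrlInliner, the sample markup, and the example.com URLs are all made up for illustration.

```python
from html.parser import HTMLParser


class UrlInliner(HTMLParser):
    """Emit text, with <a> and <img> replaced by their URL (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.in_link = False  # True while inside an <a> that has an href

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.parts.append(attrs["href"])
            self.in_link = True
        elif tag == "img" and attrs.get("src"):
            self.parts.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        # The link text is dropped, as in the script above, which removes the <a> element.
        if not self.in_link and data.strip():
            self.parts.append(data.strip())


# Made-up markup with one link and one image.
sample = (
    '<p>See <a href="http://example.com/doc">the docs</a> and '
    '<img src="http://example.com/a.png"> here.</p>'
)

inliner = UrlInliner()
inliner.feed(sample)
inliner.close()
inlined = " ".join(inliner.parts)
print(inlined)
```

Anchors without an href and images without a src are ignored, which avoids a lookup error on incomplete tags.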
Upvotes: 1
Reputation: 474041
You don't actually need to use HTML2Text at all. selenium can get you the "text" directly:
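from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("http://qr.ae/Rkplrt")
# find_element(By.CLASS_NAME, ...) also works on Selenium versions that removed find_element_by_class_name
print(driver.find_element(By.CLASS_NAME, 'inline_editor_content').text)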
It prints the content of the post:
The single line of code must be useful, not something meant to be confusing or obfuscating.
...
What examples have you created or encountered ?
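If the goal is a single chunk of text, as in the question, the extracted string can be collapsed afterwards. This is only a sketch under that assumption: the function name to_single_chunk and the sample string are made up, and the sample stands in for whatever `.text` returns.

```python
import re


def to_single_chunk(text):
    """Collapse every run of whitespace (including newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample standing in for the string returned by `.text`.
sample = "The single line of code must be useful,\n\nnot something   confusing.\n"
chunk = to_single_chunk(sample)
print(chunk)
```

Note that this also flattens any code in the post onto one line, so it is probably only appropriate when the layout of the code does not matter.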
Upvotes: 1