I want to parse a Quora post, or any generic post, that contains code. Example: http://qr.ae/Rkplrt

Using Selenium from Python, I can get the HTML inside the post and convert it with html2text:

h = html2text.HTML2Text()
content = ans.find_element_by_class_name('inline_editor_value')
html_string = content.get_attribute('innerHTML')
text = h.handle(html_string)
print text

I would like the result to be a single chunk of text. But for tables that contain code, html2text inserts many \n characters and does not handle the row indices (the line-number column) properly. This is what I see:

https://imageshack.com/i/paEKbzT4p (the main div that contains the table with code)
https://imageshack.com/i/hlIxFayop (the text that html2text extracts)
https://imageshack.com/i/hlHFBXvQp (the final printed text, with the row-index problems and the extra \n characters)

I have already tried different settings, such as bypass_tables, from the usage guide on GitHub (https://github.com/Alir3z4/html2text/blob/master/docs/usage.md#available-options), without success. Could someone tell me how to use html2text in this case?

Parse answer in Quora that contains code

Reputation: 658

You could use BeautifulSoup to do some simple parsing of the HTML. (I'm using urllib to fetch the page because I'm not familiar with Selenium, but I'm sure the same approach can be made to work with HTML obtained through Selenium.)

import urllib.request
from bs4 import BeautifulSoup

# urllib opener
opener = urllib.request.build_opener(
          urllib.request.HTTPRedirectHandler(),
          urllib.request.HTTPHandler(debuglevel=0),
          urllib.request.HTTPSHandler(debuglevel=0))

# Get page
html = opener.open("http://qr.ae/Rkplrt").read()

# Create BeautifulSoup object
soup = BeautifulSoup(html, "lxml")

# Find the HTML element you want
answer = soup.find('div', { 'class' : 'ExpandedQText ExpandedAnswer' })

# Remove the stuff you don't want
answer.find('td', { 'class' : 'linenos' }).extract()
answer.find('div', { 'class' : 'ContentFooter AnswerFooter' }).extract()

# Print
print("\n".join(answer.stripped_strings))

I'm not entirely sure what you want to extract. The above gives you just the answer, including the code, without the line numbers:

This is:
#include <stdio.h>
int v,i,j,k,l,s,a[99];
main()
{
for(scanf("%d", &s);*a-s;v=a[j*=v]-a[i],k=i<s,j+=(v=j<s&&(!k&&!!printf(2+"\n\n%c"-(!l<<!j)," #Q"[l^v?(l^j)&1:2])&&++l||a[i]<s&&v&&v-i+j&&v+i-j))&&!(l%=s),v||(i==j?a[i+=k]=0:++a[i])>=s*k&&++a[--i]);
}
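
The key step above, dropping the line-number cell before collecting the text, can be tried offline on a small snippet. The following is only a minimal sketch: it swaps BeautifulSoup for the standard library's html.parser so that it runs without any third-party package, and both the LinenoStripper class and the SAMPLE markup are made up for illustration (the real Quora markup may differ).

```python
from html.parser import HTMLParser


class LinenoStripper(HTMLParser):
    """Collect text from HTML, skipping any <td class="linenos"> cell."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped <td>
        self.chunks = []     # stripped text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag != "td":
            return
        classes = (dict(attrs).get("class") or "").split()
        # Start skipping at a linenos cell; track nested cells while skipping.
        if self.skip_depth or "linenos" in classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "td" and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


# Hypothetical markup imitating a highlighted code table (an assumption).
SAMPLE = """
<div><p>This is:</p>
<table><tr>
<td class="linenos"><pre>1
2</pre></td>
<td class="code"><pre>int main()
{ return 0; }</pre></td>
</tr></table></div>
"""

parser = LinenoStripper()
parser.feed(SAMPLE)
parser.close()
result = "\n".join(parser.chunks)
print(result)
```

Like "\n".join(answer.stripped_strings), this joins the remaining text fragments with single newlines, so the code lines survive while the index column does not.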

Update: OP asked for <a> and <img> tags to be replaced by their href and src values. The version of my script below should take care of this. It also handles multiple answers.

import urllib.request
from bs4 import BeautifulSoup

# urllib opener
opener = urllib.request.build_opener(
          urllib.request.HTTPRedirectHandler(),
          urllib.request.HTTPHandler(debuglevel=0),
          urllib.request.HTTPSHandler(debuglevel=0))

# Get page
html = opener.open("https://www.quora.com/Is-it-too-late-for-an-X-year-old-to-learn-how-to-program").read()

# Create BeautifulSoup object
soup = BeautifulSoup(html, "lxml")

# Place to store the final output
output = ''

# Find the HTML element you want
answers = soup.find_all('div', { 'class' : 'ExpandedQText ExpandedAnswer' })
for answer in answers:

  # Remove the stuff you don't want
  linenos = answer.find('td', { 'class' : 'linenos' })
  if linenos is not None:
    linenos.extract()
  footer = answer.find('div', { 'class' : 'ContentFooter AnswerFooter' })
  if footer is not None:
    footer.extract()

  # Replace <a> with its url
  for link in answer.select('a'):
    url = link['href']
    link.insert_after(url)
    link.extract()

  # Replace <img> with its src
  for img in answer.select('img'):
    url = img['src']
    img.insert_after(url)
    img.extract()

  # Attach to output
  output += "\n".join(answer.stripped_strings) + '\n\n'

# Print
print(output)
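
The replacement step can also be checked on a small snippet without fetching a page. This is a minimal sketch under stated assumptions, not the script above: it uses the standard library's html.parser rather than BeautifulSoup, the UrlInliner class and the SAMPLE markup are hypothetical, and, like the script, it drops the link text and keeps only the URL. Unlike link['href'], which raises a KeyError when the attribute is missing, this version silently skips tags without an href or src.

```python
from html.parser import HTMLParser


class UrlInliner(HTMLParser):
    """Collect text, replacing <a> with its href and <img> with its src."""

    def __init__(self):
        super().__init__()
        self.in_link = 0   # greater than 0 while inside an <a> element
        self.chunks = []   # text fragments and URLs, in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self.in_link += 1
            if attrs.get("href"):
                self.chunks.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.chunks.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        text = data.strip()
        # The link text itself is dropped; only the URL is kept.
        if text and not self.in_link:
            self.chunks.append(text)


# Hypothetical markup for illustration only (an assumption).
SAMPLE = (
    '<p>See <a href="https://example.com/doc">the docs</a> and\n'
    '<img src="https://example.com/pic.png"> for details.</p>'
)

inliner = UrlInliner()
inliner.feed(SAMPLE)
inliner.close()
inlined = "\n".join(inliner.chunks)
print(inlined)
```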

Upvotes: 1

alecxe

Reputation: 474041

You don't actually need to use HTML2Text at all.

Selenium can give you the rendered text of the element directly:

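from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://qr.ae/Rkplrt")

# Locate the post body by its class name and print its visible text
print(driver.find_element(By.CLASS_NAME, 'inline_editor_content').text)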

It prints the content of the post:

The single line of code must be useful, not something meant to be confusing or obfuscating.

...

What examples have you created or encountered ?

Upvotes: 1
