Reputation: 11

Unable to parse multiple files in a directory

I have HTML files on my local hard drive that I am trying to open in a web page by sending an HTTP request.
Once the HTTP request is created, I parse the stored HTML file by passing its URL. Parsing succeeds when I pass one file at a time, but I want to do it dynamically for all the files in a directory, so I used a for loop. This doesn't work.

Once the parsing is done, I save the data to a JSON file (this works fine). I have pasted the code here:

import json
import os
from newspaper import Article
import newspaper

# initiating the server
server_start = os.system('start "HTTP Server on port 8000" cmd.exe /c {python -m http.server}')
http_server = 'http://localhost:8000/'
links = ''
path = "<path>"
for f in os.listdir(path):
    if f.endswith('.html'):
        links = http_server + path + f

    blog_post = newspaper.build(links)

    for article in blog_post.articles:
        print(article.url)

    article = Article(links)
    article.download('')
    article.parse()
    data = {"HTML": article.html, "author": article.authors, "title": article.title, "text": article.text, "date": str(article.publish_date)}

    json_data = json.dumps(data)
    with open('data.json', 'w') as outfile:
        json.dump(data, outfile)

Error message:

...\newspaper\Scripts\python.exe ".../parsing_newspaper/test1.py"
[Source parse ERR] http://localhost:8000/.../cnnpolitics-russian.html
Traceback (most recent call last):
  File "...\newspaper\lib\site-packages\newspaper\parsers.py", line 68, in fromstring
    cls.doc = lxml.html.fromstring(html)
  File "...\newspaper\lib\site-packages\lxml\html\__init__.py", line 876, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "...\newspaper\lib\site-packages\lxml\html\__init__.py", line 762, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "src\lxml\lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src\lxml\lxml.etree.c:78994)
  File "src\lxml\parser.pxi", line 1848, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:118325)
  File "src\lxml\parser.pxi", line 1729, in lxml.etree._parseDoc (src\lxml\lxml.etree.c:116883)
  File "src\lxml\parser.pxi", line 1063, in lxml.etree._BaseParser._parseUnicodeDoc (src\lxml\lxml.etree.c:110870)
  File "src\lxml\parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:105093)
  File "src\lxml\parser.pxi", line 706, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:106801)
  File "src\lxml\parser.pxi", line 646, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:105947)
  File "", line 0
lxml.etree.XMLSyntaxError:

You must download() an article before calling parse() on it!

Traceback (most recent call last):
  File ".../test1.py", line 26, in <module>
    article.parse()
  File "...\newspaper\lib\site-packages\newspaper\article.py", line 168, in parse
    raise ArticleException()
newspaper.article.ArticleException

Upvotes: 0

Answers (2)

Seppe Mariën

Reputation: 380

I don't know if this helps, but try this:

import json
import os
from newspaper import Article
import newspaper

# initiating the server
server_start = os.system('start "HTTP Server on port 8000" cmd.exe /c {python -m http.server}')
http_server = 'http://localhost:8000/'
links = ''
path = "<path>"
for f in os.listdir(path):
    # Only handle HTML files; everything below stays inside this check
    if f.endswith('.html'):
        links = http_server + path + f

        blog_post = newspaper.build(links)

        for article in blog_post.articles:
            print(article.url)

        article = Article(links)
        article.download('')
        article.parse()
        data = {"HTML": article.html, "author": article.authors, "title": article.title, "text": article.text, "date": str(article.publish_date)}

        json_data = json.dumps(data)
        with open('data.json', 'w') as outfile:
            json.dump(data, outfile)

Otherwise, if the first file does not have an .html extension, you try to build from an empty string.

And if the first file has an .html extension but the second one does not, then you build the same file (at least) twice.
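
The same idea can be isolated from the newspaper calls by collecting the URLs first. This is only a minimal sketch using the standard library: the helper name html_urls is hypothetical, and it assumes the HTTP server was started inside the directory being listed, so that each file is served directly under http://localhost:8000/.

```python
import os
from urllib.parse import quote


def html_urls(directory, base_url="http://localhost:8000/"):
    """Return one URL per .html file in `directory`, skipping everything else."""
    urls = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".html"):
            # Skip non-HTML entries so a stale or empty link is never reused
            continue
        # quote() escapes spaces and other characters that are unsafe in a URL
        urls.append(base_url + quote(name))
    return urls
```

Each returned URL can then be handed to Article(...) in turn, so every file is processed exactly once and non-HTML entries never reach the parser.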

Upvotes: 1

sdikby

Reputation: 1471

A checklist to follow before going deeper into debugging:

Check that the HTML is not empty
Check that the HTML is "well-formed"
Check that the article is not empty
Check that the article has been downloaded (that is what parse() checks anyway, but doing it yourself helps you isolate "problematic" articles)
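
The first check can be run on the files themselves before anything is sent to the parser. The following is a rough sketch using only the standard library; the helper name empty_html_files is hypothetical, and it covers only the "not empty" item, not well-formedness.

```python
import os


def empty_html_files(directory):
    """Return the names of .html files that are empty or contain only whitespace."""
    empty = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".html"):
            continue
        file_path = os.path.join(directory, name)
        # Read as bytes so an unknown encoding cannot raise a decoding error
        with open(file_path, "rb") as handle:
            if not handle.read().strip():
                empty.append(name)
    return empty
```

Any file name it returns is a candidate "problematic" article that can be inspected or skipped before the loop runs.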

Upvotes: 0
