vdkotian

Reputation: 559

Extract JSON from Text in python

I want to extract JSON/dictionary from a log text.

The Sample log text:

2018-06-21 19:42:58 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'locations', 'CLOSESPIDER_TIMEOUT': '14400', 'FEED_FORMAT': 'geojson', 'LOG_FILE': '/geojson_dumps/21_Jun_2018_07_42_54/logs/coastalfarm.log', 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'locations.spiders', 'SPIDER_MODULES': ['locations.spiders'], 'TELNETCONSOLE_ENABLED': '0', 'USER_AGENT': 'Mozilla/5.0'}

2018-06-21 19:43:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 369,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 1718,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 6, 21, 14, 13, 0, 841666),
 'item_scraped_count': 4,
 'log_count/INFO': 8,
 'memusage/max': 56856576,
 'memusage/startup': 56856576,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 6, 21, 14, 12, 58, 499385)}

2018-06-21 19:43:00 [scrapy.core.engine] INFO: Spider closed (finished)

I have tried (\{.+$\}) as the regex expression, but it gives me the dict that sits on a single line, {'BOT_NAME': 'locations', ..., 'USER_AGENT': 'Mozilla/5.0'}, which is not what I expect.

The json/dictionary I want to extract is below. Note: the dictionary would not always have the same keys; they could differ.

{'downloader/request_bytes': 369,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 1718,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 6, 21, 14, 13, 0, 841666),
 'item_scraped_count': 4,
 'log_count/INFO': 8,
 'memusage/max': 56856576,
 'memusage/startup': 56856576,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 6, 21, 14, 12, 58, 499385)}

Upvotes: 2

Views: 5794

Answers (2)

Dustin Oprea

Reputation: 10236

Using a JSON tokenizer makes this a very simple and efficient task, as long as you have an anchor to search for in the original document that allows you to at least identify the beginning of the JSON blob. This uses json-five to extract JSON from HTML:

import json5.tokenizer

with open('5f32d5b4e2c432f660e1df44.html') as f:
    document = f.read()

search_for = "window.__INITIAL_STATE__="
i = document.index(search_for)
j = i + len(search_for)
extract_from = document[j:]

tokens = json5.tokenizer.tokenize(extract_from)
stack = []
collected = []
for token in tokens:
    collected.append(token.value)

    if token.type in ('LBRACE', 'LBRACKET'):
        stack.append(token)
    elif token.type in ('RBRACE', 'RBRACKET'):
        stack.pop()

    if not stack:
        break

json_blob = ''.join(collected)

Note that this accounts for the JSON being either a complex type (object, list) or a scalar.
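If pulling in json-five is not an option, the same brace-balancing idea can be sketched with a plain character scan. This is a rough stdlib-only approximation: unlike the tokenizer above, it assumes the blob starts with { or [ and it ignores the possibility of braces appearing inside string literals:

```python
def extract_balanced(text, anchor):
    """Return the brace-balanced blob that starts right after `anchor`.

    Rough sketch: counts braces/brackets character by character; does
    not handle scalars or braces inside string literals.
    """
    start = text.index(anchor) + len(anchor)
    depth = 0
    for i in range(start, len(text)):
        ch = text[i]
        if ch in '{[':
            depth += 1
        elif ch in '}]':
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    raise ValueError('unbalanced braces after anchor')

html = 'window.__INITIAL_STATE__={"a": [1, 2], "b": {"c": 3}};more'
print(extract_balanced(html, 'window.__INITIAL_STATE__='))
# → {"a": [1, 2], "b": {"c": 3}}
```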

Upvotes: 0

JohnKlehm

Reputation: 2398

Edit: The JSON spans multiple lines, so here's a regex that will match it:

import re

# `log` holds the full log-file text (e.g. from open(...).read()).
re_str = r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \[scrapy\.statscollectors\] INFO: Dumping Scrapy stats:\n(\{.+?\})'
stats_re = re.compile(re_str, re.DOTALL)

for match in stats_re.findall(log):
    print(match)
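As a quick self-contained check, here is the same idea run against a trimmed copy of the sample log from the question (the pattern anchors on the statscollectors line and lazily captures up to the first closing brace):

```python
import re

# Trimmed copy of the sample log from the question.
log = """2018-06-21 19:42:58 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'locations'}
2018-06-21 19:43:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 369,
 'finish_reason': 'finished',
 'item_scraped_count': 4}
2018-06-21 19:43:00 [scrapy.core.engine] INFO: Spider closed (finished)"""

# DOTALL lets . cross newlines; the lazy .+? stops at the first }.
pattern = re.compile(
    r'\[scrapy\.statscollectors\] INFO: Dumping Scrapy stats:\s*(\{.+?\})',
    re.DOTALL,
)

stats_text = pattern.search(log).group(1)
print(stats_text)
```

The lazy quantifier is safe here because the stats dict contains no nested braces; the capture ends at its closing brace rather than running on to later lines.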

If you are after only the lines from the statscollector then this should get you there (assuming that it's all on one line too):

^.*?\[scrapy\.statscollectors\] INFO: Dumping Scrapy stats: (\{.+\})$
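Whichever regex you use, note that the captured blob is a Python dict repr rather than JSON: the datetime.datetime(...) values mean both json.loads and ast.literal_eval will reject it. One hedged workaround, only reasonable for logs you generated yourself, is eval with a namespace that exposes nothing but datetime:

```python
import datetime

# A shortened blob, as the regex would capture it from the stats log.
blob = """{'downloader/request_bytes': 369,
 'finish_reason': 'finished',
 'start_time': datetime.datetime(2018, 6, 21, 14, 12, 58, 499385)}"""

# eval is only acceptable because this log comes from our own Scrapy run;
# never do this with untrusted input.
stats = eval(blob, {'__builtins__': {}, 'datetime': datetime})
print(stats['finish_reason'])
# → finished
```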

Upvotes: 3
