vdkotian

Reputation: 559

Extract JSON from Text in python

I want to extract JSON/dictionary from a log text.

The Sample log text:

2018-06-21 19:42:58 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'locations', 'CLOSESPIDER_TIMEOUT': '14400', 'FEED_FORMAT': 'geojson', 'LOG_FILE': '/geojson_dumps/21_Jun_2018_07_42_54/logs/coastalfarm.log', 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'locations.spiders', 'SPIDER_MODULES': ['locations.spiders'], 'TELNETCONSOLE_ENABLED': '0', 'USER_AGENT': 'Mozilla/5.0'}

2018-06-21 19:43:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 369,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 1718,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 6, 21, 14, 13, 0, 841666),
 'item_scraped_count': 4,
 'log_count/INFO': 8,
 'memusage/max': 56856576,
 'memusage/startup': 56856576,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 6, 21, 14, 12, 58, 499385)}

2018-06-21 19:43:00 [scrapy.core.engine] INFO: Spider closed (finished)

I have tried (\{.+$\}) as the regex expression, but it gives me the dict that sits on a single line, {'BOT_NAME': 'locations', ..., 'USER_AGENT': 'Mozilla/5.0'}, which is not what I expect.

The json/dictionary I want to extract is below. Note: the dictionary would not always have the same keys; they could differ.

{'downloader/request_bytes': 369,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 1718,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 6, 21, 14, 13, 0, 841666),
 'item_scraped_count': 4,
 'log_count/INFO': 8,
 'memusage/max': 56856576,
 'memusage/startup': 56856576,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 6, 21, 14, 12, 58, 499385)}

Upvotes: 2

Views: 5794

Answers (2)

Dustin Oprea

Reputation: 10236

Using a JSON tokenizer makes this a very simple and efficient task, as long as you have an anchor to search for in the original document that allows you to at least identify the beginning of the JSON blob. This uses json-five to extract JSON from HTML:

import json5.tokenizer

with open('5f32d5b4e2c432f660e1df44.html') as f:
    document = f.read()

search_for = "window.__INITIAL_STATE__="
i = document.index(search_for)
j = i + len(search_for)
extract_from = document[j:]

tokens = json5.tokenizer.tokenize(extract_from)
stack = []
collected = []
for token in tokens:
    collected.append(token.value)

    if token.type in ('LBRACE', 'LBRACKET'):
        stack.append(token)
    elif token.type in ('RBRACE', 'RBRACKET'):
        stack.pop()

    if not stack:
        break

json_blob = ''.join(collected)

Note that this accounts for the JSON being either a complex type (object, list) or a scalar.
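If pulling in json-five is not an option, the same brace-balancing idea can be sketched with a plain character scan. This is a rough stdlib-only approximation: unlike the tokenizer above, it assumes the blob starts with { or [ and it ignores the possibility of braces appearing inside string literals:

```python
def extract_balanced(text, anchor):
    """Return the brace-balanced blob that starts right after `anchor`.

    Rough sketch: counts braces/brackets character by character; does
    not handle scalars or braces inside string literals.
    """
    start = text.index(anchor) + len(anchor)
    depth = 0
    for i in range(start, len(text)):
        ch = text[i]
        if ch in '{[':
            depth += 1
        elif ch in '}]':
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    raise ValueError('unbalanced braces after anchor')

html = 'window.__INITIAL_STATE__={"a": [1, 2], "b": {"c": 3}};more'
print(extract_balanced(html, 'window.__INITIAL_STATE__='))
# → {"a": [1, 2], "b": {"c": 3}}
```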

Upvotes: 0

JohnKlehm

Reputation: 2398

Edit: The JSON spans multiple lines, so here's a regex that will match it:

import re

# `log` holds the full log-file text (e.g. from open(...).read()).
re_str = r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \[scrapy\.statscollectors\] INFO: Dumping Scrapy stats:\n(\{.+?\})'
stats_re = re.compile(re_str, re.DOTALL)

for match in stats_re.findall(log):
    print(match)
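As a quick self-contained check, here is the same idea run against a trimmed copy of the sample log from the question (the pattern anchors on the statscollectors line and lazily captures up to the first closing brace):

```python
import re

# Trimmed copy of the sample log from the question.
log = """2018-06-21 19:42:58 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'locations'}
2018-06-21 19:43:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 369,
 'finish_reason': 'finished',
 'item_scraped_count': 4}
2018-06-21 19:43:00 [scrapy.core.engine] INFO: Spider closed (finished)"""

# DOTALL lets . cross newlines; the lazy .+? stops at the first }.
pattern = re.compile(
    r'\[scrapy\.statscollectors\] INFO: Dumping Scrapy stats:\s*(\{.+?\})',
    re.DOTALL,
)

stats_text = pattern.search(log).group(1)
print(stats_text)
```

The lazy quantifier is safe here because the stats dict contains no nested braces; the capture ends at its closing brace rather than running on to later lines.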

If you are after only the lines from the statscollector then this should get you there (assuming that it's all on one line too):

^.*?\[scrapy\.statscollectors\] INFO: Dumping Scrapy stats: (\{.+\})$
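Whichever regex you use, note that the captured blob is a Python dict repr rather than JSON: the datetime.datetime(...) values mean both json.loads and ast.literal_eval will reject it. One hedged workaround, only reasonable for logs you generated yourself, is eval with a namespace that exposes nothing but datetime:

```python
import datetime

# A shortened blob, as the regex would capture it from the stats log.
blob = """{'downloader/request_bytes': 369,
 'finish_reason': 'finished',
 'start_time': datetime.datetime(2018, 6, 21, 14, 12, 58, 499385)}"""

# eval is only acceptable because this log comes from our own Scrapy run;
# never do this with untrusted input.
stats = eval(blob, {'__builtins__': {}, 'datetime': datetime})
print(stats['finish_reason'])
# → finished
```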

Upvotes: 3
