Reputation: 559
I want to extract JSON/dictionary from a log text.
The Sample log text:
2018-06-21 19:42:58 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'locations', 'CLOSESPIDER_TIMEOUT': '14400', 'FEED_FORMAT': 'geojson', 'LOG_FILE': '/geojson_dumps/21_Jun_2018_07_42_54/logs/coastalfarm.log', 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'locations.spiders', 'SPIDER_MODULES': ['locations.spiders'], 'TELNETCONSOLE_ENABLED': '0', 'USER_AGENT': 'Mozilla/5.0'}
2018-06-21 19:43:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 369,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 1718,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 6, 21, 14, 13, 0, 841666),
'item_scraped_count': 4,
'log_count/INFO': 8,
'memusage/max': 56856576,
'memusage/startup': 56856576,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2018, 6, 21, 14, 12, 58, 499385)}
2018-06-21 19:43:00 [scrapy.core.engine] INFO: Spider closed (finished)
I have tried (\{.+$\})
as the regex expression but it gives me the dict which is on a single line, {'BOT_NAME': 'locations',..., 'USER_AGENT': 'Mozilla/5.0'}
which is not what is expected.
The json/dictionary I want to extract: Note: the dictionary would not always have the same keys; they could differ.
{'downloader/request_bytes': 369,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 1718,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 6, 21, 14, 13, 0, 841666),
'item_scraped_count': 4,
'log_count/INFO': 8,
'memusage/max': 56856576,
'memusage/startup': 56856576,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2018, 6, 21, 14, 12, 58, 499385)}
Upvotes: 2
Views: 5794
Reputation: 10236
Using a JSON tokenizer makes this a very simple and efficient task, as long as you have an anchor to search for in the original document that allows you to at least identify the beginning of the JSON blob. This uses json-five to extract JSON from HTML:
import json5.tokenizer

with open('5f32d5b4e2c432f660e1df44.html') as f:
    document = f.read()

search_for = "window.__INITIAL_STATE__="
i = document.index(search_for)
j = i + len(search_for)
extract_from = document[j:]

tokens = json5.tokenizer.tokenize(extract_from)
stack = []
collected = []
for token in tokens:
    collected.append(token.value)
    if token.type in ('LBRACE', 'LBRACKET'):
        stack.append(token)
    elif token.type in ('RBRACE', 'RBRACKET'):
        stack.pop()
    if not stack:
        break
json_blob = ''.join(collected)
Note that this handles the JSON blob being either a complex type (object, array) or a scalar.
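The same depth-counting idea can be sketched without the json-five dependency, by scanning characters from the anchor and tracking brace depth. This is my own illustration, not from the answer above, and it assumes no '{' or '}' characters appear inside the blob's string values (true for Scrapy stats output):

```python
def extract_blob(text, anchor):
    """Return the first balanced {...} blob that follows `anchor` in `text`."""
    pos = text.index(anchor) + len(anchor)
    start = text.index('{', pos)  # first opening brace after the anchor
    depth = 0
    for i in range(start, len(text)):
        ch = text[i]
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    raise ValueError('unbalanced braces after anchor')

# shortened copy of the sample log from the question
log = """2018-06-21 19:43:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 369,
 'item_scraped_count': 4}
2018-06-21 19:43:00 [scrapy.core.engine] INFO: Spider closed (finished)"""

blob = extract_blob(log, 'Dumping Scrapy stats:')
```

Unlike a regex, depth counting also survives nested braces inside the blob.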
Upvotes: 0
Reputation: 2398
Edit: The JSON spans multiple lines so here's what will do it:
import re

# `log` holds the full log text; use a raw string so the backslashes
# are not treated as escape sequences
re_str = r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \[scrapy\.statscollectors] INFO: Dumping Scrapy stats:.({.+?\})'
stats_re = re.compile(re_str, re.MULTILINE | re.DOTALL)
for match in stats_re.findall(log):
    print(match)
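Applied to a shortened version of the sample log from the question (four-digit year in the pattern, raw string), this pulls out just the multi-line stats dict; re.DOTALL lets the . after "stats:" cross the newline, and the non-greedy .+? stops at the first closing brace:

```python
import re

# shortened copy of the sample log from the question
log = """2018-06-21 19:42:58 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'locations'}
2018-06-21 19:43:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 369,
 'item_scraped_count': 4}
2018-06-21 19:43:00 [scrapy.core.engine] INFO: Spider closed (finished)"""

re_str = r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \[scrapy\.statscollectors] INFO: Dumping Scrapy stats:.({.+?\})'
stats_re = re.compile(re_str, re.MULTILINE | re.DOTALL)
matches = stats_re.findall(log)
```

Only the statscollectors line matches, so the overridden-settings dict on the first line is skipped.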
If you are after only the line from the statscollector then this should get you there (assuming the dict is all on one line too):
^.*?\[scrapy\.statscollectors] INFO: Dumping Scrapy stats: (\{.+\})$
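One caveat when turning the matched text into a Python dict: it contains datetime.datetime(...) values, which ast.literal_eval() rejects. A pragmatic option for logs you trust (my own sketch, not part of the answer above) is eval() with a namespace restricted to the datetime module:

```python
import datetime

# extracted stats blob (shortened); contains a datetime repr, so
# ast.literal_eval() would raise ValueError on it
blob = ("{'finish_reason': 'finished',\n"
        " 'finish_time': datetime.datetime(2018, 6, 21, 14, 13, 0, 841666),\n"
        " 'item_scraped_count': 4}")

# expose only the datetime module to eval; this is a pragmatic sketch for
# trusted log files, not a safe general-purpose parser
stats = eval(blob, {'__builtins__': {}, 'datetime': datetime})
```

This gives a real dict with real datetime objects, e.g. stats['finish_time'].year.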
Upvotes: 3