ATM

Reputation: 51

How to extract a JSON object embedded between paragraphs of text?

I have the following string:

...some random text...

{
   "1":"one",
   "2":"two",
   "3":{
      "31":{
         "311":"threeoneone",
         "312":"threeonetwo",
         "313":"threeonethree"
      }
   },
   "4":{
      "41":"fourone",
      "42":"fourtwo",
      "43":"fourthree"
   },
   "5":"five",
   "6":"six"
}

...some more random text...

How can I extract the JSON from this? This is what I want to get.

{
  "1": "one",
  "2": "two",
  "3": {
    "31": {
      "311": "threeoneone",
      "312": "threeonetwo",
      "313": "threeonethree"
    }
  },
  "4": {
    "41": "fourone",
    "42": "fourtwo",
    "43": "fourthree"
  },
  "5": "five",
  "6": "six"
}

Is there a Pythonic way of getting this done?

Upvotes: 3

Views: 1433

Answers (3)

blhsing

Reputation: 107085

A more robust way to find JSON objects in mixed content, without making any assumptions about that content (the non-JSON text may contain unpaired curly brackets, JSON strings may themselves contain unpaired curly brackets, there may be several JSON objects, and so on), is to try parsing every substring that starts with a curly bracket { using the json.JSONDecoder.raw_decode method, which tolerates extra data after a JSON document. raw_decode takes a starting index as its second argument, which the regular decode method does not, so the index is supplied through a function closure instead. raw_decode also returns the index at which the valid JSON document ends, so that index becomes the starting point for finding the next {. The code uses the := operator, so it needs Python 3.8 or later:

import json

def RawJSONDecoder(index):
    class _RawJSONDecoder(json.JSONDecoder):
        end = None

        def decode(self, s, *_):
            data, self.__class__.end = self.raw_decode(s, index)
            return data
    return _RawJSONDecoder

def extract_json(s, index=0):
    while (index := s.find('{', index)) != -1:
        try:
            yield json.loads(s, cls=(decoder := RawJSONDecoder(index)))
            index = decoder.end
        except json.JSONDecodeError:
            index += 1

So that:

s = '''...some {{bad brackets} and empty brackets {} <= still valid JSON though...

{
   "1":"one",
   "2":"two",
   "3":{
      "31":{
         "311":"threeoneone",
         "312":"threeonetwo",
         "313":"threeonethree"
      }
   },
   "4":{
      "41":"fourone",
      "42":"fourtwo",
      "43":"fourthree"
   },
   "5":"five",
   "6":"six"
}

...some more random text...'''
print(*extract_json(s), sep='\n')

outputs:

{}
{'1': 'one', '2': 'two', '3': {'31': {'311': 'threeoneone', '312': 'threeonetwo', '313': 'threeonethree'}}, '4': {'41': 'fourone', '42': 'fourtwo', '43': 'fourthree'}, '5': 'five', '6': 'six'}

Demo: https://ideone.com/4aat8z
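For comparison, here is a minimal sketch of the same idea that calls raw_decode directly rather than going through json.loads and a decoder subclass. The function name iter_json_objects and the sample text are hypothetical, not taken from the answer; raw_decode(s, idx) itself is part of the standard json.JSONDecoder API and returns the parsed object together with the index just past it. The objects are printed with json.dumps(obj, indent=2), which should give the 2-space layout the question asks for.

```python
import json

def iter_json_objects(s):
    """Yield every JSON object found in s (hypothetical helper name)."""
    decoder = json.JSONDecoder()
    index = s.find('{')
    while index != -1:
        try:
            # raw_decode parses one document starting at index and
            # returns (object, index just past the document).
            obj, end = decoder.raw_decode(s, index)
        except json.JSONDecodeError:
            # Not valid JSON here; retry from the next opening brace.
            index = s.find('{', index + 1)
        else:
            yield obj
            index = s.find('{', end)

text = 'noise {{bad} {"a": {"b": 1}} more noise {"c": [1, 2]} end'
for obj in iter_json_objects(text):
    print(json.dumps(obj, indent=2))
```

This version also runs on Python versions older than 3.8, since it avoids the := operator.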

Upvotes: 6

andreihondrari

Reputation: 5833

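You could use a regex to pick out the JSON. Note that the pattern below only admits letters, digits, spaces, double quotes, colons, commas, braces and newlines inside the JSON, so it fits the sample but not arbitrary JSON: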

import re
import json

text = """
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis lacinia efficitur metus, eget finibus leo venenatis non. Sed id massa luctus, hendrerit mauris id, auctor tortor.

{
   "1":"one",
   "2":"two",
   "3":{
      "31":{
         "311":"threeoneone",
         "312":"threeonetwo",
         "313":"threeonethree"
      }
   },
   "4":{
      "41":"fourone",
      "42":"fourtwo",
      "43":"fourthree"
   },
   "5":"five",
   "6":"six"
}

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis lacinia efficitur metus, eget finibus leo venenatis non. Sed id massa luctus, hendrerit mauris id, auctor tortor.Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis lacinia efficitur metus, eget finibus leo venenatis non. Sed id massa luctus, hendrerit mauris id, auctor tortor.
"""

result = re.search(r'[a-zA-Z0-9 ,.\n]+(\{[a-zA-Z0-9 \":\{\},\n]+\})[a-zA-Z0-9 ,.\n]+', text)

# re.search returns None when nothing matches, so test for that
# before calling .group().
if result is None:
    print("No json found!")
else:
    json_string = result.group(1)
    json_data = json.loads(json_string)
    print(json_data)
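If the text is known to contain exactly one JSON object and no stray braces around it, a character-class regex may not be needed at all: the span from the first { to the last } can be sliced out and handed to json.loads. The following is only a sketch under that assumption, and the function name extract_single_json is hypothetical.

```python
import json

def extract_single_json(text):
    """Return the one JSON object embedded in text, or None if absent.

    Assumes at most one object and no braces in the surrounding text.
    """
    start = text.find('{')
    end = text.rfind('}')
    if start == -1 or end < start:
        return None
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None

sample = 'intro text\n{"1": "one", "3": {"31": "x-y_z"}}\nclosing text'
print(extract_single_json(sample))
```

Because the slice is validated by json.loads rather than by a character class, values containing characters such as hyphens or underscores are not a problem here.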

Upvotes: 0

miara

Reputation: 897

Assuming the JSON is not malformed, that everything enclosed in top-level curly braces is a JSON object, and that each top-level { and } sits alone on its own unindented line:

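jsons = []
with open(f) as o:  # f is the path to the input file
    parse_to_json = ""
    parsing_json_flag = False  # True while inside a JSON block
    for line in o:
        # Drop the trailing newline so the brace comparisons can match;
        # leading indentation is kept, so nested braces are ignored.
        stripped = line.rstrip()
        if stripped == "{":
            parsing_json_flag = True
        if parsing_json_flag:
            parse_to_json += line
        if stripped == "}":
            parsing_json_flag = False
            # Store the finished block before resetting the buffer.
            jsons.append(parse_to_json)
            parse_to_json = ""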

Now convert each string in the jsons list with your favorite JSON parsing library.
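As one way to do that conversion, here is a short sketch using the standard json module. The contents of jsons below are a made-up stand-in for whatever the loop above collected.

```python
import json

# Stand-in for the list built by the loop above (assumed contents).
jsons = ['{\n   "1":"one",\n   "4":{\n      "41":"fourone"\n   }\n}\n']

# json.loads turns each collected string into a Python dict.
parsed = [json.loads(chunk) for chunk in jsons]

# json.dumps with indent=2 re-serializes with 2-space indentation.
for obj in parsed:
    print(json.dumps(obj, indent=2))
```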

Upvotes: 0
