user3222101

Reputation: 1330

Traverse a string of dictionaries and store it as a single dictionary in Python

I have a sample string that looks like a list of dictionaries, but some values contain unescaped double quotes and commas, so it cannot be read with json.loads. I am writing code to find each key, extract its value up to the closing ",", and store the values per key as lists so the data can be converted to a dataframe.

example:

filtered_data = '''[
   {
      "_id":"1231",
      "address":"akjd-dfdkfj",
      "body":"Your one time password is "sdkd". Enter this in the form to confirm your value.",
      "date":"Thu May 10 23:34:11 GMT+05:30 2018"
   },
   {
      "_id":"1245",
      "address":"sdsd-dgfg",
      "body":"Dear Customer, Reference number is 3435.To check latest status, sms DROP DFGDG on 38388338. Thank you, ABC",
      "date":"Thu May 10 13:22:54 GMT+05:30 2018"
   }
]'''

Code written so far:

import re
id_locs  = [(m.start(0), m.end(0)) for m in re.finditer('_id', filtered_data)]

How can I extract a value by specifying "," as the end string?

expected output:

{
    "_id": [
        "1231",
        "1245"
    ],
    "address": [
        "akjd-dfdkfj",
        "sdsd-dgfg"
    ],
    "body": [
        "Your one time password is 'sdkd'. Enter this in the form to confirm your value.",
        "Dear Customer, Reference number is 3435.To check latest status, sms DROP DFGDG on 38388338. Thank you, ABC"
    ],
    "date": [
        "Thu May 10 23:34:11 GMT+05:30 2018",
        "Thu May 10 13:22:54 GMT+05:30 2018"
    ]
}

Upvotes: 0

Views: 73

Answers (4)

tevemadar

Reputation: 13195

Assuming the unescaped quotation marks occur only in the "body" lines, the string can be repaired into valid JSON and then parsed. What remains is the task of reshaping a list of dicts into a dict of lists.

import json,re

filtered_data = '''[
   {
      "_id":"1231",
      "address":"akjd-dfdkfj",
      "body":"Your one time password is "sdkd". Enter this in the form to confirm your value.",
      "date":"Thu May 10 23:34:11 GMT+05:30 2018"
   },
   {
      "_id":"1245",
      "address":"sdsd-dgfg",
      "body":"Dear Customer, Reference number is 3435.To check latest status, sms DROP DFGDG on 38388338. Thank you, ABC",
      "date":"Thu May 10 13:22:54 GMT+05:30 2018"
   }
]'''

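# Repair each "body" line: replace the double quotes inside the value with single quotes.
corrected_data = re.sub(
    r'^\s*"body":"(.*)",',
    lambda m: '"body":"' + m.group(1).replace('"', "'") + '",',
    filtered_data,
    flags=re.M,
)
dicts_in_list = json.loads(corrected_data)
# Reshape the list of dicts into a dict of lists, using the first record's keys.
lists_in_dict = {key: [item[key] for item in dicts_in_list] for key in dicts_in_list[0]}
print(lists_in_dict)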
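
The comprehension above reads its keys from the first record, so it assumes every record has the same fields. Below is a minimal sketch of a more tolerant reshape, assuming None is an acceptable filler for missing fields. The records are made up for illustration, and the second one deliberately lacks "date":

```python
# Hypothetical records; the second one has no "date" field.
records = [
    {"_id": "1231", "address": "akjd-dfdkfj", "date": "Thu May 10 23:34:11 GMT+05:30 2018"},
    {"_id": "1245", "address": "sdsd-dgfg"},
]

# Collect every key seen in any record, preserving first-seen order.
all_keys = list(dict.fromkeys(key for record in records for key in record))

# dict.get returns None for a key the record does not have.
lists_in_dict = {key: [record.get(key) for record in records] for key in all_keys}
print(lists_in_dict)
```

All lists end up with the same length, which is usually what a dataframe constructor expects.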

Upvotes: 1

jottbe

Reputation: 4521

A solution with regular expressions would look like:

import re

# A quoted key, a colon, then a lazily matched value that ends at '",' or at '"' followed by '}'.
patt = re.compile(r'"([^"]*)"\s*:\s*"(.*?)"(,|\s*\})')
result_dict = {}
pos = 0
while True:
    matcher = patt.search(filtered_data, pos=pos)
    if matcher is None:
        break
    key, value, _ = matcher.groups()
    result_dict.setdefault(key, []).append(value)
    pos = matcher.end()  # resume the search right after this match

The assumption is that a key/value pair always ends in '",' or '"\s*}', as in your example data.

With findall it looks a bit more compact:

patt = re.compile(r'"([^"]*)"\s*:\s*"(.*?)"(,|\s*\})')
result_dict = {}
for key, value, _ in patt.findall(filtered_data):
    result_dict.setdefault(key, []).append(value)
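
To make the pieces of the pattern easier to follow, here is a self-contained sketch of the findall variant written with re.VERBOSE and run on a shortened copy of the sample. The shortened data and the non-capturing terminator group are my own simplifications, not part of the code above:

```python
import re

# Shortened copy of the sample from the question (embedded quotes kept).
filtered_data = '''[
   {
      "_id":"1231",
      "body":"Your one time password is "sdkd". Enter this in the form.",
      "date":"Thu May 10 23:34:11 GMT+05:30 2018"
   },
   {
      "_id":"1245",
      "body":"Dear Customer, Reference number is 3435. Thank you, ABC",
      "date":"Thu May 10 13:22:54 GMT+05:30 2018"
   }
]'''

patt = re.compile(r'''
    "([^"]*)"      # key: everything between one pair of quotes
    \s*:\s*        # colon, optionally surrounded by whitespace
    "(.*?)"        # value: shortest run that still lets the rest match
    (?:,|\s*\})    # terminator: a comma, or optional whitespace and a brace
''', re.VERBOSE)

result_dict = {}
for key, value in patt.findall(filtered_data):
    result_dict.setdefault(key, []).append(value)

for key, values in result_dict.items():
    print(key, values)
```

Because the value is matched lazily but must be followed by the terminator, the quotes around sdkd stay inside the value instead of ending it.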

Upvotes: 0

Rakesh

Reputation: 82765

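This is one approach using regex: get the values with a lookbehind and a lookahead. Note that the lazy match stops at the first double quote after the key, so a value with embedded quotes (the first "body") is cut short, as the output shows.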

Ex:

import re

filtered_data = '''[
   {
      "_id":"1231",
      "address":"akjd-dfdkfj",
      "body":"Your one time password is "sdkd". Enter this in the form to confirm your value.",
      "date":"Thu May 10 23:34:11 GMT+05:30 2018"
   },
   {
      "_id":"1245",
      "address":"sdsd-dgfg",
      "body":"Dear Customer, Reference number is 3435.To check latest status, sms DROP DFGDG on 38388338. Thank you, ABC",
      "date":"Thu May 10 13:22:54 GMT+05:30 2018"
   }
]'''
keys = set(re.findall(r'\"(.+)\":', filtered_data))   #Get Keys
result = {}
for key in keys:
    result[key] = re.findall(r'(?<=\"{}":\")(.*?)(?=\",?)'.format(key), filtered_data)   #Get Values.

print(result)

Output:

{'_id': ['1231', '1245'],
 'address': ['akjd-dfdkfj', 'sdsd-dgfg'],
 'body': ['Your one time password is ',
          'Dear Customer, Reference number is 3435.To check latest status, sms '
          'DROP DFGDG on 38388338. Thank you, ABC'],
 'date': ['Thu May 10 23:34:11 GMT+05:30 2018',
          'Thu May 10 13:22:54 GMT+05:30 2018']}
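
If the truncated "body" is a problem, one possible variant anchors the lookahead at the end of the line, so only a quote that closes the line can end the value. This is a sketch, assuming every key/value pair sits on its own line as in the sample; the shortened data is for illustration only:

```python
import re

# Shortened copy of the sample from the question (embedded quotes kept).
filtered_data = '''[
   {
      "_id":"1231",
      "body":"Your one time password is "sdkd". Enter this in the form.",
      "date":"Thu May 10 23:34:11 GMT+05:30 2018"
   }
]'''

# Keys in first-seen order; [^"]+ keeps each match inside one pair of quotes.
keys = list(dict.fromkeys(re.findall(r'"([^"]+)":', filtered_data)))

result = {}
for key in keys:
    # The value ends at a quote followed by an optional comma and the end of the line.
    pattern = r'(?<="{}":")(.*?)(?=",?\s*$)'.format(re.escape(key))
    result[key] = re.findall(pattern, filtered_data, flags=re.M)

print(result)
```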

Upvotes: 0

jottbe

Reputation: 4521

If the string formed a valid JSON document, you could use json.loads (you probably just need to add '[' and ']' at the beginning and end of the string):

import json
str2="""[{"_id":"1231","address":"akjd-dfdkfj","body": "Your one time password is sdkd. Enter this in the form to confirm your value.","date":"Thu May 10 23:34:11 GMT+05:30 2018"},{"_id":"1245","address":"sdsd-dgfg","body":"Dear Customer, Reference number is 3435.To check latest status, sms DROP DFGDG on 38388338. Thank you, ABC","date":"Thu May 10 13:22:54 GMT+05:30 2018"}]"""
result_dicts = json.loads(str2)

And then "merge" the dictionaries together into one, like this:

result_dict= dict()
for res_dict in result_dicts:
    for key, value in res_dict.items():
        result_dict.setdefault(key, list()).append(value)

But if your string really looks like the one in your question, it is not valid JSON because of the unescaped double quotes (e.g. in "Your one time password is "sdkd". Enter this in the form to confirm your value."), so you need to parse it yourself.
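
A quick way to confirm this is to let json.loads try and catch the error. The snippet below is only a sketch; the one-record string is a shortened stand-in for the data in the question:

```python
import json

# One shortened record with the unescaped quotes around sdkd.
broken = '[{"_id":"1231","body":"Your one time password is "sdkd". Enter this."}]'

try:
    json.loads(broken)
    parsed_ok = True
except json.JSONDecodeError as err:
    parsed_ok = False
    # err.pos is the character offset at which the parser gave up.
    print("invalid JSON at offset", err.pos, "near", repr(broken[err.pos:err.pos + 6]))
```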

To apply a regex, you need to make some assumptions in order to cut the string into valid pieces. E.g., is it safe to assume that a field value is always double-quoted? Or can you assume that a field value never contains the character combinations <",> and <"}> (I use <> to delimit the strings)?

If so, you can build your regex so that it cuts out the substrings delimited by one of these strings to get the field names and field values. Without such assumptions, you cannot solve the problem.
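
The same idea also works without a regex if you are willing to assume a bit more. The sketch below uses plain string methods instead and assumes that every key/value pair sits on its own line and that key and value are joined by exactly ":" with no spaces around the colon, as in the sample. The shortened data is for illustration only:

```python
# Shortened copy of the sample from the question (embedded quotes kept).
filtered_data = '''[
   {
      "_id":"1231",
      "body":"Your one time password is "sdkd". Enter this in the form.",
      "date":"Thu May 10 23:34:11 GMT+05:30 2018"
   }
]'''

result = {}
for line in filtered_data.splitlines():
    line = line.strip()
    # Skip structural lines such as "[", "{", "}," and "]".
    if not line.startswith('"'):
        continue
    # The first '":"' on the line separates the key from the value.
    key, sep, value = line.partition('":"')
    if not sep:
        continue
    key = key[1:]               # drop the opening quote of the key
    value = value.rstrip(',')   # drop the pair separator, if present
    if value.endswith('"'):
        value = value[:-1]      # drop the closing quote of the value
    result.setdefault(key, []).append(value)

print(result)
```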

Upvotes: 0
