Regex to reformat improper JSON data

I have some data that was not saved properly in an old database. I am moving the system to a new database and reformatting the old data along the way. The old data looks like this:

a:10:{
    s:7:"step_no";s:1:"1";
    s:9:"YOUR_NAME";s:14:"Firtname Lastname";
    s:11:"CITIZENSHIP"; s:7:"Indian";
    s:22:"PROPOSE_NAME_BUSINESS1"; s:12:"ABC Limited";
    s:22:"PROPOSE_NAME_BUSINESS2"; s:15:"XYZ Investment";
    s:22:"PROPOSE_NAME_BUSINESS3";s:0:"";
    s:22:"PROPOSE_NAME_BUSINESS4";s:0:"";
    s:23:"PURPOSE_NATURE_BUSINESS";s:15:"Some dummy content";
    s:15:"CAPITAL_COMPANY";s:24:"20 Million Capital";
    s:14:"ANOTHER_AMOUNT";s:0:"";
}

I want the new data to be in proper JSON format so I can read it in Python, just like this:

data = {
    "step_no": "1",
    "YOUR_NAME":"Firtname Lastname",
    "CITIZENSHIP":"Indian",
    "PROPOSE_NAME_BUSINESS1":"ABC Limited",
    "PROPOSE_NAME_BUSINESS2":"XYZ Investment",
    "PROPOSE_NAME_BUSINESS3":"",
    "PROPOSE_NAME_BUSINESS4":"",
    "PURPOSE_NATURE_BUSINESS":"Some dummy content",
    "CAPITAL_COMPANY":"20 Million Capital",
    "ANOTHER_AMOUNT":""
}

I am thinking that using a regex to strip out the unwanted parts and rebuild the content from the names in caps would work, but I don't know how to go about it.

Upvotes: 0

Answers (1)

Martijn Pieters

Regexes would be the wrong approach here. There is no need, and the format is a little more complex than you assume it is.
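
One reason a regex struggles: every `s:N:"...";` entry declares the byte length of its value, so the value itself may legally contain quotes and semicolons. The following is only a minimal sketch of that idea, not how any library implements it; `read_php_string` and `sample` are hypothetical names made up for the illustration.

```python
import re

def read_php_string(buf: bytes, pos: int = 0):
    """Read one s:N:"..."; entry at pos and return (value, next_pos)."""
    if buf[pos:pos + 2] != b"s:":
        raise ValueError(f"no string entry at offset {pos}")
    colon = buf.index(b":", pos + 2)      # colon that ends the length field
    length = int(buf[pos + 2:colon])      # N is a byte count, not a character count
    start = colon + 2                     # skip the colon and the opening quote
    end = start + length
    # The closing '";' must sit exactly where the declared length says it does.
    if buf[colon + 1:start] != b'"' or buf[end:end + 2] != b'";':
        raise ValueError("declared length does not match the data")
    return buf[start:end], end + 2

# Hypothetical value that contains a quote and a semicolon.
sample = b's:10:"say "hi";x";'

# A non-greedy pattern stops at the first '";' it sees, inside the value.
naive = re.findall(rb's:\d+:"(.*?)";', sample)

# Trusting the length prefix recovers the whole value.
value, next_pos = read_php_string(sample)

print(naive)   # [b'say "hi']
print(value)   # b'say "hi";x'
```

The same check shows why the sample as posted cannot be loaded unchanged: `s:14:"Firtname Lastname";` declares 14 bytes but the value is 17 bytes long, so the reader above raises ValueError for it.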

You have data in the PHP serialize format. You can trivially deserialise it in Python with the phpserialize library:

import phpserialize
import json

def fixup_php_arrays(o):
    if isinstance(o, dict):
        if isinstance(next(iter(o), None), int):
            # PHP has no lists, only mappings; produce a list for
            # a dictionary with integer keys to 'repair'
            return [fixup_php_arrays(o[i]) for i in range(len(o))]
        return {k: fixup_php_arrays(v) for k, v in o.items()}
    return o

# yourdata is the serialized value as a bytes object
json.dumps(fixup_php_arrays(phpserialize.loads(yourdata, decode_strings=True)))

Note that PHP strings are byte strings, not Unicode text, so especially in Python 3 you'd have to decode your keys and values after the fact before you can re-encode them to JSON. The decode_strings=True flag takes care of this for you. The default codec is UTF-8; pass in a charset argument to pick a different one.
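
To make the bytes-versus-text point concrete, here is a small sketch that uses only the standard library. The `raw` dictionary is a hypothetical stand-in for a record whose keys and values are still bytes; it is built by hand here rather than produced by any library call.

```python
import json

# Hypothetical stand-in for a deserialized record before decoding.
raw = {b"CITIZENSHIP": b"Indian", b"step_no": b"1"}

try:
    json.dumps(raw)
    encodable = True
except TypeError:
    encodable = False   # json rejects bytes keys and bytes values

# Decode every key and value to text; UTF-8 is assumed here.
decoded = {k.decode("utf-8"): v.decode("utf-8") for k, v in raw.items()}
as_json = json.dumps(decoded, sort_keys=True)

print(encodable)  # False
print(as_json)    # {"CITIZENSHIP": "Indian", "step_no": "1"}
```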

PHP also uses arrays for sequences, so you may have to convert any decoded dict object with integer keys to a list first, which is what the fixup_php_arrays() function does.
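
As a rough illustration of that repair, the sketch below applies the same function to a hand-built dictionary. The `decoded` value and its `PROPOSED_NAMES` key are hypothetical; they only mimic the shape a nested PHP sequence would have after decoding, with integer keys starting at 0.

```python
import json

def fixup_php_arrays(o):
    if isinstance(o, dict):
        if isinstance(next(iter(o), None), int):
            # Integer keys: treat the mapping as a sequence.
            # This assumes the keys are exactly 0 .. len(o) - 1.
            return [fixup_php_arrays(o[i]) for i in range(len(o))]
        return {k: fixup_php_arrays(v) for k, v in o.items()}
    return o

# Hypothetical decoded structure containing a nested PHP sequence.
decoded = {
    "step_no": "1",
    "PROPOSED_NAMES": {0: "ABC Limited", 1: "XYZ Investment"},
}

repaired = fixup_php_arrays(decoded)
print(json.dumps(repaired, sort_keys=True))
# {"PROPOSED_NAMES": ["ABC Limited", "XYZ Investment"], "step_no": "1"}
```

Without the repair step, json.dumps() would still succeed, but it would emit the nested value as an object with the string keys "0" and "1" rather than as a JSON array.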

Demo (with repaired data; many string lengths in your sample were off and whitespace had been added):

>>> import phpserialize, json
>>> from pprint import pprint
>>> data = b'a:10:{s:7:"step_no";s:1:"1";s:9:"YOUR_NAME";s:18:"Firstname Lastname";s:11:"CITIZENSHIP";s:6:"Indian";s:22:"PROPOSE_NAME_BUSINESS1";s:11:"ABC Limited";s:22:"PROPOSE_NAME_BUSINESS2";s:14:"XYZ Investment";s:22:"PROPOSE_NAME_BUSINESS3";s:0:"";s:22:"PROPOSE_NAME_BUSINESS4";s:0:"";s:23:"PURPOSE_NATURE_BUSINESS";s:18:"Some dummy content";s:15:"CAPITAL_COMPANY";s:18:"20 Million Capital";s:14:"ANOTHER_AMOUNT";s:0:"";}'
>>> pprint(phpserialize.loads(data, decode_strings=True))
{'ANOTHER_AMOUNT': '',
 'CAPITAL_COMPANY': '20 Million Capital',
 'CITIZENSHIP': 'Indian',
 'PROPOSE_NAME_BUSINESS1': 'ABC Limited',
 'PROPOSE_NAME_BUSINESS2': 'XYZ Investment',
 'PROPOSE_NAME_BUSINESS3': '',
 'PROPOSE_NAME_BUSINESS4': '',
 'PURPOSE_NATURE_BUSINESS': 'Some dummy content',
 'YOUR_NAME': 'Firstname Lastname',
 'step_no': '1'}
>>> print(json.dumps(phpserialize.loads(data, decode_strings=True), sort_keys=True, indent=4))
{
    "ANOTHER_AMOUNT": "",
    "CAPITAL_COMPANY": "20 Million Capital",
    "CITIZENSHIP": "Indian",
    "PROPOSE_NAME_BUSINESS1": "ABC Limited",
    "PROPOSE_NAME_BUSINESS2": "XYZ Investment",
    "PROPOSE_NAME_BUSINESS3": "",
    "PROPOSE_NAME_BUSINESS4": "",
    "PURPOSE_NATURE_BUSINESS": "Some dummy content",
    "YOUR_NAME": "Firstname Lastname",
    "step_no": "1"
}

Upvotes: 2
