Reputation: 176
I have some data that are not properly saved in an old database. I am moving the system to a new database and reformatting the old data as well. The old data looks like this:
a:10:{
s:7:"step_no";s:1:"1";
s:9:"YOUR_NAME";s:14:"Firtname Lastname";
s:11:"CITIZENSHIP"; s:7:"Indian";
s:22:"PROPOSE_NAME_BUSINESS1"; s:12:"ABC Limited";
s:22:"PROPOSE_NAME_BUSINESS2"; s:15:"XYZ Investment";
s:22:"PROPOSE_NAME_BUSINESS3";s:0:"";
s:22:"PROPOSE_NAME_BUSINESS4";s:0:"";
s:23:"PURPOSE_NATURE_BUSINESS";s:15:"Some dummy content";
s:15:"CAPITAL_COMPANY";s:24:"20 Million Capital";
s:14:"ANOTHER_AMOUNT";s:0:"";
}
I want the new data to be in proper JSON format so I can read it in Python, just like this:
data = {
"step_no": "1",
"YOUR_NAME":"Firtname Lastname",
"CITIZENSHIP":"Indian",
"PROPOSE_NAME_BUSINESS1":"ABC Limited",
"PROPOSE_NAME_BUSINESS2":"XYZ Investment",
"PROPOSE_NAME_BUSINESS3":"",
"PROPOSE_NAME_BUSINESS4":"",
"PURPOSE_NATURE_BUSINESS":"Some dummy content",
"CAPITAL_COMPANY":"20 Million Capital",
"ANOTHER_AMOUNT":""
}
I am thinking that using a regex to strip out the unwanted parts and reformatting the content using the names in caps would work, but I don't know how to go about this.
Upvotes: 0
Views: 52
Reputation: 1124258
Regexes would be the wrong approach here. There is no need, and the format is a little more complex than you assume it is.
You have data in the PHP serialize format. You can trivially deserialise it in Python with the phpserialize library:
import json
import phpserialize

def fixup_php_arrays(o):
    if isinstance(o, dict):
        if isinstance(next(iter(o), None), int):
            # PHP has no lists, only mappings; produce a list for
            # a dictionary with integer keys to 'repair'
            return [fixup_php_arrays(o[i]) for i in range(len(o))]
        return {k: fixup_php_arrays(v) for k, v in o.items()}
    return o

json.dumps(fixup_php_arrays(phpserialize.loads(yourdata, decode_strings=True)))
Note that PHP strings are byte strings, not Unicode text, so in Python 3 in particular you would have to decode your keys and values after the fact before you could re-encode them as JSON. The decode_strings=True flag takes care of this for you. The default codec is UTF-8; pass in a charset argument to pick a different one.
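To illustrate why the decoding step matters, here is a minimal sketch of doing it by hand. The `decode_bytes` helper is hypothetical (it is not part of phpserialize), and the `raw` dictionary is only an assumed stand-in for what a loader returns when strings are left as bytes:

```python
import json

def decode_bytes(obj, encoding="utf-8"):
    """Recursively decode bytes keys and values to str (hypothetical helper)."""
    if isinstance(obj, bytes):
        return obj.decode(encoding)
    if isinstance(obj, dict):
        return {decode_bytes(k, encoding): decode_bytes(v, encoding)
                for k, v in obj.items()}
    return obj

# Assumed shape of an undecoded result: bytes keys and bytes values
raw = {b"CITIZENSHIP": b"Indian", b"step_no": b"1"}

try:
    json.dumps(raw)
except TypeError as exc:
    # json refuses bytes, so the data has to be decoded first
    print("cannot encode bytes:", exc)

print(json.dumps(decode_bytes(raw), sort_keys=True))
```

Letting the library decode for you is simpler; the sketch only shows what that flag saves you from writing.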
PHP also uses arrays for sequences, so you may have to convert any decoded dict object with integer keys to a list first, which is what the fixup_php_arrays() function does.
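One caveat: PHP arrays can have sparse integer keys (say 1 and 3), and indexing such a dictionary with range(len(o)) would raise a KeyError. If your data might contain those, a stricter variant could look like the sketch below. The name `to_list_if_sequential` and the sample `decoded` data are made up for illustration:

```python
def to_list_if_sequential(o):
    """Convert dicts keyed exactly 0..n-1 to lists; leave other dicts alone."""
    if isinstance(o, dict):
        keys_are_ints = bool(o) and all(isinstance(k, int) for k in o)
        if keys_are_ints and sorted(o) == list(range(len(o))):
            return [to_list_if_sequential(o[i]) for i in range(len(o))]
        return {k: to_list_if_sequential(v) for k, v in o.items()}
    return o

# Invented sample: one sequential array, one sparse array
decoded = {
    "owners": {0: "Alice", 1: "Bob"},
    "shares": {1: "40", 3: "60"},
}
print(to_list_if_sequential(decoded))
# {'owners': ['Alice', 'Bob'], 'shares': {1: '40', 3: '60'}}
```

The sparse mapping stays a dictionary, and json.dumps() will then write its integer keys as strings.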
Demo (with repaired data; many string lengths were off and whitespace had been added):
>>> import phpserialize, json
>>> from pprint import pprint
>>> data = b'a:10:{s:7:"step_no";s:1:"1";s:9:"YOUR_NAME";s:18:"Firstname Lastname";s:11:"CITIZENSHIP";s:6:"Indian";s:22:"PROPOSE_NAME_BUSINESS1";s:11:"ABC Limited";s:22:"PROPOSE_NAME_BUSINESS2";s:14:"XYZ Investment";s:22:"PROPOSE_NAME_BUSINESS3";s:0:"";s:22:"PROPOSE_NAME_BUSINESS4";s:0:"";s:23:"PURPOSE_NATURE_BUSINESS";s:18:"Some dummy content";s:15:"CAPITAL_COMPANY";s:18:"20 Million Capital";s:14:"ANOTHER_AMOUNT";s:0:"";}'
>>> pprint(phpserialize.loads(data, decode_strings=True))
{'ANOTHER_AMOUNT': '',
 'CAPITAL_COMPANY': '20 Million Capital',
 'CITIZENSHIP': 'Indian',
 'PROPOSE_NAME_BUSINESS1': 'ABC Limited',
 'PROPOSE_NAME_BUSINESS2': 'XYZ Investment',
 'PROPOSE_NAME_BUSINESS3': '',
 'PROPOSE_NAME_BUSINESS4': '',
 'PURPOSE_NATURE_BUSINESS': 'Some dummy content',
 'YOUR_NAME': 'Firstname Lastname',
 'step_no': '1'}
>>> print(json.dumps(phpserialize.loads(data, decode_strings=True), sort_keys=True, indent=4))
{
    "ANOTHER_AMOUNT": "",
    "CAPITAL_COMPANY": "20 Million Capital",
    "CITIZENSHIP": "Indian",
    "PROPOSE_NAME_BUSINESS1": "ABC Limited",
    "PROPOSE_NAME_BUSINESS2": "XYZ Investment",
    "PROPOSE_NAME_BUSINESS3": "",
    "PROPOSE_NAME_BUSINESS4": "",
    "PURPOSE_NATURE_BUSINESS": "Some dummy content",
    "YOUR_NAME": "Firstname Lastname",
    "step_no": "1"
}
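If your real records have the same wrong length prefixes and stray whitespace as the sample in the question, you would need to repair them before parsing. The following is only a heuristic sketch, not a parser: it assumes the data contains nothing but string tokens and that no string value contains the sequence `";`. The name `repair_string_lengths` is made up for this example:

```python
import re

# One string token, plus any whitespace around it; DOTALL lets values span lines
_STRING_TOKEN = re.compile(rb'\s*s:\d+:"(.*?)";\s*', re.DOTALL)

def repair_string_lengths(raw: bytes) -> bytes:
    """Recompute each s:N:"..." length prefix and drop whitespace between tokens."""
    def fix(match):
        value = match.group(1)
        # PHP counts bytes, not characters, so len() on bytes is the right measure
        return b's:%d:"%s";' % (len(value), value)
    return _STRING_TOKEN.sub(fix, raw)

broken = (b'a:2:{\ns:9:"YOUR_NAME";s:14:"Firstname Lastname";\n'
          b's:11:"CITIZENSHIP"; s:7:"Indian";\n}')
print(repair_string_lengths(broken))
```

Spot-check the repaired records by loading a few of them before you trust the result for the whole database.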
Upvotes: 2