Reputation: 745
I have a large amount of data of this type:
array(14) {
["ap_id"]=>
string(5) "22755"
["user_id"]=>
string(4) "8872"
["exam_type"]=>
string(32) "PV Technical Sales Certification"
["cert_no"]=>
string(12) "PVTS081112-2"
["explevel"]=>
string(1) "0"
["public_state"]=>
string(2) "NY"
["public_zip"]=>
string(5) "11790"
["email"]=>
string(19) "[email protected]"
["full_name"]=>
string(15) "Ivor Abeysekera"
["org_name"]=>
string(21) "Zero Energy Homes LLC"
["org_website"]=>
string(14) "www.zeroeh.com"
["city"]=>
string(11) "Stony Brook"
["state"]=>
string(2) "NY"
["zip"]=>
string(5) "11790"
}
I wrote a for loop in Python which reads through the file, creating a dictionary for each array and storing the elements like so:
a = 0
data = [{}]
with open("mess.txt") as messy:
    lines = messy.readlines()
    for i in range(1, len(lines)):
        line = lines[i]
        if "public_state" in line:
            data[a]['state'] = lines[i + 1]
        elif "public_zip" in line:
            data[a]['zip'] = lines[i + 1]
        elif "email" in line:
            data[a]['email'] = lines[i + 1]
        elif "full_name" in line:
            data[a]['contact'] = lines[i + 1]
        elif "org_name" in line:
            data[a]['name'] = lines[i + 1]
        elif "org_website" in line:
            data[a]['website'] = lines[i + 1]
        elif "city" in line:
            data[a]['city'] = lines[i + 1]
        elif "}" in line:
            a += 1
            data.append({})
I know my code is terrible, but I am fairly new to Python. As you can see, the bulk of my project is complete. What's left is to strip away the type tags from the actual data. For example, I need string(15) "Ivor Abeysekera" to become Ivor Abeysekera". After some research, I considered .lstrip(), but since the preceding text is always different, I got stuck.
Does anyone have a clever way of solving this problem? Cheers!
Edit: I am using Python 2.7 on Windows 7.
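Worth noting for anyone tempted by the same route: .lstrip() strips any leading run of the given characters (treated as a set), not a literal prefix string, so it can eat into the value itself. A quick sketch of the failure mode:

```python
# .lstrip() removes any leading run of the given characters (a set),
# not the literal prefix string, so it can chew into the value itself.
s = 'string(7) "granite"'
stripped = s.lstrip('string(7) "')
print(stripped)  # anite" -- the leading 'g' and 'r' of the value were eaten too
```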
Upvotes: 2
Views: 340
Reputation: 30947
You can do this statefully by looping across all the lines and keeping track of where you are in a block:
# Map field names to dict keys
fields = {
    'public_state': 'state',
    'public_zip': 'zip',
    'email': 'email',
    'full_name': 'contact',
    'org_name': 'name',
    'org_website': 'website',
    'city': 'city',
}

data = []
current = {}
key = None
with open("mess.txt") as messy:
    for line in messy:  # iterate the file object directly, line by line
        line = line.lstrip()
        if line.startswith('}'):
            data.append(current)
            current = {}
        elif line.startswith('['):
            keyname = line.split('"')[1]
            key = fields.get(keyname)
        elif key is not None:
            # Get everything between the first and last quotes on the line
            value = line.split('"', 1)[1].rsplit('"', 1)[0]
            current[key] = value
            key = None  # reset so stray lines between fields are skipped
This avoids having to keep track of your position in the file, and also means that you could work across enormous data files (if you process the dictionary after each record) without having to load the whole thing into memory at once. In fact, let's restructure that as a generator that processes blocks of data at a time and yields dicts for you to work with:
fields = {
    'public_state': 'state',
    'public_zip': 'zip',
    'email': 'email',
    'full_name': 'contact',
    'org_name': 'name',
    'org_website': 'website',
    'city': 'city',
}

def dict_maker(fileobj):
    current = {}
    key = None
    for line in fileobj:
        line = line.lstrip()
        if line.startswith('}'):
            yield current
            current = {}
        elif line.startswith('['):
            keyname = line.split('"')[1]
            key = fields.get(keyname)
        elif key is not None:
            # Get everything between the first and last quotes on the line
            value = line.split('"', 1)[1].rsplit('"', 1)[0]
            current[key] = value
            key = None  # reset so stray lines between fields are skipped

with open("mess.txt") as messy:
    for d in dict_maker(messy):
        print d
That makes your main loop tiny and understandable: you loop across the potentially enormous set of dicts, one at a time, and do something with them. It totally separates the act of making the dictionaries from the act of consuming them. And since the generator is stateful and only processes one line at a time, you could pass in anything that looks like a file: a list of strings, the output of a web request, input from another program writing to sys.stdin, or whatever.
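For instance, the same generator runs unchanged over a plain list of strings (a sketch using a trimmed copy of dict_maker with only two fields mapped):

```python
# Trimmed copy of the dict_maker generator above, driven by a list of
# strings instead of a file object -- anything iterable line by line works.
fields = {'full_name': 'contact', 'city': 'city'}

def dict_maker(fileobj):
    current = {}
    key = None
    for line in fileobj:
        line = line.lstrip()
        if line.startswith('}'):
            yield current
            current = {}
        elif line.startswith('['):
            key = fields.get(line.split('"')[1])
        elif key is not None:
            current[key] = line.split('"', 1)[1].rsplit('"', 1)[0]
            key = None

lines = [
    '["full_name"]=>',
    'string(15) "Ivor Abeysekera"',
    '["city"]=>',
    'string(11) "Stony Brook"',
    '}',
]
print(list(dict_maker(lines)))  # [{'contact': 'Ivor Abeysekera', 'city': 'Stony Brook'}]
```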
Upvotes: 0
Reputation: 521
You should use regular expressions (regex) for this: http://docs.python.org/2/library/re.html
What you intend to do can be easily done with the following code:
# Import the library
import re
# This is a string just to demonstrate
a = 'string(32) "PV Technical Sales Certification"'
# Create the regex
p = re.compile('[^"]+"(.*)"$')
# Find a match
m = p.match(a)
# Your result will be now in s
s = m.group(1)
Hope this helps!
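Dropped into the loop from the question, the compiled pattern would replace each raw lines[i + 1] read. A sketch, with a guard for lines that carry no quoted value:

```python
import re

# Compile once, then apply the same pattern to every value line.
p = re.compile('[^"]+"(.*)"$')

for raw in ['string(15) "Ivor Abeysekera"', 'string(2) "NY"']:
    m = p.match(raw)
    if m:  # skip lines without a quoted value
        print(m.group(1))
```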
Upvotes: 1
Reputation: 113988
BAD SOLUTION (based on the current question)
To answer your question as asked, just use:
info_string = lines[i + 1]
value_str = info_string.split(" ",1)[-1].strip(" \"")
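Applied to one of the sample lines, that split-and-strip gives, for example:

```python
# Split off the "string(NN)" prefix, then strip spaces and quotes
# from what remains.
info_string = 'string(15) "Ivor Abeysekera"'
value_str = info_string.split(" ", 1)[-1].strip(" \"")
print(value_str)  # Ivor Abeysekera
```

(Note that a value which itself begins or ends with a space or a quote would also get trimmed.)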
BETTER SOLUTION
Do you have access to the PHP that is generating this dump? If you do, just use echo json_encode($data); instead of var_dump. The JSON output will look like
{"variable": "value", "variable2": "value2"}
You can then read it in like this:
import json
import requests  # third-party library, used here to fetch the dump over HTTP

json_str = requests.get("http://url.com/json_dump").text  # or however you get the original text
data = json.loads(json_str)
print data
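As a quick check of what json.loads hands back for output in that shape (a sketch with made-up field values, parsed straight from a string):

```python
import json

# A small made-up json_encode-style payload.
json_str = '{"full_name": "Ivor Abeysekera", "city": "Stony Brook"}'
data = json.loads(json_str)
print(data["full_name"])  # Ivor Abeysekera
```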
Upvotes: 2
Reputation: 5186
Depending on how the code tags are formatted, you could split the line on " and then pick out the second element.
s = 'string(15) "Ivor Abeysekera"'
temp = s.split('"')[1]
# temp is 'Ivor Abeysekera'
Note that this will get rid of the trailing "; if you need it, you can always add it back on. In your example this would look like:
data[a]['state'] = lines[i + 1].split('"')[1]
# etc. for each call of lines[i + 1]
Because you are calling it so much (regardless of what answer you use) you should probably turn it into a function:
def prepare_data(line_to_fix):
    return line_to_fix.split('"')[1]

# later on...
data[a]['state'] = prepare_data(lines[i + 1])
This will give you some more flexibility.
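One caveat: lines without a quoted value (such as a bare } or a blank line) would make split('"')[1] raise an IndexError, so a slightly more defensive version of the helper might look like this (a sketch, not part of the original answer):

```python
def prepare_data(line_to_fix):
    # Fall back to the stripped line when there is no quoted value to extract.
    parts = line_to_fix.split('"')
    return parts[1] if len(parts) > 2 else line_to_fix.strip()

print(prepare_data('string(15) "Ivor Abeysekera"'))  # Ivor Abeysekera
print(prepare_data('}'))  # }
```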
Upvotes: 2