Jacob Bridges

Reputation: 745

Splitting or stripping a variable number of characters from a line of text in Python?

I have a large amount of data of this type:

  array(14) {
    ["ap_id"]=>
    string(5) "22755"
    ["user_id"]=>
    string(4) "8872"
    ["exam_type"]=>
    string(32) "PV Technical Sales Certification"
    ["cert_no"]=>
    string(12) "PVTS081112-2"
    ["explevel"]=>
    string(1) "0"
    ["public_state"]=>
    string(2) "NY"
    ["public_zip"]=>
    string(5) "11790"
    ["email"]=>
    string(19) "[email protected]"
    ["full_name"]=>
    string(15) "Ivor Abeysekera"
    ["org_name"]=>
    string(21) "Zero Energy Homes LLC"
    ["org_website"]=>
    string(14) "www.zeroeh.com"
    ["city"]=>
    string(11) "Stony Brook"
    ["state"]=>
    string(2) "NY"
    ["zip"]=>
    string(5) "11790"
  }

I wrote a for loop in Python that reads through the file, creating a dictionary for each array and storing elements like so:

a = 0
data = [{}]

with open( "mess.txt" ) as messy:
        lines = messy.readlines()
        for i in range( 1, len(lines) ):
            line = lines[i]
            if "public_state" in line:
                data[a]['state'] = lines[i + 1]
            elif "public_zip" in line:
                data[a]['zip'] = lines[i + 1]
            elif "email" in line:
                data[a]['email'] = lines[i + 1]
            elif "full_name" in line:
                data[a]['contact'] = lines[i + 1]
            elif "org_name" in line:
                data[a]['name'] = lines[i + 1]
            elif "org_website" in line:
                data[a]['website'] = lines[i + 1]
            elif "city" in line:
                data[a]['city'] = lines[i + 1]
            elif "}" in line:
                a += 1
                data.append({})

I know my code is terrible, but I am fairly new to Python. As you can see, the bulk of my project is complete. What's left is to strip away the code tags from the actual data. For example, I need string(15) "Ivor Abeysekera" to become Ivor Abeysekera".

After some research, I considered .lstrip(), but since the preceding text is always different, I got stuck.

Does anyone have a clever way of solving this problem? Cheers!

Edit: I am using Python 2.7 on Windows 7.

Upvotes: 2

Views: 340

Answers (4)

Kirk Strauser

Reputation: 30947

You can do this statefully by looping across all the lines and keeping track of where you are in a block:

# Map field names to dict keys
fields = {
    'public_state': 'state',
    'public_zip': 'zip',
    'email': 'email',
    'full_name': 'contact',
    'org_name': 'name',
    'org_website': 'website',
    'city': 'city',
}

data = []
current = {}
key = None
with open( "mess.txt" ) as messy:
    # A file object has no .split(); iterate over it line by line instead
    for line in messy:
        line = line.lstrip()
        if line.startswith('}'):
            data.append(current)
            current = {}
        elif line.startswith('['):
            keyname = line.split('"')[1]
            key = fields.get(keyname)
        elif key is not None:
            # Get everything between the first and last quotes on the line
            value = line.split('"', 1)[1].rsplit('"', 1)[0]
            current[key] = value

This avoids having to keep track of your position in the file, and also means that you could work across enormous data files (if you process the dictionary after each record) without having to load the whole thing into memory at once. In fact, let's restructure that as a generator that processes blocks of data at a time and yields dicts for you to work with:

fields = {
    'public_state': 'state',
    'public_zip': 'zip',
    'email': 'email',
    'full_name': 'contact',
    'org_name': 'name',
    'org_website': 'website',
    'city': 'city',
}

def dict_maker(fileobj):
    current = {}
    key = None
    for line in fileobj:
        line = line.lstrip()
        if line.startswith('}'):
            yield current
            current = {}
        elif line.startswith('['):
            keyname = line.split('"')[1]
            key = fields.get(keyname)
        elif key is not None:
            # Get everything between the first and last quotes on the line
            value = line.split('"', 1)[1].rsplit('"', 1)[0]
            current[key] = value

with open("mess.txt") as messy:
    for d in dict_maker(messy):
        print d

That makes your main loop tiny and understandable: you loop across the potentially enormous set of dicts, one at a time, and do something with them. It totally separates the act of making the dictionaries from the act of consuming them. And since the generator is stateful and only processes one line at a time, you could pass in anything that looks like a file: a list of strings, the output of a web request, input from another program piped to sys.stdin, or whatever.
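As a rough illustration of that last point, here is a sketch that feeds the same kind of generator a plain list of strings instead of an open file. The sample lines are made up for the example, the fields mapping is trimmed to two entries, and resetting key after each record is an addition of this sketch (it keeps a following "array(...) {" line from being parsed as a value):

```python
# Trimmed-down mapping, for illustration only
fields = {
    'full_name': 'contact',
    'city': 'city',
}

def dict_maker(lines):
    current = {}
    key = None
    for line in lines:
        line = line.strip()
        if line.startswith('}'):
            yield current
            current = {}
            key = None  # addition in this sketch: forget the last key between records
        elif line.startswith('['):
            keyname = line.split('"')[1]
            key = fields.get(keyname)
        elif key is not None:
            # Everything between the first and last quotes on the line
            value = line.split('"', 1)[1].rsplit('"', 1)[0]
            current[key] = value

# Hypothetical input: any iterable of strings works, not just a file
sample = [
    'array(2) {',
    '  ["full_name"]=>',
    '  string(15) "Ivor Abeysekera"',
    '  ["city"]=>',
    '  string(11) "Stony Brook"',
    '}',
    'array(2) {',
    '  ["full_name"]=>',
    '  string(8) "Jane Doe"',
    '  ["city"]=>',
    '  string(6) "Albany"',
    '}',
]

records = list(dict_maker(sample))
print(records)
```

Calling list() here is only for the demonstration; with a large file you would loop over the generator directly so that only one record is held at a time.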

Upvotes: 0

ror3d

Reputation: 521

You should use regular expressions (regex) for this: http://docs.python.org/2/library/re.html

What you intend to do can be easily done with the following code:

# Import the library
import re

# This is a string just to demonstrate
a = 'string(32) "PV Technical Sales Certification"'

# Create the regex
p = re.compile('[^"]+"(.*)"$')

# Find a match
m = p.match(a)

# Your result will now be in s
s = m.group(1)
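
One thing to watch: p.match returns None for lines that do not fit the pattern (the ["full_name"]=> key lines, for instance), so calling m.group(1) on them raises an AttributeError. A minimal sketch of a guard, assuming every value line looks like string(N) "..."; the helper name extract_value and the stricter pattern are made up for this example:

```python
import re

# Matches: string(<digits>) "<value>" and captures the value
VALUE_RE = re.compile(r'string\(\d+\)\s+"(.*)"')

def extract_value(line):
    """Return the quoted value from a var_dump string line, or None if absent."""
    m = VALUE_RE.search(line)
    return m.group(1) if m else None

value = extract_value('    string(15) "Ivor Abeysekera"\n')
missing = extract_value('    ["full_name"]=>\n')
print(value)
print(missing)
```

Using search rather than match means leading indentation does not need to be part of the pattern.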

Hope this helps!

Upvotes: 1

Joran Beasley

Reputation: 113988

BAD SOLUTION (based on the current question)

To answer your question as asked, just use:

# Strip the indentation and newline first, then drop the "string(N)" prefix
info_string = lines[i + 1].strip()
value_str = info_string.split(" ", 1)[-1].strip('"')

BETTER SOLUTION

Do you have access to the PHP generating that? If you do, just use echo json_encode($data); instead of var_dump.

If you have it output JSON instead, the output will look like:

{"variable":"value","variable2":"value2"}

You can then read it in like this:

import json
import requests  # third-party library, only needed if you fetch the text over HTTP
json_str = requests.get("http://url.com/json_dump").text  # or however you get the original text
data = json.loads(json_str)
print data
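
For a version that runs without a network call, here is a small sketch that parses a JSON string directly. The JSON text is a made-up stand-in for what json_encode might produce for one record, and the renames mapping simply mirrors the key names used in the question:

```python
import json

# Hypothetical json_encode() output for a single record
json_str = ('{"full_name": "Ivor Abeysekera", '
            '"org_name": "Zero Energy Homes LLC", '
            '"city": "Stony Brook", "explevel": "0"}')

record = json.loads(json_str)

# Keep only the wanted fields and rename them as in the question
renames = {'full_name': 'contact', 'org_name': 'name', 'city': 'city'}
cleaned = dict((renames[k], v) for k, v in record.items() if k in renames)
print(cleaned)
```

No string stripping is needed at all here, since the values arrive already unquoted.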

Upvotes: 2

mr2ert

Reputation: 5186

Depending on how the code tags are formatted, you could split the line on " then pick out the second element.

s = 'string(15) "Ivor Abeysekera"'
temp = s.split('"')[1]
# temp is 'Ivor Abeysekera'

Note that this will get rid of the trailing "; if you need it, you can always add it back on. In your example this would look like:

data[a]['state'] = lines[i + 1].split('"')[1]
# etc. for each call of lines[i + 1]

Because you are calling it so much (regardless of what answer you use) you should probably turn it into a function:

def prepare_data(line_to_fix):
    return line_to_fix.split('"')[1]
# later on...
data[a]['state'] = prepare_data(lines[i + 1])

This will give you some more flexibility.
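
As one example of that flexibility, the function can later be made more forgiving without touching the calling code. The sketch below is a possible variant, not a required change: it takes everything between the first and last quote (so a value that itself contains a " is not cut short) and returns an empty string when the line has no quoted value, which is a choice made for this example:

```python
def prepare_data(line_to_fix):
    """Return the text between the first and last double quote, or '' if none."""
    first = line_to_fix.find('"')
    last = line_to_fix.rfind('"')
    if first == -1 or last == first:
        # No quotes, or only one: nothing usable on this line
        return ''
    return line_to_fix[first + 1:last]

plain = prepare_data('    string(15) "Ivor Abeysekera"\n')
nested = prepare_data('string(13) "Bob "Bo" Smith"')
empty = prepare_data('int(5)')
print(plain)
print(nested)
```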

Upvotes: 2
