How to search for multiple data from multiple lines and store them in dictionary?

Question

Say I have a file with the following:

/* Full name: abc */
.....
.....(.....)
.....(".....) ;
/* .....
/* .....
..... : "....."
}
"....., .....
Car : true ;
House : true ;
....
....
Age : 33
....
/* Full name: xyz */
....
....
Car : true ;
....
....
Age : 56
....

I am only interested in the full name, car, house and age of each person. Many other lines, in different formats, appear between the attributes I am interested in.

My code so far:

import re

initial_val = {'House': 'false', 'Car': 'false'}

with open('input.txt') as f:
    records = []
    current_record = None
    for line in f:
        if not line.strip():
            continue
        elif current_record is None:
            people_name = re.search('.+Full name ?: (.+) ', line)
            if people_name:
                current_record = dict(initial_val, Name = people_name.group(1))
            else:
                continue
        elif current_record is not None:
            house = re.search(' *(House) ?: ?([a-z]+)', line)
            if house:
                current_record['House'] = house.group(2)
            car = re.search(' *(Car) ?: ?([a-z]+)', line)
            if car:
                current_record['Car'] = car.group(2)
            people_name = re.search('.+Full name ?: (.+) ', line)
            if people_name:
                records.append(current_record)
                current_record = dict(initial_val, Name = people_name.group(1))                       

print records

What I get:

[{'Name': 'abc', 'House': 'true', 'Car': 'true'}]

My question:

How can I extract the data and store it in a dictionary like this:

{'abc': {'Car': true, 'House': true, 'Age': 33}, 'xyz':{'Car': true, 'House': false, 'Age': 56}}

My purpose:

Check whether each person has a car, a house and an age; if a field is missing, report false.

Then I could print them in a table like this:

Name Car House Age
abc true true 33
xyz true false 56

Note that I am using Python 2.7 and I do not know the actual values of each person's attributes in advance (e.g. abc, true, true, 33).

What is the best solution to my question? Thanks.

Bakuriu · Accepted Answer

Well, you just have to keep track of the current record:

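def parse_name(line):
    # first remove surrounding whitespace, then the initial '/* ' and final ' */'
    stripped_line = line.strip().strip('/* ')
    # keep only the text after the colon, without the leading space
    return stripped_line.split(':')[-1].strip()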


WANTED_KEYS = ('Car', 'Age', 'House')

# default values for when the lines are not present for a record
INITIAL_VAL = {'Car': False, 'House': False, 'Age': -1}

with open('the_filename') as f:
    records = []
    current_record = None

    for line in f:
        if not line.strip():
            # skip empty lines
            continue
        elif line.startswith('/*') and 'Full name' in line:
            # a new record starts here; store the previous one, if any
            # (other '/* ...' comment lines do not start a record)
            if current_record is not None:
                records.append(current_record)
            current_record = dict(INITIAL_VAL, name=parse_name(line))
        elif current_record is not None and ':' in line:
            # split on the first colon only, so lines with several colons do not fail
            key, _, val = line.partition(':')
            if key.strip() in WANTED_KEYS:
                # we want to keep track of this field; drop a trailing ';' if present
                current_record[key.strip()] = val.strip().rstrip(';').strip()
        # otherwise just ignore the line

    # the last record is not followed by another header, so store it here
    if current_record is not None:
        records.append(current_record)


print('Name\tCar\tHouse\tAge')
for record in records:
    # joining by hand also works on Python 2.7, where print(..., sep=...) needs a __future__ import
    print('\t'.join(str(record[k]) for k in ('name', 'Car', 'House', 'Age')))

Note that for Age you may want to convert it to an integer using int:

if key.strip() == 'Age':
    current_record['Age'] = int(val)
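
The same idea extends to the boolean fields. Below is a minimal sketch, assuming the file spells them as the strings true and false; convert_value is a hypothetical helper name, not part of the code above:

```python
def convert_value(key, raw):
    """Turn the raw text after the colon into a Python value."""
    # drop surrounding whitespace and an optional trailing ';'
    text = raw.strip().rstrip(';').strip()
    if key == 'Age':
        return int(text)
    # assumption: anything other than 'true' counts as False
    return text.lower() == 'true'


print(convert_value('Age', ' 33\n'))      # 33
print(convert_value('Car', ' true ;\n'))  # True
```

Inside the parsing loop this could replace the plain val.strip() when storing a wanted field.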

The above code produces a list of dictionaries, but it is easy enough to convert it to a dictionary of dicts:

new_records = {r['name']: dict(r) for r in records}
for val in new_records.values():
    del val['name']

After this new_records will be something like:

{'abc': {'Car': True, 'House': True, 'Age': 20}, ...}
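
For illustration, here is the same conversion on two hand-written records (the sample values are assumed from the question's input), done in a single pass with dict.pop instead of a separate del loop:

```python
# Sample records as the parsing loop might produce them (values assumed).
records = [
    {'name': 'abc', 'Car': 'true', 'House': 'true', 'Age': 33},
    {'name': 'xyz', 'Car': 'true', 'House': False, 'Age': 56},
]

new_records = {}
for record in records:
    fields = dict(record)  # copy, so the entries in `records` stay untouched
    new_records[fields.pop('name')] = fields

print(new_records['xyz'])
```

Note that two people with the same name would overwrite each other in this layout, which the list of dictionaries does not suffer from.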

If other lines with a different format appear between the interesting ones, you can write a function that returns True or False depending on whether a line has the format you need, and use it to filter the lines of the file:

def is_interesting_line(line):
    # comment-style headers and 'key : value' lines are the only ones we care about
    return line.startswith('/*') or ':' in line

for line in filter(is_interesting_line, f):
    # code as before

Change is_interesting_line to suit your needs. If you end up having to handle several different formats, a regex may be a better fit; in that case you could do something like:

import re

# optional indentation, then either a '/* ... */' comment or a 'key : value' line
LINE_REGEX = re.compile(r'\s*(/\*.*\*/|\w+\s*:.*)')

def is_interesting_line(line):
    return LINE_REGEX.match(line) is not None
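
If a regex is used anyway, capture groups can extract the key and the value in the same step instead of only filtering. This is a sketch under the assumption that wanted lines look like "Key : value" with an optional trailing ";"; FIELD_REGEX and parse_field are hypothetical names:

```python
import re

# wanted key, optional spaces, colon, then the value up to whitespace or ';'
FIELD_REGEX = re.compile(r'\s*(Car|House|Age)\s*:\s*([^;\s]+)')


def parse_field(line):
    """Return (key, value) for a wanted line, or None for anything else."""
    match = FIELD_REGEX.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_field('Car : true ;\n'))   # ('Car', 'true')
print(parse_field('Age : 33\n'))       # ('Age', '33')
print(parse_field('..... : "....."'))  # None
```

Listing the wanted keys in the pattern makes the WANTED_KEYS check unnecessary, at the cost of editing the regex whenever a key is added.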

If you want, you can get fancier formatting for the table, but you will probably first need to determine the maximum length of the names, or you can use something like tabulate to do that for you.

For example something like (not tested):

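# fall back to the width of the 'Name' header, which also covers an empty list
max_name_length = max([len(r['name']) for r in records] + [4])
format_string = '{:<{}}\t{:<{}}\t{}\t{}'
print(format_string.format('Name', max_name_length, 'Car', 5, 'House', 'Age'))
for record in records:
    # str() matters for booleans: on Python 3 a width spec would otherwise print them as 1/0
    print(format_string.format(record['name'], max_name_length, str(record['Car']), 5, record['House'], record['Age']))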
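
Putting the pieces together, the following is one possible end-to-end sketch that builds the dictionary of dictionaries asked for in the question. It is not the code above verbatim: the SAMPLE text is a shortened stand-in for the real file, NAME_REGEX, FIELD_REGEX and parse_people are hypothetical names, and it assumes the booleans are spelled true/false:

```python
import re

SAMPLE = """\
/* Full name: abc */
.....(.....)
/* .....
..... : "....."
Car : true ;
House : true ;
Age : 33
/* Full name: xyz */
Car : true ;
Age : 56
"""

NAME_REGEX = re.compile(r'/\*\s*Full name\s*:\s*(.+?)\s*\*/')
FIELD_REGEX = re.compile(r'\s*(Car|House|Age)\s*:\s*([^;\s]+)')


def parse_people(lines):
    """Map each full name to its Car/House/Age values."""
    people = {}
    current = None
    for line in lines:
        name = NAME_REGEX.match(line)
        if name:
            # defaults apply when a field is missing for this person
            current = {'Car': False, 'House': False, 'Age': -1}
            people[name.group(1)] = current
            continue
        field = FIELD_REGEX.match(line)
        if field and current is not None:
            key, value = field.groups()
            current[key] = int(value) if key == 'Age' else value == 'true'
    return people


people = parse_people(SAMPLE.splitlines())
print('Name\tCar\tHouse\tAge')
for name in sorted(people):
    person = people[name]
    print('\t'.join([name, str(person['Car']), str(person['House']), str(person['Age'])]))
```

Because each record is stored in the result as soon as its header line is seen, no extra step is needed for the last person in the file. With a real file, pass the open file object instead of SAMPLE.splitlines().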
