Reputation: 55
I need to parse a text table which has the following format:
-----------------------------------------
| Serial |         Name         | marks |
| Number |First | Middle | Last |       |
-----------------------------------------
|   1    | john |   s    | doe  |  56   |
|   2    | jim  |   d    |  bill|  60   |
After parsing the table, the output should be a nested dictionary with the data as lists.
TableData = {'Serial Number': [1, 2],
             'Name': {'First': ['john', 'jim'],
                      'Middle': ['s', 'd'],
                      'Last': ['doe', 'bill']},
             'marks': [56, 60]}
As of now I have logic to get the positions of the delimiters (|), and I can extract the text in between the delimiters.
posList = [[0,9,32,40],[0,9,16,25,32,40]]
nameList = [['Serial','Name','marks'],['Number ','First','Middle','Last',' ']]
But I am having difficulty converting this to the nested dictionary structure.
Upvotes: 4
Views: 9321
Reputation: 18521
If you already know what the data structure should look like, can't you skip the header rows and extract the data from the remaining rows? For example, assuming the table is in a text file whose path is stored in table_file, then:
table_data = {'Serial Number': [],
              'Name': {'First': [],
                       'Middle': [],
                       'Last': []},
              'marks': []}
with open(table_file, 'r') as table:
    # skip the top border and the two header rows
    for _ in range(3):
        next(table)
    for row in table:
        # skip border lines (dashes) and blank lines
        if not row.strip() or row.startswith('-'):
            continue
        row = row.strip().split('|')
        values = [r.strip() for r in row if r != '']
        assert len(values) == 5
        table_data['Serial Number'].append(int(values[0]))
        table_data['Name']['First'].append(values[1])
        table_data['Name']['Middle'].append(values[2])
        table_data['Name']['Last'].append(values[3])
        table_data['marks'].append(int(values[4]))
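The hard-coded appends above can be generalised so that the same loop works for any skeleton dictionary. The following is only a sketch under a few assumptions: Python 3.7+ (so dictionaries keep insertion order), the skeleton's leaf lists appear in the same order as the table columns, and only data rows are passed in (the header rows have already been skipped). The helper names leaf_lists, to_number and fill_rows are made up for this example.

```python
def leaf_lists(table):
    """Yield the leaf lists of a nested dict skeleton in insertion order."""
    for value in table.values():
        if isinstance(value, dict):
            yield from leaf_lists(value)
        else:
            yield value


def to_number(text):
    """Convert digit-only cells to int; leave everything else as a string."""
    return int(text) if text.isdigit() else text


def fill_rows(table, rows):
    """Append each cell of each data row to the matching leaf list."""
    columns = list(leaf_lists(table))
    for row in rows:
        cells = [c.strip() for c in row.strip().strip('|').split('|')]
        if len(cells) != len(columns):
            continue  # skip borders, blank lines and malformed rows
        for column, cell in zip(columns, cells):
            column.append(to_number(cell))
    return table


skeleton = {'Serial Number': [],
            'Name': {'First': [], 'Middle': [], 'Last': []},
            'marks': []}
rows = ['| 1 | john | s | doe | 56 |',
        '| 2 | jim | d | bill| 60 |']
filled = fill_rows(skeleton, rows)
print(filled)
```

Matching columns to leaves by position keeps the filling step independent of the header names, at the cost of relying on the skeleton being built in column order.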
EDIT:
To construct the table_data dictionary itself, consider the following code sketch. Fair warning: I tested it on your example and it seems to work, and it should work for any table with two header rows, but it is sloppy because I wrote it in about 10 minutes. It could still be an OK starting point for you to improve and expand. It also assumes you already have code for extracting pos_list and name_list.
from itertools import tee

def pairwise(iterable):
    # s -> (s0, s1), (s1, s2), (s2, s3), ...
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

def create_table_dict(pos_list, name_list):
    # turn each row's delimiter positions into (start, end) intervals
    intervals = []
    for sub_list in pos_list:
        intervals.append(list(pairwise(sub_list)))
    # pair every interval with the header text found in it
    items = []
    for interval, name in zip(intervals, name_list):
        items.append([(i, n) for i, n in zip(interval, name)])
    # match each first-row header against the second-row headers under it
    names = []
    for int1, name1 in items[0]:
        past_names = []
        for int2, name2 in items[1]:
            if int1[0] == int2[0]:
                if int1[1] == int2[1]:
                    # same span: a single column, e.g. 'Serial Number'
                    names.append(' '.join((name1, name2)).strip())
                elif int2[1] < int1[1]:
                    past_names.append(name2)
            elif int1[0] < int2[0]:
                if int2[1] < int1[1]:
                    past_names.append(name2)
                elif int1[1] == int2[1]:
                    # last sub-column: e.g. 'Name:First,Middle,Last'
                    names.append('{0}:{1}'.format(name1,
                                                  ','.join(past_names + [name2])))
    # build the (still empty) nested dictionary from the collected names
    table = {}
    for name in names:
        if ':' not in name:
            table[name] = []
        else:
            upper, nested = name.split(':')
            nested = nested.split(',')
            table[upper] = {}
            for n in nested:
                table[upper][n] = []
    print(table)
    return table
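Since the code above takes pos_list and name_list as given, here is one possible way to compute them from the two header rows. This is a minimal sketch and not necessarily how the asker does it: the function names delimiter_positions and header_names are made up for this example, and the header rows are typed in with the column widths implied by the positions in the question.

```python
def delimiter_positions(line, delimiter='|'):
    """Return the index of every delimiter character in the line."""
    return [i for i, ch in enumerate(line) if ch == delimiter]


def header_names(line, positions):
    """Return the stripped text between each pair of adjacent delimiters."""
    return [line[start + 1:end].strip()
            for start, end in zip(positions, positions[1:])]


# The two header rows of the sample table (assumed column widths).
header_rows = [
    '| Serial |         Name         | marks |',
    '| Number |First | Middle | Last |       |',
]

pos_list = [delimiter_positions(row) for row in header_rows]
name_list = [header_names(row, pos)
             for row, pos in zip(header_rows, pos_list)]

print(pos_list)
print(name_list)
```

Note that the second list of positions includes the closing delimiter at index 40. The interval matching relies on that: without it there is no second-row interval under the last first-row header, so that column would be left out of the resulting dictionary.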
Upvotes: 4