Jess

Reputation: 205

How can I transform my data with numbers to a dictionary containing lists of lists?

values/test/10/blueprint-0.png,2089.0,545.0,2100.0,546.0
values/test/10/blueprint-0.png,2112.0,545.0,2136.0,554.0

What I want to do is read a .txt file full of hundreds of lines like the ones shared above and create a dictionary whose key is built from the first 2 numbers on each line. My expected output:

mydict = {
    '10-0': [[2089,545,2100,545,2100,546,2089,546], 
             [2112,545,2136,545,2136,554,2112,554]],
}

To explain how we went from 4 numbers to 8 numbers: treat the input as x1, y1, x2, y2; in the output they are combined as x1, y1, x2, y1, x2, y2, x1, y2.

In the actual data I have hundreds of values so I will have different keys if the starting 2 elements are different. Let's say if the line in the .txt file starts with values/test/10/blueprint-1.png then the key is '10-1'.

What I have tried:

import re

import itertools

file_data = [re.findall('\d+', i.strip('\n')) for i in open('ground_truth')]
print(file_data)
final_data = [['{}-{}'.format(a, b), list(map(float, c))] for a, b, *c in file_data]
new_data = {a: list(map(lambda x: x[-1], b)) for a, b in
            itertools.groupby(sorted(final_data, key=lambda x: x[0]), key=lambda x: x[0])}

but instead I get

ValueError: not enough values to unpack (expected at least 2, got 1)

and I can't work out how to get from a simple file containing these 2 lines to the output shown in mydict.

Note that for a line such as values/test/10/blueprint-0.png,2089.0,545.0,2100.0,546.0 the regex finds the numbers [10, 0, 2089, 0, 545, 0, 2100, 0, 546, 0]. The 0s at indices 3, 5, 7 and 9 come from the .0 decimal parts and are irrelevant. You can see this by printing file_data, as I did in the code above.

Upvotes: 4

Views: 66

Answers (1)

Martijn Pieters

Reputation: 1122152

You'll need to use a more sophisticated regular expression to ignore the decimal .0 values:

re.findall(r'(?<!\.)\d+', i)

This uses a negative look-behind to ignore any run of digits that starts immediately after a dot (.). This will ignore .0, but if there's a .01, then the extra digits beyond the first one after the dot (the 1 in .01) will still be picked up. For your input that should suffice.
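
To make the difference between the two patterns concrete, here is a quick check on one of the sample lines. The '10.25' string is a made-up input, used only to illustrate the limitation described above:

```python
import re

line = 'values/test/10/blueprint-0.png,2089.0,545.0,2100.0,546.0'

# Plain \d+ also captures the 0 after every decimal point.
plain = re.findall(r'\d+', line)

# The negative look-behind skips any digit run that starts right after a dot.
filtered = re.findall(r'(?<!\.)\d+', line)

# Hypothetical input: only the digit directly after the dot is skipped,
# so the trailing 5 of .25 still leaks through.
leaky = re.findall(r'(?<!\.)\d+', '10.25')

print(plain)     # ['10', '0', '2089', '0', '545', '0', '2100', '0', '546', '0']
print(filtered)  # ['10', '0', '2089', '545', '2100', '546']
print(leaky)     # ['10', '5']
```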

I'd use a regular loop here to make the code more readable, and to keep the code O(N) instead of O(NlogN) (the sorting is not needed):

import re

new_data = {}
with open('ground_truth') as f:
    for line in f:
        k1, k2, x1, y1, x2, y2 = map(int, re.findall(r'(?<!\.)\d+', line))
        key = '{}-{}'.format(k1, k2)
        new_data.setdefault(key, []).append([x1, y1, x2, y1, x2, y2, x1, y2])

I hardcoded your x, y combinations here, as you seem to have a very specific desired order.
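
If you would rather keep that ordering in one place, a small helper is one option. This is only a sketch; box_to_corners is a name I made up for illustration, not something from your code:

```python
def box_to_corners(x1, y1, x2, y2):
    """Expand two opposite corners into four corners, flattened as
    x1, y1, x2, y1, x2, y2, x1, y2 (the order asked for in the question)."""
    return [x1, y1, x2, y1, x2, y2, x1, y2]


corners = box_to_corners(2089, 545, 2100, 546)
print(corners)  # [2089, 545, 2100, 545, 2100, 546, 2089, 546]
```

The append call in the loop would then become new_data.setdefault(key, []).append(box_to_corners(x1, y1, x2, y2)).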

Demo:

>>> import re
>>> file_data = '''\
... values/test/10/blueprint-0.png,2089.0,545.0,2100.0,546.0
... values/test/10/blueprint-0.png,2112.0,545.0,2136.0,554.0
... '''
>>> new_data = {}
>>> for line in file_data.splitlines(True):
...     k1, k2, x1, y1, x2, y2 = map(int, re.findall(r'(?<!\.)\d+', line))
...     key = '{}-{}'.format(k1, k2)
...     new_data.setdefault(key, []).append([x1, y1, x2, y1, x2, y2, x1, y2])
...
>>> new_data
{'10-0': [[2089, 545, 2100, 545, 2100, 546, 2089, 546], [2112, 545, 2136, 545, 2136, 554, 2112, 554]]}

A good alternative is to just treat your input file as the CSV format that it is! Using the csv module is a good way to split out the columns, after which you only need to deal with the digits in the first filename column:

import csv, re

new_data = {}
with open('ground_truth') as f:
    reader = csv.reader(f)
    for filename, *numbers in reader:
        k1, k2 = re.findall(r'\d+', filename)  # no need to even convert to int
        key = '{}-{}'.format(k1, k2)
        x1, y1, x2, y2 = (int(float(n)) for n in numbers)
        new_data.setdefault(key, []).append([x1, y1, x2, y1, x2, y2, x1, y2])
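
To try the csv approach without a file on disk, something along these lines should work. It is a sketch under a few assumptions: io.StringIO stands in for the open file, the third sample row (blueprint-1) is made up to show a second key, and collections.defaultdict is swapped in for setdefault purely as a matter of taste:

```python
import csv
import io
import re
from collections import defaultdict

sample = (
    'values/test/10/blueprint-0.png,2089.0,545.0,2100.0,546.0\n'
    'values/test/10/blueprint-0.png,2112.0,545.0,2136.0,554.0\n'
    'values/test/10/blueprint-1.png,5.0,6.0,7.0,8.0\n'  # made-up row
)

new_data = defaultdict(list)
for filename, *numbers in csv.reader(io.StringIO(sample)):
    # The two digit runs in the path become the key, e.g. '10-0'.
    k1, k2 = re.findall(r'\d+', filename)
    # '2089.0' -> 2089.0 -> 2089
    x1, y1, x2, y2 = (int(float(n)) for n in numbers)
    new_data['{}-{}'.format(k1, k2)].append([x1, y1, x2, y1, x2, y2, x1, y2])

result = dict(new_data)
print(result)
```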

Upvotes: 5
