Reputation: 205
values/test/10/blueprint-0.png,2089.0,545.0,2100.0,546.0
values/test/10/blueprint-0.png,2112.0,545.0,2136.0,554.0
What I want to do is read a .txt file full of hundreds of values like the ones shared above and create a dictionary whose key is built from the first 2 numbers in each line. My expected output:
mydict = {
    '10-0': [[2089,545,2100,545,2100,546,2089,546],
             [2112,545,2136,545,2136,554,2112,554]],
}
To explain how we went from 4 numbers to 8 numbers, let's see them as x1, y1, x2, y2 at first; in the output they are combined as x1, y1, x2, y1, x2, y2, x1, y2.
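In code, the mapping I am after would look roughly like this. It is only a sketch, and box_to_corners is a placeholder name for illustration, not something from a library:

```python
def box_to_corners(x1, y1, x2, y2):
    """Expand two opposite corners into the four corners of the box."""
    # Order: (x1, y1), (x2, y1), (x2, y2), (x1, y2)
    return [x1, y1, x2, y1, x2, y2, x1, y2]


corners = box_to_corners(2089, 545, 2100, 546)
print(corners)  # [2089, 545, 2100, 545, 2100, 546, 2089, 546]
```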
In the actual data I have hundreds of values, so I will have different keys if the starting 2 elements are different. For example, if a line in the .txt file starts with values/test/10/blueprint-1.png then the key is '10-1'.
What I have tried:
import re
import itertools
file_data = [re.findall('\d+', i.strip('\n')) for i in open('ground_truth')]
print(file_data)
final_data = [['{}-{}'.format(a, b), list(map(float, c))] for a, b, *c in file_data]
new_data = {a: list(map(lambda x: x[-1], b)) for a, b in
            itertools.groupby(sorted(final_data, key=lambda x: x[0]), key=lambda x: x[0])}
but instead I get
ValueError: not enough values to unpack (expected at least 2, got 1)
and I can't seem to get from a simple file with these 2 lines in it to the answer expected in mydict.
Note that, taking this line for example, values/test/10/blueprint-0.png,2089.0,545.0,2100.0,546.0 we will find these numbers: [10, 0, 2089, 0, 545, 0, 2100, 0, 546, 0]. The 0s at indices 3, 5, 7 and 9 (counting from 0) come from the .0 decimal parts and are irrelevant. This can be seen by printing file_data, as I did in the code above.
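To make this concrete, here is a minimal sketch of what the plain \d+ pattern returns for that one line (the variable names are only for illustration). Note that re.findall returns strings, and each .0 contributes a separate '0' match:

```python
import re

line = 'values/test/10/blueprint-0.png,2089.0,545.0,2100.0,546.0'

# Every run of digits is a separate match, including the 0 after each dot.
tokens = re.findall(r'\d+', line)
print(tokens)
# ['10', '0', '2089', '0', '545', '0', '2100', '0', '546', '0']
```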
Upvotes: 4
Views: 66
Reputation: 1122152
You'll need to use a more sophisticated regular expression to ignore the decimal .0 values:
re.findall(r'(?<!\.)\d+', i)
This uses a negative look-behind to ignore any digits that are preceded by a . (dot). This will ignore .0, but if there's a .01, then those extra digits beyond the .0 (or .<digit>) will still be picked up. For your input that should suffice.
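A short sketch of that caveat, using made-up sample strings. The second pattern, which also rejects digits preceded by another digit, is my own assumption about how you could tighten the match if fractional parts longer than one digit ever show up; it is not needed for the input shown in the question:

```python
import re

simple = r'(?<!\.)\d+'       # skip digits directly after a dot
stricter = r'(?<![.\d])\d+'  # also skip digits that continue a fractional part

one_decimal = re.findall(simple, 'a/10/b-0.png,2089.0')
two_decimals = re.findall(simple, 'a/10/b-0.png,2089.01')
two_decimals_strict = re.findall(stricter, 'a/10/b-0.png,2089.01')

print(one_decimal)          # ['10', '0', '2089']
print(two_decimals)         # ['10', '0', '2089', '1']
print(two_decimals_strict)  # ['10', '0', '2089']
```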
I'd use a regular loop here to make the code more readable, and to keep the code O(N) instead of O(NlogN) (the sorting is not needed):
import re

new_data = {}
with open('ground_truth') as f:
    for line in f:
        # the first two integers form the key; the remaining four are the box corners
        k1, k2, x1, y1, x2, y2 = map(int, re.findall(r'(?<!\.)\d+', line))
        key = '{}-{}'.format(k1, k2)
        new_data.setdefault(key, []).append([x1, y1, x2, y1, x2, y2, x1, y2])
I hardcoded your x, y combinations here, as you seem to have a very specific desired order.
Demo:
>>> import re
>>> file_data = '''\
... values/test/10/blueprint-0.png,2089.0,545.0,2100.0,546.0
... values/test/10/blueprint-0.png,2112.0,545.0,2136.0,554.0
... '''
>>> new_data = {}
>>> for line in file_data.splitlines(True):
...     k1, k2, x1, y1, x2, y2 = map(int, re.findall(r'(?<!\.)\d+', line))
...     key = '{}-{}'.format(k1, k2)
...     new_data.setdefault(key, []).append([x1, y1, x2, y1, x2, y2, x1, y2])
...
>>> new_data
{'10-0': [[2089, 545, 2100, 545, 2100, 546, 2089, 546], [2112, 545, 2136, 545, 2136, 554, 2112, 554]]}
A good alternative is to just treat your input file as the CSV format that it is! Using the csv module is a good way to split out the columns, after which you only need to deal with the digits in the first filename column:
import csv, re

new_data = {}
with open('ground_truth') as f:
    reader = csv.reader(f)
    for filename, *numbers in reader:
        k1, k2 = re.findall(r'\d+', filename)  # no need to even convert to int
        key = '{}-{}'.format(k1, k2)
        x1, y1, x2, y2 = (int(float(n)) for n in numbers)
        new_data.setdefault(key, []).append([x1, y1, x2, y1, x2, y2, x1, y2])
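A quick way to try the csv version without a file on disk. This is only a sketch: it feeds the two sample lines from the question through io.StringIO as a stand-in for open('ground_truth'), and the name sample is just for illustration:

```python
import csv
import io
import re

# Stand-in for the real file: the two sample lines from the question.
sample = io.StringIO(
    'values/test/10/blueprint-0.png,2089.0,545.0,2100.0,546.0\n'
    'values/test/10/blueprint-0.png,2112.0,545.0,2136.0,554.0\n'
)

new_data = {}
for filename, *numbers in csv.reader(sample):
    k1, k2 = re.findall(r'\d+', filename)  # digits from the path only
    x1, y1, x2, y2 = (int(float(n)) for n in numbers)
    new_data.setdefault('{}-{}'.format(k1, k2), []).append(
        [x1, y1, x2, y1, x2, y2, x1, y2])

print(new_data)
# {'10-0': [[2089, 545, 2100, 545, 2100, 546, 2089, 546], [2112, 545, 2136, 545, 2136, 554, 2112, 554]]}
```

Because the numeric columns are parsed with float first, this variant also copes with values such as 2089.01 (they are truncated to 2089), which the look-behind regex alone would not.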
Upvotes: 5