Reputation: 93
I have a problem: I don't know how to create a matrix.
I have a dictionary of this type:
dico = {
    "banana": "sp_345",
    "apple": "ap_456",
    "pear": "pe_645",
}
and a file like this:
sp_345_4567 pe_645_4567876 ap_456_45678 pe_645_4556789
sp_345_567 pe_645_45678
pe_645_45678 ap_456_345678
sp_345_56789 ap_456_345
pe_645_45678 ap_456_345678
sp_345_56789 ap_456_345
s45678 f45678 f456789 ap_456_52546135
What I want to do is create a matrix showing, for each line, which values from the dictionary appear more than n times.
This is how I want to proceed:
Step 1: create a dictionary that maps each line number to the values found on that line, like this:
dictionary = {'1': ['sp_345_4567', 'pe_645_4567876', 'ap_456_45678', 'pe_645_4556789'], '2': ['sp_345_567', 'pe_645_45678'], '3': ['pe_645_45678', 'ap_456_345678'], '4': ...}
Then I want to compare these values with my first dictionary, dico, and count, for example, how many times the banana key appears in each line (and do the same for every key in dico). The problem is that the values in dico are not equal to those in dictionary, because the latter are followed by the pattern '_\w+'.
The idea would be to make a final_dict that would look like this to be able to make a matrix at the end:
final_dict = {'line1': {'banana': 1, 'apple': 1, 'pear': 2}, 'line2': ...}
Here is my code, which doesn't work:
import pprint
import re
import csv
dico = {
    "banana": "sp_345",
    "apple": "ap_456",
    "pear": "pe_645",
}
dictionary = {}
final_dict = {}
cnt = 0
with open("test.txt") as file :
    reader = csv.reader(file, delimiter ='\t')
    for li in reader:
        grp = li
        number = 1
        for li in reader:
            dictionary[number] = grp
            number += 1
pprint.pprint(dictionary)
number_fruit = {}
for key1, val1 in dico.items():
    for key2, val2 in dictionary.items():
        if val1 == val2+'_\w+':
            final_dict[key1] = val2
Thanks for the help
EDIT :
I've tried using a dict comprehension
import csv
import re
dico = {
    "banana": "sp_345",
    "apple": "ap_456",
    "pear": "pe_645",
}
with open("test.txt") as file :
    reader = csv.reader(file, delimiter ='\t')
    for li in reader:
        pattern = re.search(dico["banana"]+"_\w+", str(li))
        if pattern:
            final_dict = {"line" + str(index + 1):{key:line.count(text) for key, text in dico.items()} for index, line in enumerate(reader)}
            print(final_dict)
But when I print my final dictionary, it only contains zeros, for banana and for every other key:
{'line1': {'banana': 0, 'apple': 0, 'pear': 0}, 'line2': {'banana': 0, 'apple': 0, 'pear': 0}, 'line3': {'banana': 0, 'apple': 0, 'pear': 0}, 'line4': {'banana': 0, 'apple': 0, 'pear': 0}, 'line5': {'banana': 0, 'apple': 0, 'pear': 0}, 'line6': {'banana': 0, 'apple': 0, 'pear': 0}}
So now it looks a bit more like what I wanted, but the occurrence counts never increase. Maybe my condition should be inside the dict comprehension?
Upvotes: 1
Views: 87
Reputation: 4943
Why it doesn't work
Your test
if val1 == val2+'_\w+':
    ...
doesn't work because == tests plain string equality and does not interpret '_\w+' as a regular expression. You are comparing val1, which could be "sp_345_4567", with val2+'_\w+', which is a string and could be literally "sp_345_\w+", and they are not equal.
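To make the difference concrete, here is a small sketch comparing == with two checks that do treat the suffix as variable. The names prefix and token are illustrative only and do not come from the question's code.

```python
import re

prefix = "sp_345"        # a value from dico
token = "sp_345_4567"    # a token from the file

# Plain equality: '_\w+' is just a run of literal characters here.
literal_equal = token == prefix + r"_\w+"

# A regular expression has to go through the re module.
regex_match = re.fullmatch(re.escape(prefix) + r"_\w+", token) is not None

# Without regex: check that the token starts with the prefix plus "_".
starts_with = token.startswith(prefix + "_")

print(literal_equal, regex_match, starts_with)  # False True True
```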
What you could do about it
You can test whether one string is contained in another with the in operator:
if val1 in val2:
    ...
You can check that "sp_345" in "sp_345_4567" returns True.
You may also want to count how many times "sp_345" appears in another string, and you can do this using .count:
"sp_345_567 pe_645_45678".count("sp_345") # returns 1
"sp_345_567 pe_645_45678".count("_") # returns 4
Finally, if you really need the pattern, use the re module:
import re
pattern = "sp_345_" + "\\w+"
if re.match(pattern, "sp_345_4567"):
    # pattern was found! Do stuff here.
    pass
# alternatively:
print(re.findall(pattern, "sp_345_4567"))
# prints ['sp_345_4567']
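If substring counting turns out to be too loose (for example, "sp_345" is also a substring of "sp_3456_1"), one possible refinement is to count regex matches per line instead. This is only a sketch: the helper name count_tokens is hypothetical, and it assumes tokens always look like the prefix followed by "_" and word characters.

```python
import re

dico = {"banana": "sp_345", "apple": "ap_456", "pear": "pe_645"}

def count_tokens(line, prefix):
    """Count tokens in `line` that start with <prefix>_ at a word boundary."""
    pattern = r"\b" + re.escape(prefix) + r"_\w+"
    return len(re.findall(pattern, line))

line = "sp_345_4567 pe_645_4567876 ap_456_45678 pe_645_4556789"
counts = {fruit: count_tokens(line, prefix) for fruit, prefix in dico.items()}
print(counts)  # {'banana': 1, 'apple': 1, 'pear': 2}
```

The \b anchor keeps a token such as "xsp_345_2" from being counted, and the required "_" after the prefix excludes "sp_3456_1".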
How can you apply that to build your final_dict
You can rewrite your code in a much simpler way using a dictionary comprehension:
dico = {
    "banana": "sp_345",
    "apple": "ap_456",
    "pear": "pe_645",
}
with open("test.txt") as file:
    # Iterate over the raw lines (strings) rather than csv rows (lists):
    # list.count only matches whole elements, so it would always return 0 here.
    final_dict = {"line" + str(index + 1): {key: line.count(text) for key, text in dico.items()} for index, line in enumerate(file)}
I'm building an outer dictionary with keys like "line1", "line2", ... and for each of them, the value is an inner dictionary with keys like "banana" or "apple", and each value is the number of times they appear on the line.
If you want to know how many times the banana appears on line 4, you'd use
print(final_dict["line4"]["banana"])
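As a quick check of the structure described above, this sketch builds the same kind of nested dictionary from a few of the sample lines in the question. Holding the lines in a list of strings is an assumption made here so that no file is needed.

```python
dico = {"banana": "sp_345", "apple": "ap_456", "pear": "pe_645"}

# Sample lines from the question, kept in memory instead of test.txt.
lines = [
    "sp_345_4567 pe_645_4567876 ap_456_45678 pe_645_4556789",
    "sp_345_567 pe_645_45678",
    "pe_645_45678 ap_456_345678",
    "s45678 f45678 f456789 ap_456_52546135",
]

# Each line is a plain string, so str.count counts substring occurrences.
final_dict = {
    "line" + str(index + 1): {key: line.count(text) for key, text in dico.items()}
    for index, line in enumerate(lines)
}

print(final_dict["line1"])            # {'banana': 1, 'apple': 1, 'pear': 2}
print(final_dict["line4"]["banana"])  # 0
```

Note that this relies on each line being a string. On a list, such as a row produced by csv.reader, count compares whole elements, so a prefix like "sp_345" is counted 0 times.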
Note that I would recommend using a list rather than a dictionary to map results to line numbers, so that the previous query would become:
print(final_list[3]["banana"])
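A minimal sketch of that list-based variant, extended to the matrix the question asks for, might look as follows. The in-memory lines, the names columns, matrix and above_n, and the threshold n = 1 are all assumptions for illustration.

```python
dico = {"banana": "sp_345", "apple": "ap_456", "pear": "pe_645"}

# A few sample lines from the question, kept in memory for the example.
lines = [
    "sp_345_4567 pe_645_4567876 ap_456_45678 pe_645_4556789",
    "sp_345_567 pe_645_45678",
    "pe_645_45678 ap_456_345678",
]

# One dict of counts per line; list index 0 corresponds to line 1.
final_list = [
    {key: line.count(text) for key, text in dico.items()} for line in lines
]

# Matrix as a list of rows, with columns in the order of dico's keys.
columns = list(dico)
matrix = [[counts[key] for key in columns] for counts in final_list]

# Hypothetical threshold: mark the cells where the count exceeds n.
n = 1
above_n = [[count > n for count in row] for row in matrix]

print(columns)  # ['banana', 'apple', 'pear']
print(matrix)   # [[1, 1, 2], [1, 0, 1], [0, 1, 1]]
```

Here final_list[0]["pear"] is 2, and above_n is True only for the pear column of the first row.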
Upvotes: 1