How to find and store the number of occurrences of substrings in strings into a Python dictionary?

Question

I have a problem: I don't know how to create a matrix.

I have a dictionary of this type:

dico = {
"banana": "sp_345",
"apple": "ap_456",
"pear": "pe_645",

}

and a file like this:

sp_345_4567 pe_645_4567876  ap_456_45678    pe_645_4556789
sp_345_567  pe_645_45678
pe_645_45678    ap_456_345678
sp_345_56789    ap_456_345
pe_645_45678    ap_456_345678
sp_345_56789    ap_456_345
s45678  f45678  f456789 ap_456_52546135

What I want is to create a matrix that shows where a value from the dictionary appears more than n times on a line.

This is how I want to proceed:

Step 1: create a dictionary that maps each line number to the values found on that line.

Like this:

dictionary = {'1': ['sp_345_4567', 'pe_645_4567876', 'ap_456_45678', 'pe_645_4556789'], '2': ['sp_345_567', 'pe_645_45678'], '3': ['pe_645_45678', 'ap_456_345678'], '4': etc ..

Then I want to compare these values with my first dictionary, dico, and count, for example, how many times the banana key appears on each line (and do the same for every key in dico). The problem is that the values in dico are not equal to those in dictionary, because the latter are followed by the pattern '_\w+'.

The idea would be to make a final_dict that would look like this to be able to make a matrix at the end:

final_dict = {'line1': {'Banana': 1, 'Apple': 1, 'Pear': 2}, 'line2': etc ...

Here is my code, which doesn't work:

import pprint
import re
import csv

dico = {
    "banana": "sp_345",
    "apple": "ap_456",
    "pear": "pe_645",
}

dictionary = {}
final_dict = {}
cnt = 0
with open("test.txt") as file :
    reader = csv.reader(file, delimiter ='	')
    for li in reader:
        grp = li
        number = 1
        for li in reader:
            dictionary[number] = grp
            number += 1
            pprint.pprint(dictionary)
            number_fruit = {}
            for key1, val1 in dico.items():
                for key2, val2 in dictionary.items():
                     if val1 == val2+'_\w+':
                         final_dict[key1] = val2

Thanks for the help

EDIT :

I've tried using a dict comprehension

import csv
import re

dico = {
    "banana": "sp_345",
    "apple": "ap_456",
    "pear": "pe_645",
}

with open("test.txt") as file :
    reader = csv.reader(file, delimiter ='	')
    for li in reader:
        pattern = re.search(dico["banana"]+"_\w+", str(li))
        if pattern:
            final_dict = {"line" + str(index + 1):{key:line.count(text) for key, text in dico.items()} for index, line in enumerate(reader)}
        print(final_dict)

But when I print my final dictionary, it only shows 0 for banana (and for every other key):

{'line1': {'banana': 0, 'apple': 0, 'pear': 0}, 'line2': {'banana': 0, 'apple': 0, 'pear': 0}, 'line3': {'banana': 0, 'apple': 0, 'pear': 0}, 'line4': {'banana': 0, 'apple': 0, 'pear': 0}, 'line5': {'banana': 0, 'apple': 0, 'pear': 0}, 'line6': {'banana': 0, 'apple': 0, 'pear': 0}}

So now it looks a bit more like what I wanted, but the occurrence counts never go up. Maybe my condition should be inside the dict comprehension?

Corentin Pane · Accepted Answer

Why it doesn't work

Your test

if val1 == val2+'_\w+':
    ...

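doesn't work because == tests for exact string equality. val2+'_\w+' does not build a pattern: it is an ordinary string that literally ends in "_\w+", for example "sp_345_\w+", so it can never be equal to an ID such as "sp_345_4567". Note also that in your loop val1 is the short prefix from dico (such as "sp_345") while val2 holds the longer IDs read from the file, so the suffix is being appended on the wrong side.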

What you could do about it

You might want to test for containment, for example

if val1 in val2:
    ...

You can check that "sp_345" in "sp_345_4567" evaluates to True.
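
As a small sketch of the containment test, here it is applied to the first line of your sample file. Splitting on whitespace and using startswith are my own choices for illustration, assuming the IDs are separated by tabs or spaces:

```python
# First line of the sample file from the question.
line = "sp_345_4567 pe_645_4567876  ap_456_45678    pe_645_4556789"

# Containment on the whole line: True as soon as the prefix occurs anywhere.
found = "pe_645" in line

# Containment per field: split on whitespace, then keep the fields
# that start with the prefix followed by an underscore.
fields = line.split()
matching = [field for field in fields if field.startswith("pe_645_")]

print(found)     # prints True
print(matching)  # prints ['pe_645_4567876', 'pe_645_4556789']
```

The per-field version is a bit stricter than a plain in test, because it only accepts the prefix at the start of an ID.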

You might also want to actually count the number of times "sp_345" appears in another string, and you can do this using .count:

"sp_345_567  pe_645_45678".count("sp_345") # returns 1
"sp_345_567  pe_645_45678".count("_") # returns 4

You could also do it using regular expressions as you've tried to:

import re
# Raw string so that \w reaches the regex engine unchanged.
pattern = "sp_345_" + r"\w+"

if re.match(pattern, "sp_345_4567"):
    # pattern was found! Do stuff here.
    pass

# alternatively:
print(re.findall(pattern, "sp_345_4567"))
# prints ['sp_345_4567']
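
If you prefer the regular-expression route for counting, a minimal sketch could look like the following. The helper name count_ids is hypothetical, and the \b word boundary plus re.escape are my additions so that a prefix is only matched at the start of an ID:

```python
import re

dico = {"banana": "sp_345", "apple": "ap_456", "pear": "pe_645"}

# First line of the sample file from the question.
line = "sp_345_4567 pe_645_4567876  ap_456_45678    pe_645_4556789"

def count_ids(prefix, text):
    """Count IDs in text that look like prefix + '_' + word characters."""
    # \b anchors the match at a word boundary; re.escape protects any
    # special regex characters that a prefix might contain.
    pattern = r"\b" + re.escape(prefix) + r"_\w+"
    return len(re.findall(pattern, text))

counts = {fruit: count_ids(prefix, line) for fruit, prefix in dico.items()}
print(counts)  # prints {'banana': 1, 'apple': 1, 'pear': 2}
```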

How can you apply that to build your final_dict

You can rewrite your code in a much simpler way using a dictionary comprehension:

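dico = {
    "banana": "sp_345",
    "apple": "ap_456",
    "pear": "pe_645",
}

with open("test.txt") as file:
    # Iterate over the file itself so that each line is a str and
    # line.count(text) counts substrings. A csv.reader row is a list,
    # and list.count only counts elements equal to text, which gives 0 here.
    final_dict = {"line" + str(index + 1): {key: line.count(text) for key, text in dico.items()} for index, line in enumerate(file)}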

I'm building an outer dictionary with keys like "line1", "line2"... and for each of them, the value is an inner dictionary with keys like "banana" or "apple" and each value is the number of times they appear on the line.
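
To try this without a file on disk, here is a self-contained sketch that uses io.StringIO as a stand-in for test.txt (that substitution is my assumption, as is limiting the data to the first three sample lines). Each line is handled as a plain string, so .count does substring counting:

```python
import io

dico = {"banana": "sp_345", "apple": "ap_456", "pear": "pe_645"}

# Stand-in for test.txt: the first three lines of the sample file, tab-separated.
sample = io.StringIO(
    "sp_345_4567\tpe_645_4567876\tap_456_45678\tpe_645_4556789\n"
    "sp_345_567\tpe_645_45678\n"
    "pe_645_45678\tap_456_345678\n"
)

# Outer keys are "line1", "line2", ...; inner dicts map each fruit to its count.
final_dict = {
    "line" + str(index + 1): {key: line.count(text) for key, text in dico.items()}
    for index, line in enumerate(sample)
}

print(final_dict["line1"])  # prints {'banana': 1, 'apple': 1, 'pear': 2}
```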

If you want to know how many times the banana appears on line 4, you'd use

print(final_dict["line4"]["banana"])

Note that I would recommend using a list rather than a dictionary to map results to line numbers, so that the previous query would become:

print(final_list[3]["banana"])
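
Here is a rough sketch of that list-based variant, going one step further to the matrix and the "more than n times" filter mentioned in the question. The names lines, fruits, matrix and frequent, and the threshold n = 1, are assumptions made for illustration:

```python
dico = {"banana": "sp_345", "apple": "ap_456", "pear": "pe_645"}

# In-memory stand-in for the first four lines of test.txt.
lines = [
    "sp_345_4567 pe_645_4567876 ap_456_45678 pe_645_4556789",
    "sp_345_567 pe_645_45678",
    "pe_645_45678 ap_456_345678",
    "sp_345_56789 ap_456_345",
]

# One dictionary of counts per line; list index 0 corresponds to line 1.
final_list = [{key: line.count(text) for key, text in dico.items()} for line in lines]

# Matrix: one row per line, one column per fruit, in the order of `fruits`.
fruits = list(dico)
matrix = [[counts[fruit] for fruit in fruits] for counts in final_list]

# Assumed threshold: keep (line number, fruit) pairs seen more than n times.
n = 1
frequent = [
    (index + 1, fruit)
    for index, counts in enumerate(final_list)
    for fruit, count in counts.items()
    if count > n
]

print(final_list[3]["banana"])  # prints 1
print(matrix[0])                # prints [1, 1, 2]
print(frequent)                 # prints [(1, 'pear')]
```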
