Reputation: 322
I have a Python dictionary that contains lists of terms as values:
myDict = {
    'ID_1': ['(dog|cat[a-z+]|horse)', '(car[a-z]+|house|apple\w)', '(bird|tree|panda)'],
    'ID_2': ['(horse|building|computer)', '(panda\w|lion)'],
    'ID_3': ['(wagon|tiger|cat\w*)'],
    'ID_4': ['(dog)']
}
I want to read the list items in each value as individual regular expressions. If they match any text, the matched text should be returned as keys in a separate dictionary, with the original keys (the IDs) as the values.
Suppose these terms are used as regexes to search this string:
"dog panda cat cats pandas car carts"
The general approach I have in mind is something like:
for key, value in myDict.items():
    for item in value:
        if re.compile(item) matches something in the text:
            newDict[match] = [list of keys]
The expected output would be:
newDict = {
    'car': ['ID_1'],
    'carts': ['ID_1'],
    'dog': ['ID_1', 'ID_4'],
    'panda': ['ID_1', 'ID_2'],
    'pandas': ['ID_1', 'ID_2'],
    'cat': ['ID_1', 'ID_3'],
    'cats': ['ID_1', 'ID_3']
}
A matched string should appear as a key in newDict only if it actually matched something in the body of text. So in the output, 'carts' is listed because a regex in ID_1's values matched it, and therefore ID_1 is listed as its value.
Upvotes: 4
Views: 2593
Reputation: 120598
Here's a simple script that seems to fit your requirements:
import re
from collections import defaultdict
text = """
the eye of the tiger
a dog in the manger
the cat in the hat
a kingdom for my horse
a bird in the hand
"""
myDict = {
    'ID_1': ['(dog|cat|horse)', '(car|house|apples)', '(bird|tree|panda)'],
    'ID_2': ['(horse|building|computer)', '(panda|lion)'],
    'ID_3': ['(wagon|tiger|cat)'],
    'ID_4': ['(dog)'],
}
# Map each matched string to the list of IDs whose patterns matched it.
newDict = defaultdict(list)
for key, values in myDict.items():
    for pattern in values:
        for match in re.finditer(pattern, text):
            newDict[match.group(0)].append(key)
for item in newDict.items():
    print(item)
output:
('dog', ['ID_1', 'ID_4'])
('cat', ['ID_1', 'ID_3'])
('horse', ['ID_1', 'ID_2'])
('bird', ['ID_1'])
('tiger', ['ID_3'])
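Note that the script above uses simplified patterns and a different text than the question. As a rough sketch of how the same idea might be applied to the question's string, the variant below tests each whole word with `fullmatch` and records an ID at most once per word. The helper name `match_ids` and the `\w*` suffixes in the patterns are my own assumptions (the patterns as written in the question would not produce the expected output), so treat this as one possible reading rather than a definitive solution.

```python
import re
from collections import defaultdict


def match_ids(patterns_by_id, text):
    """Map each whole word in text to the IDs whose patterns fully match it."""
    result = defaultdict(list)
    words = re.findall(r"\w+", text)  # split the text into word tokens
    for id_, patterns in patterns_by_id.items():
        for pattern in patterns:
            regex = re.compile(pattern)
            for word in words:
                # fullmatch avoids matching 'panda' inside 'pandas';
                # the membership check keeps each ID listed only once.
                if regex.fullmatch(word) and id_ not in result[word]:
                    result[word].append(id_)
    return dict(result)


# Hypothetical patterns: the question's terms with \w* suffixes (assumption).
patterns_by_id = {
    "ID_1": [r"(dog|cat\w*|horse)", r"(car\w*|house|apple\w*)", r"(bird|tree|panda\w*)"],
    "ID_2": [r"(horse|building|computer)", r"(panda\w*|lion)"],
    "ID_3": [r"(wagon|tiger|cat\w*)"],
    "ID_4": [r"(dog)"],
}

new_dict = match_ids(patterns_by_id, "dog panda cat cats pandas car carts")
for item in new_dict.items():
    print(item)
```

Matching whole words is a design choice here: with `re.finditer` on the raw text, a pattern such as `panda` would also match the first five letters of `pandas`.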
Upvotes: 3
Reputation: 375475
One way is to convert the regexes into plain lists of terms, e.g. with string manipulation (this only works when each pattern is a simple alternation of literal words):
In [11]: {id_: "|".join(ls).replace("(", "").replace(")", "").split("|") for id_, ls in myDict.items()}
Out[11]:
{'ID_1': ['dog',
  'cat',
  'horse',
  'car',
  'house',
  'apples',
  'bird',
  'tree',
  'panda'],
 'ID_2': ['horse', 'building', 'computer', 'panda', 'lion'],
 'ID_3': ['wagon', 'tiger', 'cat'],
 'ID_4': ['dog']}
You can make this into a DataFrame:
In [12]: import pandas as pd
    ...: from collections import Counter
In [13]: pd.DataFrame({id_:Counter( "|".join(ls).replace("(", "").replace(")", "").split("|") ) for id_, ls in myDict.items()}).fillna(0).astype(int)
Out[13]:
          ID_1  ID_2  ID_3  ID_4
apples       1     0     0     0
bird         1     0     0     0
building     0     1     0     0
car          1     0     0     0
cat          1     0     1     0
computer     0     1     0     0
dog          1     0     0     1
horse        1     1     0     0
house        1     0     0     0
lion         0     1     0     0
panda        1     1     0     0
tiger        0     0     1     0
tree         1     0     0     0
wagon        0     0     1     0
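If you would rather have a plain dictionary from term to IDs than a DataFrame, the same flattening can be inverted without pandas. This is only a sketch: the helper names `flatten_alternations` and `invert` are made up for illustration, and it assumes every pattern is a parenthesised alternation of literal words, as in the simplified `myDict`.

```python
from collections import defaultdict


def flatten_alternations(patterns):
    """Split patterns like '(a|b)' into literal terms (assumes no other regex syntax)."""
    return "|".join(patterns).replace("(", "").replace(")", "").split("|")


def invert(patterns_by_id):
    """Build a term -> list-of-IDs mapping from an ID -> patterns mapping."""
    ids_by_term = defaultdict(list)
    for id_, patterns in patterns_by_id.items():
        for term in flatten_alternations(patterns):
            if id_ not in ids_by_term[term]:  # list each ID once per term
                ids_by_term[term].append(id_)
    return dict(ids_by_term)


myDict = {
    'ID_1': ['(dog|cat|horse)', '(car|house|apples)', '(bird|tree|panda)'],
    'ID_2': ['(horse|building|computer)', '(panda|lion)'],
    'ID_3': ['(wagon|tiger|cat)'],
    'ID_4': ['(dog)'],
}

ids_by_term = invert(myDict)
print(ids_by_term['cat'])
```

Unlike the regex-based approach, this lists every term in the patterns, whether or not it occurs in any text.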
Upvotes: 1