Reputation: 384
Lists:
matrixA = []
matrixB = []
Sentences:
sentence1 = "words1 words2 words3 {matrixA} {matrixB}"
sentence2 = "words3 words4 {matrixA}"
etc.
Expected result:
matrixA = "words1 words2 words3", "words3 words4"
matrixB = "words1 words2 words3"
etc.
Any ideas? Is there a library that supports this, such as re or nltk? I can do it manually, but I think using a library would be faster.
Upvotes: 1
Views: 200
Reputation: 402263
First, if you have many sentences, it would be sensible to put them in a list:
sentences = ["words1 words2 words3 {matrixA} {matrixB}", "words3 words4 {matrixA}"]
Next, for varying variable names such as matrix*, I'd recommend using a defaultdict of lists from the collections module.
from collections import defaultdict
matrices = defaultdict(list)
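To illustrate why a defaultdict of lists fits here, the following is a minimal sketch using the names from the question. The appended strings are just sample values; the point is that a missing key gets an empty list on first access, so no explicit initialization per name is needed.

```python
from collections import defaultdict

# defaultdict(list) creates an empty list the first time a key is accessed,
# where a plain dict would raise KeyError.
matrices = defaultdict(list)

# Sample values only: neither key exists beforehand.
matrices["matrixA"].append("words1 words2 words3")
matrices["matrixA"].append("words3 words4")
matrices["matrixB"].append("words1 words2 words3")

print(dict(matrices))
```

This way new names such as matrixC need no changes to the code.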
Now comes the loop. To get a list of the names in each sentence, use re.findall. Then, for each variable name found, append the words to the list for that variable name in matrices.
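import re
for s in sentences:
    # Find every name written between braces, e.g. "matrixA"
    for m in re.findall(r"\{(.*?)\}", s):
        # Keep only the text before the first "{"
        matrices[m].append(s.split('{', 1)[0])
print(dict(matrices))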
{
    "matrixA": [
        "words1 words2 words3 ",
        "words3 words4 "
    ],
    "matrixB": [
        "words1 words2 words3 "
    ]
}
Which seems to be what you're looking for.
If you don't want trailing spaces, append s.split('{', 1)[0].strip() instead, calling str.strip to get rid of leading/trailing whitespace characters.
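Putting the pieces together with the strip variant, a self-contained sketch could look like this. The function name group_by_placeholder is hypothetical, and the code assumes that the words always come before the first {name} placeholder, as in the question's examples.

```python
import re
from collections import defaultdict


def group_by_placeholder(sentences):
    """Map each {name} placeholder to the text before the first '{'."""
    groups = defaultdict(list)
    for s in sentences:
        # Assumption: the words precede all placeholders in the sentence.
        prefix = s.split("{", 1)[0].strip()
        for name in re.findall(r"\{(.*?)\}", s):
            groups[name].append(prefix)
    return dict(groups)


sentences = [
    "words1 words2 words3 {matrixA} {matrixB}",
    "words3 words4 {matrixA}",
]
result = group_by_placeholder(sentences)
print(result)
```

If a sentence can mix words and placeholders in other positions, the prefix logic would need to change.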
Upvotes: 1