Budi Mulyo

Reputation: 384

python: split sentence for annotation

Lists:

matrixA = []
matrixB = []

Sentences:

sentence1 = "words1 words2 words3 {matrixA} {matrixB}"
sentence2 = "words3 words4  {matrixA}"
etc.

Expected result:

matrixA = ["words1 words2 words3", "words3 words4"]
matrixB = ["words1 words2 words3"]
etc.

Any ideas, or is there a library that supports this (re, nltk, or something else)? I can do it manually, but I think using a library would be faster.

Upvotes: 1

Views: 200

Answers (1)

cs95

Reputation: 402263

First, if you have many sentences, it would be sensible to put them in a list:

sentences = ["words1 words2 words3 {matrixA} {matrixB}", "words3 words4  {matrixA}"]

Next, for varying variable names such as matrix*, I'd recommend using a defaultdict of lists from the collections module.

from collections import defaultdict
matrices = defaultdict(list)  
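
As a small illustration of why a defaultdict fits here (the keys and strings below are only sample values taken from the question): accessing a missing key creates an empty list for it, so you can append without first checking whether the key exists.

```python
from collections import defaultdict

matrices = defaultdict(list)

# No KeyError here: a missing key is created with an empty list on first access.
matrices["matrixA"].append("words1 words2 words3")
matrices["matrixA"].append("words3 words4")
matrices["matrixB"].append("words1 words2 words3")

print(dict(matrices))
```

With a plain dict you would need `setdefault` or an explicit `if key not in ...` check before each append.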

Now comes the loop. To get the list of variable names in each sentence, use re.findall. Then, for each name found, append the sentence's words to the list for that name in matrices.

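import re

for s in sentences:
    # Each {name} in the sentence names a list that should receive the words.
    for m in re.findall(r"\{(.*?)\}", s):
        # Keep only the text before the first '{'.
        matrices[m].append(s.split('{', 1)[0])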

import json

print(json.dumps(matrices, indent=4))
{
    "matrixA": [
        "words1 words2 words3 ",
        "words3 words4  "
    ],
    "matrixB": [
        "words1 words2 words3 "
    ]
}

This seems to be what you're looking for.

If you don't want trailing spaces, append s.split('{', 1)[0].strip() instead; str.strip removes leading and trailing whitespace characters.
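
Putting the pieces together, here is a minimal sketch of the whole approach with the strip applied. The function name `group_by_annotation` is just a hypothetical helper for illustration, and it assumes that all {name} tags come after the words in each sentence, as in the question's examples.

```python
import re
from collections import defaultdict


def group_by_annotation(sentences):
    """Map each {name} tag to the stripped text preceding the first tag."""
    matrices = defaultdict(list)
    for s in sentences:
        # Text before the first '{', without leading/trailing whitespace.
        words = s.split("{", 1)[0].strip()
        # Every name that appears between braces in this sentence.
        for name in re.findall(r"\{(.*?)\}", s):
            matrices[name].append(words)
    return dict(matrices)


sentences = [
    "words1 words2 words3 {matrixA} {matrixB}",
    "words3 words4  {matrixA}",
]
result = group_by_annotation(sentences)
print(result)
```

If a sentence could contain words between or after the tags, the split on the first '{' would drop them, so this sketch would need adjusting for that case.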

Upvotes: 1
