Firefly

Reputation: 459

Python dict of dicts with comprehension

I have a file with lines looking like this:

"voc_sales_dac" "QVN"   "BE"    "FR"    "21513287expe"  "21513287"  "expe"  "10"    "7" "vehicule livrée mais vendeur en congé donc vehicule receptioné plus tard"
"voc_sales_dac" "QVN"   "CH"    "FR"    "21207010reco"  "21207010"  "reco"  "10"    "10"    "A ma fille"

I tokenize the text in field 10, first into sentences and then into individual words, so that I can extract the starting position of each word in the text.

What I want to get is a dict like this:

maped { 21513287expe: { vehicule: 0,
                        livrée: 10,
                        mais: 17,
                        vendeur: 22,
                        en: 30,
                        congé: 33,
                        donc: 39,
                        vehicule: 44,
                        receptioné: 53,
                        plus: 64,
                        tard: 69
                       },
        21207010reco: { A: 0,
                        ma: 3,
                        fille: 6
                      },
      }  

What I have done:

import nltk.data
from nltk.tokenize import TreebankWordTokenizer
W_tokenizer = TreebankWordTokenizer()
S_tokenizer = nltk.data.load('tokenizers/punkt/PY3/french.pickle')
import csv
import re

pattern = re.compile("[a-zá-úä-üâ-ûà-ùç]+")

with open('FR_test.csv', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile, delimiter="\t",skipinitialspace=True)
    for row in reader:
        phrases = S_tokenizer.tokenize(row[9])
        for v in phrases:
            tokens = W_tokenizer.tokenize(v)
            maped={row[4]:{w:row[9].index(w)} for w in tokens if pattern.match(w)}

Is it possible to achieve this in a dict comprehension?

Upvotes: 0

Views: 81

Answers (1)

jsan

Reputation: 2814

Try this:

import csv

def parse_phrase(phrase):
    tokens = phrase.split(' ')  # or whatever way you want to tokenize the phrase
    # index() returns the offset of the first match of each token
    return {w: phrase.index(w) for w in tokens}

with open('FR_test.csv', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile, delimiter="\t", skipinitialspace=True)

    # build the nested dict while the file is open, and keep the result
    maped = {row[4]: parse_phrase(row[9]) for row in reader}
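
One caveat with `phrase.index(w)`: it returns the first substring match, so a short word such as `en` can resolve to a position inside an earlier word such as `vendeur`. A possible variant, sketched below, takes the offsets from `re.finditer` instead. The regex stands in for the NLTK tokenizers from the question, and the names `WORD`, `word_positions`, `build_map` and the in-memory `sample` are made up for illustration. Offsets are zero-based character positions, so they may not match the illustrative numbers in the question exactly, and since dict keys are unique, a repeated word keeps only its first offset.

```python
import csv
import io
import re

# Runs of Unicode letters (accented characters included); digits and
# punctuation are skipped. Assumption: this regex replaces the NLTK
# tokenizers used in the question.
WORD = re.compile(r"[^\W\d_]+")


def word_positions(text):
    """Map each word to the start offset of its first occurrence."""
    positions = {}
    for match in WORD.finditer(text):
        # setdefault keeps the first offset when a word repeats
        positions.setdefault(match.group(), match.start())
    return positions


def build_map(rows):
    """Nested dict: row id (5th field) -> {word: offset in 10th field}."""
    return {row[4]: word_positions(row[9]) for row in rows}


# One tab-separated sample row standing in for FR_test.csv
sample = (
    '"voc_sales_dac"\t"QVN"\t"CH"\t"FR"\t"21207010reco"\t'
    '"21207010"\t"reco"\t"10"\t"10"\t"A ma fille"\n'
)
maped = build_map(csv.reader(io.StringIO(sample), delimiter="\t"))
print(maped)
```

With a real file, the same `build_map` call should work on the `reader` from the snippet above, as long as it runs inside the `with` block.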

Upvotes: 1
