Reputation: 459
I have a file with lines looking like this:
"voc_sales_dac" "QVN" "BE" "FR" "21513287expe" "21513287" "expe" "10" "7" "vehicule livrée mais vendeur en congé donc vehicule receptioné plus tard"
"voc_sales_dac" "QVN" "CH" "FR" "21207010reco" "21207010" "reco" "10" "10" "A ma fille"
I tokenize the text in field 10, first into sentences and then into individual words, to extract the starting position of each word in the text.
What I want to get is a dict like this:
maped {
    21513287expe: {
        vehicule: 0,
        livrée: 10,
        mais: 17,
        vendeur: 22,
        en: 30,
        congé: 33,
        donc: 39,
        vehicule: 44,
        receptioné: 53,
        plus: 64,
        tard: 69
    },
    21207010reco: {
        A: 0,
        ma: 3,
        fille: 6
    },
}
What I have done:
import csv
import re

import nltk.data
from nltk.tokenize import TreebankWordTokenizer

W_tokenizer = TreebankWordTokenizer()
S_tokenizer = nltk.data.load('tokenizers/punkt/PY3/french.pickle')
pattern = re.compile("[a-zá-úä-üâ-ûà-ùç]+")

with open('FR_test.csv', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile, delimiter="\t", skipinitialspace=True)
    for row in reader:
        phrases = S_tokenizer.tokenize(row[9])
        for v in phrases:
            tokens = W_tokenizer.tokenize(v)
            maped = {row[4]: {w: row[9].index(w)} for w in tokens if pattern.match(w)}
Is it possible to achieve this in a dict comprehension?
Upvotes: 0
Views: 81
Reputation: 2814
Try this:
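import csv

def parse_phrase(phrase):
    tokens = phrase.split(' ')  # or whatever way you want to tokenize the phrase
    # Note: str.index does a substring search and returns the first hit,
    # so a repeated word (or one contained in an earlier word) gets that first offset.
    return {w: phrase.index(w) for w in tokens}

with open('FR_test.csv', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile, delimiter="\t", skipinitialspace=True)
    maped = {row[4]: parse_phrase(row[9]) for row in reader}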
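One caveat with str.index is that it searches for substrings, so in the first sample line "en" would be found inside "vendeur" rather than at the standalone word. A possible variant, sketched below, takes the offsets directly from regex matches instead. The WORD pattern, the word_offsets helper and the in-memory sample are assumptions made for illustration, not part of the original question; offsets are 0-based, and a repeated word keeps its first position because a dict cannot hold the same key twice.

```python
import csv
import io
import re

# Assumption: a "word" is any run of Unicode letters (accented ones included).
WORD = re.compile(r"[^\W\d_]+")

def word_offsets(text):
    """Map each word to the 0-based offset of its first occurrence in text."""
    offsets = {}
    for match in WORD.finditer(text):
        # setdefault keeps the first offset when a word appears more than once
        offsets.setdefault(match.group(), match.start())
    return offsets

# Hypothetical in-memory stand-in for FR_test.csv (tab-separated, quoted fields)
sample = (
    '"voc_sales_dac"\t"QVN"\t"BE"\t"FR"\t"21513287expe"\t"21513287"\t"expe"\t"10"\t"7"\t'
    '"vehicule livrée mais vendeur en congé donc vehicule receptioné plus tard"\n'
    '"voc_sales_dac"\t"QVN"\t"CH"\t"FR"\t"21207010reco"\t"21207010"\t"reco"\t"10"\t"10"\t'
    '"A ma fille"\n'
)

reader = csv.reader(io.StringIO(sample), delimiter="\t", skipinitialspace=True)
# Outer dict comprehension: one entry per row, keyed by field 5
maped = {row[4]: word_offsets(row[9]) for row in reader}
```

With a real file, io.StringIO(sample) would be replaced by the open(...) call used above. If every position of a repeated word is needed, the inner values could be lists of offsets instead of single integers.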
Upvotes: 1