pioupiou1211

Reputation: 373

NLTK grammar for numbers

I'm writing code to analyse a list (or a dictionary/tuple) whose elements are strings or numbers, but I'm running into an issue: I can parse single-digit numbers (0 to 9) but not longer ones. Here is my code:

import nltk

grammaire = nltk.CFG.fromstring("""
    L -> OPEN CONTENT CLOSE
    OPEN -> "["
    CLOSE -> "]"
    CONTENT -> Element Seq |   
    Seq -> | S Element Seq
    S -> ","
    Element -> Word | nombre | T | L | D
    T -> "(" BeginTuple ")"
    BeginTuple -> ElementTuple S ElementTuple EndTuple
    EndTuple -> S ElementTuple |  
    ElementTuple -> nombre | T
    D -> "{" BeginDic "}"
    BeginDic -> ElementDic EndDic
    EndDic -> S ElementDic EndDic |
    ElementDic -> Key ":" Value
    Key -> Word
    Value -> nombre | T | L
    Word -> "Bonjour" | "Aurevoir" | "Bye" | "Cya" | "Coucou" | " " | "Hello" | "Hi" 
    nombre -> chiffre | chiffre nombre
    chiffre ->  '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
    """)

sent, res, elmt = "[{Bonjour:1,Hello:(1,2)}]", [], ''
c = '()[]{}:,'  # delimiter characters that become tokens of their own
for x in sent:
    if x in c:
        if len(elmt) == 0:
            res += [x,]
        else:
            #try: res += [int(elmt),]  # if it's a number, convert it to int
            #except: res += [elmt,]
            res += [elmt,]
            elmt = ""
            res += [x,]
    else:
        elmt += x
print(res)

The important lines are at the beginning, the ones with "chiffre" and "nombre". What am I doing wrong? I also need to do the same with strings (so chiffre would become ' "a" | "b" | "c"... ' and nombre would stay the same).

I tried to store the numbers in my list as int instead of str, but it doesn't work either (cf. the commented-out try/except). Of course, I then draw the tree of the parse.

Upvotes: 0

Views: 753

Answers (1)

alexis

Reputation: 50180

The narrow answer to your question is that your tokenizer groups multi-digit numbers as single tokens. If you tokenize each digit separately, it will work. More generally, you should tackle the task of tokenization more thoroughly; your code is too brittle to support things like treating quote-delimited strings as single tokens, for example.
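For instance, here is a rough sketch of a digit-level tokenizer; the helper name tokenize and the test sentence are only illustrations, and grammaire is the grammar from the question, reused unchanged:

import nltk

def tokenize(sent):
    delimiters = '()[]{}:,'
    tokens, word = [], ''
    for ch in sent:
        if ch in delimiters or ch.isdigit():
            if word:                  # flush any pending word first
                tokens.append(word)
                word = ''
            tokens.append(ch)         # delimiters and digits are single tokens
        else:
            word += ch
    if word:
        tokens.append(word)
    return tokens

tokens = tokenize("[{Bonjour:12,Hello:(1,2)}]")
# ['[', '{', 'Bonjour', ':', '1', '2', ',', 'Hello', ':', '(', '1', ',', '2', ')', '}', ']']

parser = nltk.ChartParser(grammaire)  # or whichever NLTK parser you already use
for tree in parser.parse(tokens):
    tree.pretty_print()
    break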

However: why are you trying to parse a string representation of an arbitrary Python list? Don't do it. If you're reading data you wrote yourself, write it out in a simpler form so that you can read it easily. E.g., does each record consist of a label and a list of numbers? Write each record as one space-delimited row; that's trivial to read in and parse, as in the sketch below.
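A minimal sketch of reading such a file back in (the file name records.txt and the label-plus-numbers layout are assumptions for illustration):

# Each line looks like: "Bonjour 1 2 3"
with open("records.txt") as f:
    for line in f:
        label, *numbers = line.split()
        numbers = [int(n) for n in numbers]
        print(label, numbers)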

For data with more complicated structure, use json to write out your file and read it back in. It handles all the parsing for you.
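A short sketch of that round trip, assuming a data.json file name; note that json has no tuple type, so tuples come back as lists:

import json

data = {"Bonjour": 1, "Hello": (1, 2)}

with open("data.json", "w") as f:
    json.dump(data, f)        # the tuple is serialized as a JSON array

with open("data.json") as f:
    restored = json.load(f)

print(restored)               # {'Bonjour': 1, 'Hello': [1, 2]}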

Upvotes: 1
