Reputation: 982
I'm currently learning TensorFlow, and as a first project (after following the MNIST tutorials) I would like to create a model (probably an RNN) to do some basic string formatting.
I know I may not need something as complex as deep learning for the following case, but it's just for training myself.
I have a set of supposedly "clean address" strings from which I want to extract the actual clean address.
Here is the kind of transformation I want to get:
RUE DE MADAGASCAR --> RUE DE MADAGASCAR
ZI DE LA PLAINE
55 RUE DU 1ER SEPTEMBRE 1944 --> 55 RUE DU 1ER SEPTEMBRE 1944
ZONE INDUSTRIELLE RUE DE LA VALLEE B.P. 8 --> RUE DE LA VALLEE
BP 62 AVENUE BECQUEREL --> AVENUE BECQUEREL
291 VOIE ATLAS --> 291 VOIE ATLAS
12 RUE ARMAND BUSQUET ZONE INDUSTRIELLE --> 12 RUE ARMAND BUSQUET
DOSSIER MLOC 5 RUE AMABLE LOZAI --> 5 RUE AMABLE LOZAI
ZI CAEN CANAL -->
RUE DE L'EUROPE ZI PORTUAIRE --> RUE DE L'EUROPE
BP 5229 BOULEVARD HENRY BECQUEREL CAMPUS JULES HOROWITZ --> BOULEVARD HENRY BECQUEREL
GIE MONSIEUR GAUTIER BOULEVARD H. BECQUEREL BP 5027 --> BOULEVARD H. BECQUEREL
21 PLACE DE LA REPUBLIQUE --> 21 PLACE DE LA REPUBLIQUE
18 RUE DE LA GIRAFE --> 18 RUE DE LA GIRAFE
21 RUE DES GOUDRIERS --> 21 RUE DES GOUDRIERS
AVENUE STRASSBURGER --> AVENUE STRASSBURGER
7 RUE DE L'EGLISE --> 7 RUE DE L'EGLISE
1060 RUE LEON FOUCAULT ZI DE LA SPHERE --> 1060 RUE LEON FOUCAULT
If you need more examples: here is a link to a spreadsheet with 200 elements (I'm planning to expand it to 1000 - 5000 elements).
As you can see, there are a lot of recognizable patterns:
BP words and the 2 or 4 digits that come after
ZI, ZA or Zone d'activité ...
00 (Rue|Voie|Avenue|...) nameOfStreet
I'm trying to get an output string that is a part of the input string: the model should remove words based on the patterns described above.
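For reference, the patterns above can be written down as a rough rule-based baseline, which is also handy for comparing against whatever a learned model produces. This is only a sketch under assumptions: the keyword lists are taken from the examples in this post, and `clean_address`, `BP_RE` and `STREET_RE` are hypothetical names.

```python
import re

# Assumed keyword lists, taken from the examples above; real data would need more.
STREET_TYPES = r"RUE|VOIE|AVENUE|BOULEVARD|PLACE"
ZONE_MARKERS = r"ZI|ZA|ZONE"

# "BP 62", "B.P. 8": a post-office box marker followed by digits.
BP_RE = re.compile(r"\bB\.?P\.?\s*\d+\b")

# Optional house number, a street type, then the shortest run of text
# that stops before a zone marker or at the end of the string.
STREET_RE = re.compile(
    rf"\b(?:\d+\s+)?(?:{STREET_TYPES})\b.*?(?=\s+(?:{ZONE_MARKERS})\b|$)"
)


def clean_address(raw: str) -> str:
    """Return the street part of `raw`, or "" when no street is found."""
    # Drop the BP marker first so its digits are not mistaken for a house number,
    # then collapse the whitespace left behind.
    without_bp = " ".join(BP_RE.sub(" ", raw).split())
    match = STREET_RE.search(without_bp)
    return match.group(0) if match else ""


print(clean_address("ZONE INDUSTRIELLE RUE DE LA VALLEE B.P. 8"))
print(clean_address("1060 RUE LEON FOUCAULT ZI DE LA SPHERE"))
```

Rules like these miss anything not listed: in the "CAMPUS JULES HOROWITZ" example the trailing words stay in the output unless another keyword is added, which is the kind of case a learned model might handle better.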
I think I will go with an RNN-type graph, since it should detect things like: "there is a 'BP', so I'm not keeping this word, and if the next input is a 2- or 4-digit string I'm not keeping it either". I think there should be some kind of memory.
It all depends on the way I want to input my data. I think I have two or three ways of doing that:
The thing is:
If I input single words, how do I mark the string separation?
If I input the entire string, it seems a bit of a waste, since the system is only going to keep or remove single words.
Does the third option (mixing the two) even make sense?
Is it possible to train in batches and use the "batch part" to input multiple words, so that every batch represents an address?
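One common convention for this, sketched here as an assumption rather than the only option: each address is one sequence of word tokens, and a batch holds several addresses padded to a common length, with a mask marking which positions are real words. The string boundary is then simply the end of each row. `PAD` and `pad_batch` are hypothetical names.

```python
PAD = "<pad>"  # assumed padding token; any symbol absent from the data works


def pad_batch(addresses):
    """Split each address into words and pad all rows to the same length.

    Returns (padded, mask): mask holds 1 for a real word and 0 for padding.
    """
    token_lists = [address.split() for address in addresses]
    max_len = max((len(tokens) for tokens in token_lists), default=0)
    padded = [tokens + [PAD] * (max_len - len(tokens)) for tokens in token_lists]
    mask = [[1] * len(tokens) + [0] * (max_len - len(tokens)) for tokens in token_lists]
    return padded, mask


padded, mask = pad_batch(["BP 62 AVENUE BECQUEREL", "291 VOIE ATLAS"])
for row, row_mask in zip(padded, mask):
    print(row, row_mask)
```

With this layout the batch dimension indexes addresses and the time dimension indexes words, so one batch carries many addresses instead of one.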
Also, I wonder whether the weights of the nodes in my system are going to be all 0s and 1s (since it can only keep or remove single words) or intermediate values, like a probability of keeping the word.
Thanks a lot for reading through all that. Any help would be appreciated, especially regarding the general direction I'm heading in and the way of inputting my data into the graph.
Upvotes: 0
Views: 221
Reputation: 451
There are two ways of approaching the problem that immediately come to mind:
If you're just starting, I would recommend the sequence tagging model. If you want to do this, the steps I would follow are:
For an example of how to do sequence tagging in TensorFlow, take a look at: https://github.com/guillaumegenthial/sequence_tagging
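To make the tagging idea concrete, here is a minimal sketch of how the question's input/output pairs could be turned into per-word labels for such a model. It assumes the clean address is always one contiguous run of words from the input, which holds for the examples in the question; `tag_tokens` and the `KEEP`/`DROP` label names are hypothetical.

```python
def tag_tokens(raw, clean):
    """Label each word of `raw` so that the KEEP words spell out `clean`.

    Assumes `clean` is a contiguous run of words inside `raw`.
    """
    tokens = raw.split()
    target = clean.split()
    n, m = len(tokens), len(target)
    # Slide a window of the target length over the input words.
    for start in range(n - m + 1):
        if tokens[start:start + m] == target:
            return [
                (token, "KEEP" if start <= i < start + m else "DROP")
                for i, token in enumerate(tokens)
            ]
    raise ValueError("clean address is not a contiguous part of the input")


for pair in tag_tokens("BP 62 AVENUE BECQUEREL", "AVENUE BECQUEREL"):
    print(pair)
```

An empty target, as in the "ZI CAEN CANAL" example, labels every word DROP. A tagging model trained on these labels typically outputs a score per word, which is then thresholded into keep or drop.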
Upvotes: 2