Federico Vega

Reputation: 355

spaCy Matcher for address recognition in Spanish text

I would like to capture the addresses in (Spanish) legal documents like:

import spacy
from spacy.matcher import Matcher
nlp=spacy.load("es_core_news_lg")

texto = "... domiciliado en calle 160 Nº 765 piso 2 dpto A, La Plata, don Ricardo Fabián ROSENFELD, Documento Nacional de Identidad 14.464.003 con domicilio legal en calle Belfast Nº 1435 Tandil, para que, ... "

doc = nlp(texto)

so the output should be something like:

['160 Nº 765 piso 2 dpto A, La Plata', 'Belfast Nº 1435 Tandil']

I think the matcher should use the fact that the relevant information starts after the word 'calle' and ends with the name of the city, which is recognized by:

gpe = [ee.text for ee in doc.ents if ee.label_ == 'LOC']
gpe

['La Plata', 'Belfast Nº', 'Tandil']

I thought that the algorithm should be something like:

  1. Look for the word 'calle'
  2. Take the name in the gpe list that is as far as possible from 'calle' but before the next occurrence of 'calle'.
  3. Take all the text between these two words.
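Outside of the Matcher, these three steps can be sketched roughly in plain Python. This is only a sketch under assumptions: the location names are taken as an already available list of strings (copied here from the NER output shown above), and the function name `addresses_between` is purely illustrative.

```python
def addresses_between(text, locations, keyword="calle"):
    """Return the text after each `keyword` up to the last known location
    that appears before the next `keyword`."""
    results = []
    # Step 1: every chunk after the first starts right after `keyword`.
    for chunk in text.split(keyword)[1:]:
        # Step 2: find the location whose end lies farthest into the chunk.
        best_end = -1
        for loc in locations:
            pos = chunk.rfind(loc)
            if pos != -1:
                best_end = max(best_end, pos + len(loc))
        # Step 3: keep everything up to that location; skip chunks without one.
        if best_end != -1:
            results.append(chunk[:best_end].strip())
    return results


texto = ("... domiciliado en calle 160 Nº 765 piso 2 dpto A, La Plata, don Ricardo "
         "Fabián ROSENFELD, Documento Nacional de Identidad 14.464.003 con domicilio "
         "legal en calle Belfast Nº 1435 Tandil, para que, ... ")
# Assumed NER output, copied from the list shown in the question.
locations = ["La Plata", "Belfast Nº", "Tandil"]
print(addresses_between(texto, locations))
```

Note that splitting on the literal string 'calle' is case-sensitive and would also split inside longer words that contain it, so this is a starting point rather than a robust solution.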

My problem is that I do not know how to define a Matcher like this one.

Update: I solved it with the following function:

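def domicilios(documento: str) -> list:
    """
    Function that identifies the addresses for each person
    """
    domicilios = []
    # Each chunk after the first starts right after the word 'calle'
    for texto in documento.split('calle')[1:]:
        doc = nlp(texto)
        gpe = [ee.text for ee in doc.ents if ee.label_ == 'LOC']
        if not gpe:
            continue  # no location recognized in this chunk
        ciudad = gpe[-1]
        # Store the text before the city together with the city itself
        domicilios.append([texto[:texto.find(ciudad)], ciudad])
    return domicilios

domicilios(texto)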

Anyway, I still think there should be a way to solve it with spaCy exclusively.

Upvotes: 0

Views: 380

Answers (1)

Elijah Cox

Reputation: 106

First, load the model and create the document object:

import spacy
from spacy.matcher import Matcher
nlp=spacy.load("es_core_news_lg")

texto = "... domiciliado en calle 160 Nº 765 piso 2 dpto A, La Plata, don Ricardo Fabián ROSENFELD, Documento Nacional de Identidad 14.464.003 con domicilio legal en calle Belfast Nº 1435 Tandil, para que, ... "
doc = nlp(texto)
gpe = [ee.text for ee in doc.ents if ee.label_ == 'LOC']

Next, you need a list of triggers that tell the matcher where to stop. Each entry in a Matcher pattern describes a single token, while a location may span several words, so use the last word of each location:

gpe_ends = [loc.split()[-1] for loc in gpe]
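As a quick check of what this step produces, here it is applied to the location list reported in the question (hard-coded below as an assumption, since the actual NER output may differ between model versions):

```python
# Location strings as reported in the question for this text.
gpe = ["La Plata", "Belfast Nº", "Tandil"]

# Keep only the last whitespace-separated word of each location.
gpe_ends = [loc.split()[-1] for loc in gpe]
print(gpe_ends)
```

Because the model labelled 'Belfast Nº' as a location, 'Nº' ends up among the stop words, so the pattern may also stop there if the tokenizer keeps it as a single token.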

Now, you can create a matcher that follows your algorithm:

pattern = [{"LOWER": "calle"},                                # start at 'calle'
           {"LOWER": {"NOT_IN": ["calle"]}, "OP": "*"},       # any tokens except another 'calle'
           {"TEXT": {"IN": gpe_ends}}]                        # stop at the last word of a location

The "OP": "*" lets the middle token pattern match zero or more tokens, so a match can extend as far as possible before stopping at a location from GPE (but without crossing another "calle"). The rest is from the spaCy documentation.

m = Matcher(nlp.vocab)
m.add("address", [pattern])
matches = m(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)
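Keep in mind that the Matcher reports every span that satisfies the pattern, so a pattern with "OP": "*" typically yields shorter overlapping matches alongside the full address. spaCy has built-in ways to handle this (the `greedy="LONGEST"` argument of `Matcher.add` in spaCy v3, or `spacy.util.filter_spans`). The plain-Python sketch below only illustrates the idea; the helper name `keep_longest` and the token offsets are made up for the example.

```python
def keep_longest(matches):
    """Keep the longest matches, dropping any that overlap an already kept one."""
    kept = []
    # Visit the longest spans first; ties go to the earliest start.
    ordered = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    for match_id, start, end in ordered:
        # A span is kept only if it does not overlap any span kept so far.
        if all(end <= s or start >= e for _, s, e in kept):
            kept.append((match_id, start, end))
    # Return the survivors in document order.
    return sorted(kept, key=lambda m: m[1])


# Hypothetical (match_id, start, end) tuples, invented for illustration only.
matches = [(1, 3, 6), (1, 3, 12), (1, 25, 27), (1, 25, 30)]
print(keep_longest(matches))
```

With real Matcher output, the same function can be applied to `matches` before the printing loop.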

Upvotes: 1
