spaCy Matcher for address recognition in Spanish text

Question

I would like to capture the addresses in (Spanish) legal documents like:

import spacy
from spacy.matcher import Matcher
nlp = spacy.load("es_core_news_lg")

texto = "... domiciliado en calle 160 Nº 765 piso 2 dpto A, La Plata, don Ricardo Fabián ROSENFELD, Documento Nacional de Identidad 14.464.003 con domicilio legal en calle Belfast Nº 1435 Tandil, para que, ... "

doc = nlp(texto)

The output should be something like:

['160 Nº 765 piso 2 dpto A, La Plata', 'Belfast Nº 1435 Tandil']

I think the matcher should use the fact that the relevant information starts after the word 'calle' and ends with the name of the city, which the model tags as a LOC entity:

gpe = [ee.text for ee in doc.ents if ee.label_ == 'LOC']
gpe

['La Plata', 'Belfast Nº', 'Tandil']

I thought that the algorithm should be something like:

Look for the word 'calle'
Take the name in the gpe list that is as far as possible from 'calle' but before the next appearance of 'calle'.
Take all the text between these two words.
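
The three steps above can be sketched in plain Python. This is only an illustration under assumptions: `extract_addresses` is a hypothetical helper (not part of spaCy), and the hard-coded `locations` list stands in for the LOC entities returned by the model. Splitting on the substring 'calle' would also split words that merely contain it, so a real implementation would probably want token-level matching.

```python
def extract_addresses(text, locations, keyword="calle"):
    """Return the text between each `keyword` and the last known
    location that appears before the next `keyword`."""
    addresses = []
    # Step 1: split on the keyword; the first chunk precedes any address.
    for chunk in text.split(keyword)[1:]:
        # Step 2: find the location that ends furthest into this chunk.
        end = -1
        for loc in locations:
            pos = chunk.rfind(loc)
            if pos != -1:
                end = max(end, pos + len(loc))
        # Step 3: keep everything from the keyword up to that location.
        if end != -1:
            addresses.append(chunk[:end].strip())
    return addresses


texto = ("... domiciliado en calle 160 Nº 765 piso 2 dpto A, La Plata, "
         "don Ricardo Fabián ROSENFELD, Documento Nacional de Identidad "
         "14.464.003 con domicilio legal en calle Belfast Nº 1435 Tandil, "
         "para que, ... ")

# Assumption: these are the LOC entities the model returned for `texto`.
locations = ["La Plata", "Belfast Nº", "Tandil"]

print(extract_addresses(texto, locations))
# ['160 Nº 765 piso 2 dpto A, La Plata', 'Belfast Nº 1435 Tandil']
```

A chunk with no known location is skipped rather than raising an error, which seems the safer choice for noisy NER output.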

My problem is that I do not know how to define a Matcher that does this.

Update: I solved it with the following function:

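def domicilios(documento: str) -> list:
    """
    Identify the address of each person in the document.
    """
    domicilios = []
    # Each address starts right after the word 'calle'.
    for texto in documento.split('calle')[1:]:
        doc = nlp(texto)
        gpe = [ee.text for ee in doc.ents if ee.label_ == 'LOC']
        if not gpe:
            continue  # no location recognized in this chunk
        # The last location before the next 'calle' is taken as the city.
        ciudad = gpe[-1]
        domicilios.append([texto[:texto.find(ciudad)], ciudad])
    return domicilios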

domicilios(texto)

Anyway, I still think there should be a way to solve it with spaCy exclusively.
