Reputation: 355
I would like to capture the addresses in (Spanish) legal documents like:
import spacy
from spacy.matcher import Matcher
nlp=spacy.load("es_core_news_lg")
texto = "... domiciliado en calle 160 Nº 765 piso 2 dpto A, La Plata, don Ricardo Fabián ROSENFELD, Documento Nacional de Identidad 14.464.003 con domicilio legal en calle Belfast Nº 1435 Tandil, para que, ... "
doc = nlp(texto)
so the output should be something like:
['160 Nº 765 piso 2 dpto A, La Plata', 'Belfast Nº 1435 Tandil']
I think that the matcher should use the fact that the relevant information starts after the word 'calle' and ends with the name of the city which is recognized by:
gpe = [ee.text for ee in doc.ents if ee.label_ == 'LOC']
gpe
['La Plata', 'Belfast Nº', 'Tandil']
I thought that the algorithm should be something like:
My problem is that I do not know how to define a Matcher like this one.
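The idea described above (start right after 'calle', stop at the city name) can be sketched in plain Python before translating it into a Matcher. This is only a rough illustration: the function name `extract_addresses` and the hard-coded city list are assumptions, and in practice the cities would come from the model's LOC entities. Splitting on the bare string 'calle' would also cut inside longer words that contain it.

```python
def extract_addresses(text, cities, trigger="calle"):
    """Return the text after each trigger word, up to and including the first known city."""
    addresses = []
    # Every chunk after the first one starts right after a trigger word.
    for chunk in text.split(trigger)[1:]:
        # Positions where each known city ends inside this chunk.
        ends = [chunk.find(city) + len(city) for city in cities if city in chunk]
        if ends:
            # Cut at the city that ends earliest.
            addresses.append(chunk[:min(ends)].strip())
    return addresses


texto = ("... domiciliado en calle 160 Nº 765 piso 2 dpto A, La Plata, don Ricardo "
         "Fabián ROSENFELD, Documento Nacional de Identidad 14.464.003 con domicilio "
         "legal en calle Belfast Nº 1435 Tandil, para que, ... ")
# Hypothetical city list; the question obtains it from doc.ents instead.
print(extract_addresses(texto, ["La Plata", "Tandil"]))
```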
Update: I solved it with the following function:
def domicilios(documento: str) -> list:
    """
    Identify the address of each person in the document.
    """
    direcciones = []
    # Each chunk after the first one begins right after the word 'calle'
    for texto in documento.split('calle')[1:]:
        doc = nlp(texto)
        gpe = [ee.text for ee in doc.ents if ee.label_ == 'LOC']
        if not gpe:
            continue  # no location recognized in this chunk
        ciudad = gpe[-1]  # take the last location as the city
        direcciones.append([texto[:texto.find(ciudad)], ciudad])
    return direcciones

domicilios(texto)
Anyway, I still think there should be a way to solve it with spaCy alone.
Upvotes: 0
Views: 380
Reputation: 106
First, create a matcher and the document object:
import spacy
from spacy.matcher import Matcher
nlp=spacy.load("es_core_news_lg")
texto = "... domiciliado en calle 160 Nº 765 piso 2 dpto A, La Plata, don Ricardo Fabián ROSENFELD, Documento Nacional de Identidad 14.464.003 con domicilio legal en calle Belfast Nº 1435 Tandil, para que, ... "
doc = nlp(texto)
gpe = [ee.text for ee in doc.ents if ee.label_ == 'LOC']
Next, you need a list of triggers that tell the matcher where to stop. Each token pattern in a Matcher describes a single token, while a location may span several words, so use the last word of each location:
gpe_ends = [loc.split()[-1] for loc in gpe]
Now, you can create a matcher that follows your algorithm:
pattern = [{"LOWER": "calle"},
           {"TEXT": {"NOT_IN": ["calle"]}, "OP": "*"},
           {"TEXT": {"IN": gpe_ends}}]
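The "OP": "*" lets the middle token pattern match zero or more tokens, so a match can extend from "calle" up to one of the location end words in gpe_ends, without crossing another "calle". Note that the Matcher returns every span that fits the pattern, not only the longest one. The rest follows the spaCy documentation.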
m = Matcher(nlp.vocab)
m.add("address", [pattern])
matches = m(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)
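Since the Matcher reports every span that satisfies the pattern, a single "calle" can yield several overlapping matches, for instance one ending at "Nº" and one ending at the city. Below is a minimal sketch of keeping only the longest match per start token, in plain Python. The helper name `longest_per_start` and the sample tuples are hypothetical; the tuples are merely shaped like the `(match_id, start, end)` triples returned above. In spaCy v3, passing `greedy="LONGEST"` to `Matcher.add` may make such a helper unnecessary.

```python
def longest_per_start(matches):
    """Keep, for each start index, only the match with the largest end index."""
    best = {}
    for match_id, start, end in matches:
        if start not in best or end > best[start][2]:
            best[start] = (match_id, start, end)
    # Return the surviving matches in document order.
    return [best[start] for start in sorted(best)]


# Hypothetical matcher output: two overlapping matches for each "calle" token.
sample = [(1, 3, 6), (1, 3, 14), (1, 30, 33), (1, 30, 36)]
print(longest_per_start(sample))
```

Applied to the real output, this would be used as `for _, start, end in longest_per_start(m(doc)): print(doc[start:end].text)`.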
Upvotes: 1