Drethax

Reputation: 75

Is it possible to extract specific ({Architect, Building}) information from unstructured text chunks using NLP?

I am working on a task to extract architects and their buildings from unstructured pieces of text of varying sizes. I started with an NLP tool called spaCy, but the entity labels it assigns are sometimes mixed up. The relevant labels are:

FAC Buildings, airports, highways, bridges, etc.

ORG Companies, agencies, institutions, etc.

GPE Countries, cities, states.

LOC Non-GPE locations, mountain ranges, bodies of water.

Building names fall into all 4 of those labels. My job would be much easier if I could get only FAC for building names, but it looks like that is not possible, or at least I have not been able to make it work.

The question is: is it even possible to use NLP tools to extract such information tuples (in my case {Architect, Building}) from a chunk of text?

Edit: Some things I have done

The following are examples of the texts I am using at the moment:

He renovated Fatih Mosque and built Laleli Mosque in the name of Sultan Mustafa III

Mehmed Tahir Ağa built Hamidiyye Complex in Bahçekapı for Sultan Abdülhamid I.

I pass those texts to spaCy as data. Here is the code:

import re

# Assumes `nlp` (a loaded spaCy pipeline) and `data` (a list of strings)
# are defined earlier.

# Place names to drop from the results (manual filtering)
EXCLUDED = {"Istanbul", "Egypt", "Hicaz", "Palestine", "Syria", "Balkans", "Albania", "Malta", "Spain", "Bosnia", "Frengistan", "Kırım", "Belgrade", "Damascus"}

for text in data:
    # Strip parenthesized asides before running the pipeline
    text = re.sub(r'\([^()]*\)', '', text)
    doc = nlp(text)

    # Extract ORG, GPE, LOC and FAC labels from phrases
    for entity in doc.ents:
        if entity.label_ in ('ORG', 'GPE', 'LOC', 'FAC') and entity.text not in EXCLUDED:
            print(entity.text, entity.label_)

Output is:

Laleli Mosque ORG

Hamidiyye Complex ORG

Bahçekapı for Sultan Abdülhamid I. ORG

Upvotes: 0

Views: 167

Answers (1)

oneextrafact

Reputation: 189

It depends on how closely the text you're working with resembles the data that spaCy's default models were trained on. If the two are very different, you might have to train your own model instead of using theirs. The team behind spaCy (explosion.ai) has a paid annotation tool that can help you do this (prodi.gy). That said, what you want to do is probably possible, but putting together a training set without tool support is not an easy thing to do.
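To give a rough idea of what putting together a training set involves, here is a minimal sketch that turns hand-picked mentions into character-offset annotations. The `annotate` helper, the `TRAIN_DATA` name, and the choice of the `FAC` and `PERSON` labels are my own assumptions for illustration; the `(text, {"entities": [(start, end, label)]})` layout is the character-span style that spaCy's NER training examples have used, but check the documentation for your spaCy version before relying on it.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)


def annotate(text: str, mentions: List[Tuple[str, str]]) -> Tuple[str, Dict[str, List[Span]]]:
    """Hypothetical helper: convert (surface string, label) pairs into offsets.

    Only the first occurrence of each surface string is annotated, so
    repeated mentions in one text would need extra handling.
    """
    spans: List[Span] = []
    for surface, label in mentions:
        start = text.find(surface)
        if start == -1:
            raise ValueError(f"{surface!r} not found in text")
        spans.append((start, start + len(surface), label))
    return text, {"entities": sorted(spans)}


# The two example sentences from the question, labelled by hand.
TRAIN_DATA = [
    annotate(
        "He renovated Fatih Mosque and built Laleli Mosque in the name of Sultan Mustafa III",
        [("Fatih Mosque", "FAC"), ("Laleli Mosque", "FAC")],
    ),
    annotate(
        "Mehmed Tahir Ağa built Hamidiyye Complex in Bahçekapı for Sultan Abdülhamid I.",
        [("Mehmed Tahir Ağa", "PERSON"), ("Hamidiyye Complex", "FAC")],
    ),
]

for text, annotations in TRAIN_DATA:
    for start, end, label in annotations["entities"]:
        print(text[start:end], label)
```

Two sentences are nowhere near enough to train on; the point is only that every building and architect mention has to be marked like this, which is why an annotation tool saves so much time.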

Upvotes: 0
