Reputation: 780
I'm wondering what people have done in spaCy when they want to break a doc up into different spans. For example, say I have created a doc object from the corpus below. For the task I'm doing, I want to index the different sections while keeping the original object intact.
doc = nlp("""
Patient History:
This is paragraph 1.
Assessment:
This is paragraph 2.
Signature:
This is paragraph 3.
""")
Then I'd like to have it parsed so that something like:
doc.sections_
would yield
["Patient History", "Assessment", "Signature"]
Upvotes: 2
Views: 717
Reputation: 780
This would obviously have to come in the final step, and it isn't optimized for a pipeline, but here's my slightly hacky solution.
class ParsedNoteSections(object):
    """
    Parse notes into sections based on entity tags. All sections are returned
    as Span objects (slices of the original doc).
    """

    def __init__(self, doc):
        self.doc = doc

    def get_section_titles(self):
        """Return (entity, start, end) for each section header."""
        return [(e, e.start, e.end) for e in self.doc.ents if e.label_ == 'NOTESECTION']

    def original(self):
        """Retrieve the original doc object."""
        return self.doc

    def __repr__(self):
        return repr(self.doc)

    def parse_note_sections(self):
        """Use section-header entities as break-points to split the original doc.

        Input:
            None
        Output:
            List of dictionaries, one per section, with the keys
            'section_label' and 'section_doc'.
        """
        section_titles = self.get_section_titles()
        # stopgap for possible errors
        assert len(section_titles) > 0
        doc_section_spans = []
        for idx, (section_label, label_start, _label_end) in enumerate(section_titles):
            # the first section starts at the beginning of the doc
            section_start = 0 if idx == 0 else label_start
            # a section runs up to the next header; the last one runs to the end of the doc
            if idx + 1 < len(section_titles):
                section_end = section_titles[idx + 1][1]
            else:
                section_end = len(self.doc)
            section_doc = self.doc[section_start:section_end]
            doc_section_spans.append({'section_label': section_label, 'section_doc': section_doc})
        assert len(doc_section_spans) == len(section_titles)
        return doc_section_spans
Upvotes: 1
Reputation: 15623
spaCy doesn't have any support for "sections" - they're not a universal feature of documents, and how to define them varies wildly depending on whether you're dealing with a novel, an academic paper, a newspaper, etc.
The easiest thing to do is to split up the document yourself before feeding it to spaCy. If it's formatted like your example, that should be easy to do, for example by using the indentation.
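As a rough sketch of that pre-splitting step: the snippet below assumes a section header is any line made up of a title followed by a colon, which fits the example in the question but is only a guess about real notes. The names HEADER_RE and split_sections are made up for illustration; nothing here comes from spaCy.

```python
import re

# Assumption: a header is a line that holds only a title and a trailing colon.
HEADER_RE = re.compile(r"^[ \t]*([^\n:]+?)[ \t]*:[ \t]*$", re.MULTILINE)


def split_sections(text):
    """Return (title, body) pairs, one per header line found in the text."""
    matches = list(HEADER_RE.finditer(text))
    sections = []
    for i, match in enumerate(matches):
        # The body runs from the end of this header to the start of the next one.
        if i + 1 < len(matches):
            body_end = matches[i + 1].start()
        else:
            body_end = len(text)
        sections.append((match.group(1), text[match.end():body_end].strip()))
    return sections


note = """
Patient History:
This is paragraph 1.
Assessment:
This is paragraph 2.
Signature:
This is paragraph 3.
"""

sections = split_sections(note)
print([title for title, _ in sections])
# → ['Patient History', 'Assessment', 'Signature']
```

Each body string could then be passed to nlp separately, at the cost of ending up with one Doc per section instead of a single Doc.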
If you really want to have just one Doc object, you should be able to manage it with a pipeline extension to spaCy. See the documentation here.
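One possible shape for the single-Doc approach, as a minimal sketch rather than a recommended design: register custom attributes on Doc with getters, so that doc._.section_titles and doc._.sections are computed on demand. This assumes spaCy v3 (for the alignment_mode argument of Doc.char_span) and, as a guess that fits the question's example, that a header is a line with a title followed by a colon. The attribute names and helper functions are hypothetical. Custom attributes live under doc._, so the closest equivalent to doc.sections_ is doc._.sections.

```python
import re

import spacy
from spacy.tokens import Doc

# Assumption: a header is a line that holds only a title and a trailing colon.
HEADER_RE = re.compile(r"^[ \t]*([^\n:]+?)[ \t]*:[ \t]*$", re.MULTILINE)


def get_section_titles(doc):
    """Return the header titles as plain strings."""
    return [m.group(1) for m in HEADER_RE.finditer(doc.text)]


def get_sections(doc):
    """Return one Span per section, running from its header to the next header."""
    headers = []
    for m in HEADER_RE.finditer(doc.text):
        # Map the title's character offsets onto token boundaries.
        span = doc.char_span(m.start(1), m.end(1), alignment_mode="expand")
        if span is not None:
            headers.append(span)
    starts = [header.start for header in headers]
    ends = starts[1:] + [len(doc)]
    return [doc[start:end] for start, end in zip(starts, ends)]


# force=True lets the snippet be re-run without an "already registered" error.
Doc.set_extension("section_titles", getter=get_section_titles, force=True)
Doc.set_extension("sections", getter=get_sections, force=True)

nlp = spacy.blank("en")  # blank pipeline, so no trained model is needed
doc = nlp("""
Patient History:
This is paragraph 1.
Assessment:
This is paragraph 2.
Signature:
This is paragraph 3.
""")

print(doc._.section_titles)
for section in doc._.sections:
    print(repr(section.text))
```

Because the sections are Span objects, they stay views into the original Doc, which remains intact. The getters recompute on every access, so for long documents it may be better to compute the spans once in a custom pipeline component and store them.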
Upvotes: 1