Aus_10

Reputation: 780

Create a spaCy Doc that has sections

I'm wondering what people have done in spaCy when they want to break a doc up into different spans. For example, say I have created a doc object from the text below. For the task I'm doing, I want to index the different sections while keeping the original object intact.

doc = nlp("""
Patient History:
    This is paragraph 1.
Assessment:
    This is paragraph 2.
Signature:
    This is paragraph 3.
""")

Then I'd like it parsed so that something like:

doc.sections_ 

would yield

["Patient History", "Assessment", "Signature"]

Upvotes: 2

Views: 717

Answers (2)

Aus_10

Reputation: 780

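This would obviously have to come as a final step, and it isn't optimized for use in a pipeline, but here's my slightly hacky solution. It assumes the section headers have already been tagged as entities with the label NOTESECTION.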

class ParsedNoteSections(object):
    """
    Parse notes into sections based on entity tags. All sections are returned
    as spans (slices) of the original doc object.
    """

    def __init__(self, doc):
        self.doc = doc

    def get_section_titles(self):
        """Return (entity, start, end) for each section header entity."""
        return [(e, e.start, e.end) for e in self.doc.ents if e.label_ == 'NOTESECTION']

    def original(self):
        """Retrieve the original doc object."""
        return self.doc

    def __repr__(self):
        return repr(self.doc)

    def parse_note_sections(self):
        """Use the section header entities as break-points to split the original doc.

        Input:
            None
        Output:
            List of dictionaries, one per section, each holding the section
            label and the span of the doc that the section covers.
        """
        section_titles = self.get_section_titles()

        # stopgap for possible errors
        assert len(section_titles) > 0

        # token index where each section begins; the first section also takes
        # any text that precedes the first header
        starts = [0] + [start for _, start, _ in section_titles[1:]]
        # each section ends where the next one begins; the last runs to the end of the doc
        ends = starts[1:] + [len(self.doc)]

        doc_section_spans = []
        for (label, _, _), start, end in zip(section_titles, starts, ends):
            doc_section_spans.append({'section_label': label, 'section_doc': self.doc[start:end]})

        assert len(doc_section_spans) == len(section_titles)

        return doc_section_spans

Upvotes: 1

polm23

Reputation: 15623

spaCy doesn't have any support for "sections". They're not a universal feature of documents, and how to define them varies wildly depending on whether you're dealing with a novel, an academic paper, a newspaper, etc.

The easiest thing to do is to split up the document yourself before feeding it to spaCy. If it's formatted like your example, that should be easy to do, for instance by using the indentation.
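
As a rough sketch of that pre-splitting idea, assuming a header is any unindented line that ends in a colon (that rule, and the names `HEADER` and `split_sections`, are my own for illustration and not anything from spaCy):

```python
import re

# Assumption: a section header is an unindented line ending in a colon.
HEADER = re.compile(r"^(\S.*?):\s*$")


def split_sections(text):
    """Return a list of (title, body) pairs in document order.

    Lines that appear before the first header are dropped.
    """
    sections = []
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            # start a new section with an empty body
            sections.append((match.group(1), []))
        elif sections and line.strip():
            # indented, non-empty line: belongs to the current section
            sections[-1][1].append(line.strip())
    return [(title, "\n".join(body)) for title, body in sections]


text = """
Patient History:
    This is paragraph 1.
Assessment:
    This is paragraph 2.
Signature:
    This is paragraph 3.
"""

sections = split_sections(text)
titles = [title for title, _ in sections]
print(titles)
```

Each body string can then be passed to nlp on its own, at the cost of no longer having a single Doc for the whole note.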

If you really want to have just one Doc object, you should be able to manage it with a pipeline extension to spaCy. See the documentation here.
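
For the single-Doc route, here is a rough sketch using a custom extension attribute with a getter, which is one possible reading of "extension" here and not necessarily the only way to do it. The attribute names `sections` and `section_titles`, the `get_sections` helper, and the header pattern are all made up for illustration. It assumes spaCy v3 and uses a blank English pipeline, so no model download is needed.

```python
import re

import spacy
from spacy.tokens import Doc

# Assumption: a section header is an unindented line ending in a colon.
HEADER = re.compile(r"^(\S[^\n]*?):[ \t]*$", re.MULTILINE)


def get_sections(doc):
    """Return (title, Span) pairs; each Span is a slice of the original Doc."""
    headers = list(HEADER.finditer(doc.text))
    sections = []
    for i, match in enumerate(headers):
        # a section runs from its header to the start of the next header
        end_char = headers[i + 1].start() if i + 1 < len(headers) else len(doc.text)
        # "expand" snaps character offsets outward to token boundaries
        span = doc.char_span(match.start(), end_char, alignment_mode="expand")
        sections.append((match.group(1), span))
    return sections


def get_section_titles(doc):
    return [title for title, _ in get_sections(doc)]


# force=True only so the snippet can be re-run in the same session
Doc.set_extension("sections", getter=get_sections, force=True)
Doc.set_extension("section_titles", getter=get_section_titles, force=True)

nlp = spacy.blank("en")
doc = nlp(
    "Patient History:\n    This is paragraph 1.\n"
    "Assessment:\n    This is paragraph 2.\n"
    "Signature:\n    This is paragraph 3.\n"
)
print(doc._.section_titles)
```

Because the sections are Span objects, the original doc stays intact, and the token offsets of each section are available as span.start and span.end.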

Upvotes: 1
