Joe

Reputation: 909

Using spaCy 3.0 to convert data from the old spaCy v2 format to the new spaCy v3 format

I have the variable trainData which has the following simplified format.

[

('Paragraph_A', {"entities": [(15, 26, 'DiseaseClass'), (443, 449, 'DiseaseClass'), (483, 496, 'DiseaseClass')]}),
('Paragraph_B', {"entities": [(969, 975, 'DiseaseClass'), (1257, 1271, 'SpecificDisease')]}),
('Paragraph_C', {"entities": [(0, 27, 'SpecificDisease')]})
]

I am trying to convert trainData to the .spacy format by first turning each example into a Doc and then adding it to a DocBin. The whole trainData file is accessible via GoogleDocs.

I tried to reproduce what is described in this tutorial, but it did not work for me. The tutorial is: Using spaCy 3.0 to build a custom NER model


I tried the following.

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en") # load a new spacy model
db = DocBin() # create a DocBin object

for text, annot in trainData: # data in previous format
    doc = nlp.make_doc(text) # create doc object from text
    ents = []
    for start, end, label in annot["entities"]: # add character indexes
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        ents.append(span)
    doc.ents = span # label the text with the ents
    db.add(doc)

db.to_disk("./train.spacy") # save the docbin object

But my code for converting the data from spaCy v2 to spaCy v3 is wrong somewhere. Running the snippet above gives this traceback: TypeError: 'spacy.tokens.token.Token' object is not iterable.

Upvotes: 3

Views: 2666

Answers (2)

Joe

Reputation: 909

I found the problem in the following abstract's entities:

[Machado-Joseph disease, Machado-Joseph disease, MJD, MJD, MJD, MJD, Huntington disease, HD, HD, MJD, Machado-Joseph disease, Machado-Joseph disease, MJD, MJD, MJD, MJD, Huntington disease, HD, HD, MJD]

These entities belong to the following abstract:

8528200|t|Evidence for inter-generational instability in the CAG repeat in the MJD1 gene and for conserved haplotypes at flanking markers amongst Japanese and Caucasian subjects with Machado-Joseph disease.
8528200|a|The size of the (CAG)n repeat array in the 3' end of the MJD1 gene and the haplotype at a series of microsatellite markers surrounding the MJD1 gene were examined in a large cohort of Japanese and Caucasian subjects affected with Machado-Joseph disease (MJD). Our data provide five novel observations. First, MJD is associated with expansion fo the array from the normal range of 14-37 repeats to 68-84 repeats in most Japanese and Caucasian subjects, but no subjects were observed with expansions intermediate in size between those of the normal and MJD affected groups. Second, the expanded allele associated with MJD displays inter-generational instability, particularly in male meioses, and this instability was associated with the clinical phenomenon of anticipation. Third, the size of the expanded allele is not only inversely correlated with the age-of-onset of MJD (r = -0.738, p < 0.001), but is also correlated with the frequency of other clinical features [e.g. pseudoexophthalmos and pyramidal signs were more frequent in subjects with large repeats (p < 0.001 and p < 0.05 respectively)]. Fourth, the disease phenotype is significantly more severe and had an early age of onset (16 years) in a subject homozygous for the expanded allele, which contrasts with Huntington disease and suggests that the expanded allele in the MJD1 gene could exert its effect either by a dominant negative effect (putatively excluded in HD) or by a gain of function effect as proposed for HD. Finally, Japanese and Caucasian subjects affected with MJD share haplotypes at several markers surrounding the MJD1 gene, which are uncommon in the normal Japanese and Caucasian population, and which suggests the existence either of common founders in these populations or of chromosomes susceptible to pathologic expansion of the CAG repeat in the MJD1 gene.
8528200 173 195 Machado-Joseph disease  SpecificDisease D017827
8528200 427 449 Machado-Joseph disease  SpecificDisease D017827
8528200 451 454 MJD SpecificDisease D017827
8528200 506 509 MJD SpecificDisease D017827
8528200 748 751 MJD Modifier    D017827
8528200 813 816 MJD SpecificDisease D017827
8528200 1067    1070    MJD SpecificDisease D017827
8528200 1470    1488    Huntington disease  SpecificDisease D006816
8528200 1628    1630    HD  SpecificDisease D006816
8528200 1680    1682    HD  SpecificDisease D006816
8528200 1739    1742    MJD SpecificDisease D017827

Here t marks the title and a marks the abstract. The two need to be concatenated, because the entity offsets run across both.
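The concatenation step can be sketched in plain Python. This is only a minimal sketch: `parse_pubtator` is a hypothetical helper name, and it assumes that the annotation lines are tab-separated (the listing above shows spaces, presumably collapsed tabs) and that offsets count the title, one separator character, and then the abstract. That second assumption matches the offsets in the example above, where the title is 196 characters long and the first abstract entity starts at 427.

```python
def parse_pubtator(record):
    """Turn one PubTator-style record into (text, {"entities": [...]}).

    Hypothetical helper; assumes tab-separated annotation lines.
    """
    title, abstract, entities = "", "", []
    for line in record.strip().splitlines():
        if not line.strip():
            continue  # skip blank lines inside the record
        parts = line.split("|", 2)
        if len(parts) == 3 and parts[1] == "t":
            title = parts[2]
        elif len(parts) == 3 and parts[1] == "a":
            abstract = parts[2]
        else:
            # Annotation line: id, start, end, mention, label, concept id.
            fields = line.split("\t")
            entities.append((int(fields[1]), int(fields[2]), fields[4]))
    # Assumption: offsets count the title, ONE separator character, then the abstract.
    text = f"{title} {abstract}"
    return text, {"entities": entities}


# Tiny made-up record in the same layout as the abstract above.
record = (
    "1|t|Study of MJD.\n"
    "1|a|MJD is rare.\n"
    "1\t9\t12\tMJD\tSpecificDisease\tD017827\n"
    "1\t14\t17\tMJD\tSpecificDisease\tD017827"
)
text, annot = parse_pubtator(record)
for start, end, label in annot["entities"]:
    print(text[start:end], label)
```

Slicing the text with each offset pair, as the loop at the end does, is a quick way to confirm that the separator assumption holds for real data before building any Doc objects.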


import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

def converter(data, outputFile):
    """
    Converts data to the spaCy v3 binary format (.spacy).
    Inputs:
        data: list of (text, {'entities': [(start, end, label), (start, end, label)]}) tuples
        outputFile: output file name, without the extension
    Outputs:
        writes {outputFile}.spacy and returns a short summary string
    """
    nlp = spacy.blank("en") # load a new spacy model
    doc_bin = DocBin() # create a DocBin object

    for text, annot in tqdm(data): # data in previous format
        doc = nlp.make_doc(text) # create doc object from text
        ents = []

        for start, end, label in annot["entities"]: # add character indexes
            # supported modes: strict, contract, expand
            span = doc.char_span(start, end, label=label, alignment_mode="strict")
            # char_span returns None when the offsets do not align with token
            # boundaries; those entities are skipped
            if span is not None:
                ents.append(span)
        try:
            doc.ents = ents # label the text with the ents
        except ValueError:
            # duplicate or overlapping spans are rejected here; the doc is then
            # stored without entities. This happens for the abstract with:
            # [Machado-Joseph disease, Machado-Joseph disease, MJD, MJD, MJD, MJD, Huntington disease,
            # HD, HD, MJD, Machado-Joseph disease, Machado-Joseph disease, MJD, MJD, MJD, MJD,
            # Huntington disease, HD, HD, MJD]
            pass
        doc_bin.add(doc)

    doc_bin.to_disk(f"./{outputFile}.spacy") # save the docbin object
    return f"Processed {len(doc_bin)}"

The function converter() works, but it silently drops the entities of the abstract mentioned above. I still do not know how to make spaCy accept such a case, where the same entities are repeated, instead of just ignoring it.
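One possible way to keep those entities, assuming the failure really comes from the same (start, end, label) tuple appearing twice in the list, is to deduplicate the annotations before building spans. The sketch below is plain Python and `dedupe_entities` is a hypothetical helper name, not part of spaCy. It drops exact duplicates and, when two spans overlap, keeps the longer one.

```python
def dedupe_entities(entities):
    """Drop exact duplicates and overlapping (start, end, label) tuples.

    Longer spans win over shorter ones; ties go to the earlier start.
    """
    # set() removes exact duplicates; sort longest-first for the overlap check.
    unique = sorted(set(entities), key=lambda e: (-(e[1] - e[0]), e[0], e[2]))
    kept = []
    for start, end, label in unique:
        # Keep a span only if it does not overlap any span already kept.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


entities = [
    (173, 195, "SpecificDisease"),
    (427, 449, "SpecificDisease"),
    (451, 454, "SpecificDisease"),
    (173, 195, "SpecificDisease"),  # repeated annotation
    (427, 449, "SpecificDisease"),  # repeated annotation
]
print(dedupe_entities(entities))
```

Inside converter() this would mean looping over dedupe_entities(annot["entities"]) instead of annot["entities"]. spaCy also ships spacy.util.filter_spans, which does a similar job on Span objects after char_span has been called, so that may be the simpler option if the helper above is more than is needed.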

Upvotes: 1

polm23

Reputation: 15593

You have a minor bug: doc.ents is assigned span (a single Span) where the list ents was intended. Check the XXX for the changed line.

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en") # load a new spacy model
db = DocBin() # create a DocBin object

for text, annot in trainData: # data in previous format
    doc = nlp.make_doc(text) # create doc object from text
    ents = []
    for start, end, label in annot["entities"]: # add character indexes
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        ents.append(span)
    #XXX FOLLOWING LINE CHANGED
    doc.ents = ents # label the text with the ents
    db.add(doc)

db.to_disk("./train.spacy") # save the docbin object

Upvotes: 4
