Retrieve many nested statements at the same time for a single object with lxml Python

Question

I am working with a large XML file from which I retrieve many different properties. I am now trying to retrieve the category attribute of each comment and connect it to the text between the tags. However, there are three different situations that I need to handle. XML example:


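<comment-list>
  <comment category="Derived from sampling site"> Peripheral blood </comment>
  <comment category="Transformant">
    <cv-term terminology="NCBI-Taxonomy" accession="10376">Epstein-Barr virus (EBV)</cv-term>
  </comment>
  <comment category="Sequence variation"> Hemizygous for FMR1 >200 CGG repeats (PubMed=25776194)</comment>
</comment-list>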

1. When <comment> does not have child tags underneath. Then I need to retrieve the comment category attribute and connect it with the text between the <comment> tags.
2. When <comment> has a <cv-term> tag nested underneath. Then I need to retrieve the comment category, the cv-term terminology, the cv-term accession and the text between the <cv-term> tags.
3. When <comment> has several tags nested underneath: <xref-list> - <xref> - <property-list> - <property>. In this case I need to retrieve the comment category, the xref database attribute, the xref accession attribute and the property value attribute.

I am using lxml to parse this XML, and I am struggling to wrap my head around how to solve case 2. Cases 1 and 3 work, but when an object has all three cases the output gets messed up.

I would like to receive the following output:

Derived from sampling site: Peripheral blood
Transformant: NCBI-Taxonomy, 10376, Epstein-Barr virus (EBV)
Sequence variation: Hemizygous for FMR1 >200 CGG repeats (PubMed=25776194)
Monoclonal antibody target: UniProtKB, Q5T5X7, Human BEND3

Here is my very messy code, which outputs the elements in the wrong order. It worked fine for cases 1 and 3, but once case 2 comes into play the output is ordered incorrectly:

comment_cat = att.xpath('.//comment-list/comment/@category')
comment_text = att.xpath('.//comment-list/comment/text()') 
cv_term = att.xpath('.//comment-list/comment/cv-term/text()')
xref = [a + ', ' + b for a, b in zip(att.xpath('.//comment-list/comment/xref-list/xref/@database'), att.xpath('.//comment-list/comment/xref-list/xref/@accession'))]
property_list = att.xpath('.//comment-list/comment/xref-list/xref/property-list/property/@value')
xref_property_list = [a + ', ' + b for a,b in zip(xref, property_list)]
empty_str_in_text = ['\n      ', '\n    ', '\n      ', '\n    ']
comment_texts_all = cv_term+comment_text+xref_property_list

for e in empty_str_in_text:
    if e in comment_texts_all:
        comment_texts_all.remove(e)    
key_values['Comments'] = ';; '.join([i + ': ' + j for i, j in zip(comment_cat, comment_texts_all)])

Output:

Derived from sampling site: Epstein-Barr virus (EBV);; 
Transformant:  Peripheral blood ;; 
Sequence variation:  Hemizygous for FMR1 >200 CGG repeats (PubMed=25776194) ;; 
Monoclonal antibody target: UniProtKB, Q5T5X7, Human BEND3

Alexandra Dudkina · Accepted Answer

Here is a slightly alternative approach:

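# "..." marks attribute values that are not shown in the post
xml = '''
<comment-list>
    <comment category="Derived from sampling site"> Peripheral blood </comment>
    <comment category="Transformant">
        <cv-term terminology="NCBI-Taxonomy" accession="10376">Epstein-Barr virus (EBV)</cv-term>
    </comment>
    <comment category="Sequence variation"> Hemizygous for FMR1 >200 CGG repeats (PubMed=25776194)</comment>
    <comment category="Monoclonal antibody target">
        <xref-list>
            <xref database="UniProtKB" accession="Q5T5X7">
                <property-list>
                    <property value="Human BEND3"/>
                </property-list>
            </xref>
        </xref-list>
    </comment>
    <comment category="Knockout cell">
        <method>KO mouse</method>
        <xref-list>
            <xref database="..." accession="...">
                <property-list>
                    <property value="..."/>
                </property-list>
            </xref>
        </xref-list>
    </comment>
</comment-list>
'''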

from lxml import etree as ET

tree = ET.fromstring(xml)

result = ''

for comment in tree.iter('comment'):
    result += f"{comment.get('category')}: "
    cv_term = comment.find('cv-term')
    xref_list = comment.find('xref-list')
    method = comment.find('method')
    if len(list(comment)) == 0:
        result += comment.text
    elif cv_term is not None:
        result += ', '.join([cv_term.get('terminology'), cv_term.get('accession'), cv_term.text])
    elif xref_list is not None and method is None:
        result += ', '.join([xref_list.xpath('./xref/@database')[0], xref_list.xpath('./xref/@accession')[0], xref_list.xpath('./xref/property-list/property/@value')[0]])
    elif method is not None:
        result += method.text
    result += '\n'

print(result)

Output:

Derived from sampling site:  Peripheral blood 
Transformant: NCBI-Taxonomy, 10376, Epstein-Barr virus (EBV)
Sequence variation:  Hemizygous for FMR1 >200 CGG repeats (PubMed=25776194)
Monoclonal antibody target: UniProtKB, Q5T5X7, Human BEND3
Knockout cell: KO mouse
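
If the goal is the single ';; '-joined string from the question, the same per-comment loop can be wrapped in a small function. The following is only a sketch under assumptions: the sample markup is rebuilt from the attribute names and values mentioned in the question rather than copied from the real file, the helper names describe_comment and format_comments are made up for illustration, and surrounding whitespace is stripped so the result matches the desired output. It sticks to find/iter/get, which exist in both lxml.etree and the standard library's ElementTree, so it falls back to the latter when lxml is not installed.

```python
try:
    from lxml import etree  # the parser used in the question
except ImportError:
    # Standard-library fallback; every call used below exists in both APIs.
    import xml.etree.ElementTree as etree

# Assumed sample data, rebuilt from the values shown in the question.
SAMPLE = """
<comment-list>
  <comment category="Derived from sampling site"> Peripheral blood </comment>
  <comment category="Transformant">
    <cv-term terminology="NCBI-Taxonomy" accession="10376">Epstein-Barr virus (EBV)</cv-term>
  </comment>
  <comment category="Sequence variation"> Hemizygous for FMR1 &gt;200 CGG repeats (PubMed=25776194)</comment>
  <comment category="Monoclonal antibody target">
    <xref-list>
      <xref database="UniProtKB" accession="Q5T5X7">
        <property-list>
          <property value="Human BEND3"/>
        </property-list>
      </xref>
    </xref-list>
  </comment>
</comment-list>
"""


def describe_comment(comment):
    """Build 'category: details' for one <comment> element (hypothetical helper)."""
    cv_term = comment.find("cv-term")
    method = comment.find("method")
    xref = comment.find("xref-list/xref")
    if cv_term is not None:
        # Case 2: terminology, accession and the text of <cv-term>.
        parts = [cv_term.get("terminology"), cv_term.get("accession"),
                 (cv_term.text or "").strip()]
    elif method is not None:
        # Extra case from the answer: a <method> child takes priority over xrefs.
        parts = [(method.text or "").strip()]
    elif xref is not None:
        # Case 3: database, accession and the first property value, if any.
        prop = xref.find("property-list/property")
        parts = [xref.get("database"), xref.get("accession"),
                 prop.get("value") if prop is not None else None]
    else:
        # Case 1: plain text directly inside <comment>.
        parts = [(comment.text or "").strip()]
    details = ", ".join(p for p in parts if p)
    return f"{comment.get('category')}: {details}"


def format_comments(root):
    """Join all comments in document order, using the question's ';; ' separator."""
    return ";; ".join(describe_comment(c) for c in root.iter("comment"))


root = etree.fromstring(SAMPLE)
comments = format_comments(root)
print(comments)
```

Because each entry is built from a single comment element, the category and its details cannot drift apart, which is what happens when separate XPath result lists are concatenated and zipped afterwards. In the question's code the result could then be assigned directly, for example key_values['Comments'] = format_comments(att).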
