user12806098

Specific data extraction from XML using Python

I want to extract specific information from data.xml. The root element, 'CaplockSet', contains more than 100 'Caplock' elements, and I only need the author information from each of them. Any help is highly appreciated.

<?xml version="1.0"?>

<CaplockSet>

<Caplock>
    <MedlineCitation Status="clonelisher" Owner="NLM">
        <PMID Version="1">32045906</PMID>
        <DateRevised>
            <Year>2020</Year>
            <Month>02</Month>
            <Day>11</Day>
        </DateRevised>
        <Article cloneModel="Print-Electronic">
            <Journal>
                <ISSN IssnType="Electronic">1423-0135</ISSN>
                <JournalIssue CitedMedium="Internet">
                    <cloneDate>
                        <Year>2020</Year>
                        <Month>Feb</Month>
                        <Day>11</Day>
                    </cloneDate>
                </JournalIssue>
                <Title>Journal of vascular research</Title>
                <ISOAbbreviation>J. Vasc. Res.</ISOAbbreviation>
            </Journal>
            <ArticleTitle>miR-96-5p Regulates Proliferation, Migration, and Apoptosis of Vascular Smooth Muscle Cell Induced by Angiotensin II via Targeting NFAT5.</ArticleTitle>
            <Pagination>
                <MedlinePgn>1-11</MedlinePgn>
            </Pagination>
            <ELocationID EIdType="doi" ValidYN="Y">10.1159/000505457</ELocationID>
            <Abstract>
                <AbstractText Label="BACKGROUND" NlmCategory="BACKGROUND">Aberrant proliferation, migration, and apoptosis of vascular smooth muscle cells (VSMCs) are major pathological phenomenon in hypertension. MicroRNAs (miRNAs/miRs) serve crucial roles in the progression of hypertension. We aimed to determine the role of miR-96-5p in the proliferation, migration, and apoptosis of VSMCs and its underlying mechanisms.</AbstractText>
                <AbstractText Label="METHODS" NlmCategory="METHODS">Angiotensin II (Ang II) was employed to treat VSMCs, and the expression of miR-96-5p was detected by RT-qPCR. Then, miR-96-5p mimic was transfected into VSMCs. Cell Counting Kit-8 assay, flow cytometry, transwell assay, and wound healing assay were applied to measure proliferation, cell cycle, and migration of VSMCs. The expression of proteins associated with proliferation, migration, and apoptosis was assessed. A luciferase reporter assay was applied to confirm the target binding between miR-96-5p and nuclear factors of activated T-cells 5 (NFAT5). Subsequently, siRNA was used to silence NFAT5, and cell proliferation, migration, and apoptosis were assessed.</AbstractText>
                <AbstractText Label="RESULTS" NlmCategory="RESULTS">The results revealed that the expression of miR-96-5p was downregulated in Ang II-induced VSMCs. MiR-96-5p overexpression inhibited cell proliferation and migration but promoted cell apoptosis, enhanced the percentages of cells in the G1 and G2 phases, and reduced those in the S phase, accompanied by changes in the expression associated proteins. NFAT5 was confirmed as a direct target of miR-96-5p. NFAT5 silencing had the same results with miR-96-5p overexpression on VSMC proliferation, migration, and apoptosis, whereas miR-96-5p inhibitor reversed these effects.</AbstractText>
                <AbstractText Label="CONCLUSIONS" NlmCategory="CONCLUSIONS">Our findings concluded that miR-96-5p could regulate proliferation, migration, and apoptosis of VSMCs induced by Ang II via targeting NFAT5.</AbstractText>
                <CopyrightInformation>© 2020 S. Karger AG, Basel.</CopyrightInformation>
            </Abstract>
            <AuthorList CompleteYN="Y">
                <Author ValidYN="Y">
                    <LastName>Tian</LastName>
                    <ForeName>Long</ForeName>
                    <Initials>L</Initials>
                    <AffiliationInfo>
                        <Affiliation>Department of Cardiology, Jiangdu People's Hospital, Yangzhou, China.</Affiliation>
                    </AffiliationInfo>
                </Author>
                <Author ValidYN="Y">
                    <LastName>Cai</LastName>
                    <ForeName>Dinghua</ForeName>
                    <Initials>D</Initials>
                    <AffiliationInfo>
                        <Affiliation>Department of Cardiology, Jiangdu People's Hospital, Yangzhou, China.</Affiliation>
                    </AffiliationInfo>
                </Author>
                <Author ValidYN="Y">
                    <LastName>Zhuang</LastName>
                    <ForeName>Derong</ForeName>
                    <Initials>D</Initials>
                    <AffiliationInfo>
                        <Affiliation>Department of Cardiology, Jiangdu People's Hospital, Yangzhou, China.</Affiliation>
                    </AffiliationInfo>
                </Author>
                <Author ValidYN="Y">
                    <LastName>Wang</LastName>
                    <ForeName>Wenyuan</ForeName>
                    <Initials>W</Initials>
                    <AffiliationInfo>
                        <Affiliation>Department of Cardiology, Jiangdu People's Hospital, Yangzhou, China.</Affiliation>
                    </AffiliationInfo>
                </Author>
                <Author ValidYN="Y">
                    <LastName>Wang</LastName>
                    <ForeName>Xuan</ForeName>
                    <Initials>X</Initials>
                    <AffiliationInfo>
                        <Affiliation>Department of Cardiology, Jiangdu People's Hospital, Yangzhou, China.</Affiliation>
                    </AffiliationInfo>
                </Author>
                <Author ValidYN="Y">
                    <LastName>Bian</LastName>
                    <ForeName>Xiaoli</ForeName>
                    <Initials>X</Initials>
                    <AffiliationInfo>
                        <Affiliation>Department of Cardiology, Jiangdu People's Hospital, Yangzhou, China.</Affiliation>
                    </AffiliationInfo>
                </Author>
                <Author ValidYN="Y">
                    <LastName>Xu</LastName>
                    <ForeName>Rui</ForeName>
                    <Initials>R</Initials>
                    <AffiliationInfo>
                        <Affiliation>Department of Nephrology, Jiangdu People's Hospital, Yangzhou, China.</Affiliation>
                    </AffiliationInfo>
                </Author>
                <Author ValidYN="Y">
                    <LastName>Wu</LastName>
                    <ForeName>Guanji</ForeName>
                    <Initials>G</Initials>
                    <AffiliationInfo>
                        <Affiliation>Department of Cardiology, Xi'an Central Hospital of Xi'an Jiaotong University, Xi'an, China, [email protected].</Affiliation>
                    </AffiliationInfo>
                </Author>
            </AuthorList>
            <Language>eng</Language>
            <clonelicationTypeList>
                <clonelicationType UI="D016428">Journal Article</clonelicationType>
            </clonelicationTypeList>
            <ArticleDate DateType="Electronic">
                <Year>2020</Year>
                <Month>02</Month>
                <Day>11</Day>
            </ArticleDate>
        </Article>
        <MedlineJournalInfo>
            <Country>Switzerland</Country>
            <MedlineTA>J Vasc Res</MedlineTA>
            <NlmUniqueID>9206092</NlmUniqueID>
            <ISSNLinking>1018-1172</ISSNLinking>
        </MedlineJournalInfo>
        <CitationSubset>IM</CitationSubset>
        <KeywordList Owner="NOTNLM">
            <Keyword MajorTopicYN="N">Migration</Keyword>
            <Keyword MajorTopicYN="N">NFAT5</Keyword>
            <Keyword MajorTopicYN="N">Proliferation</Keyword>
            <Keyword MajorTopicYN="N">Vascular smooth muscle cell</Keyword>
            <Keyword MajorTopicYN="N">miR-96-5p</Keyword>
        </KeywordList>
    </MedlineCitation>
    <CardData>
        <History>
            <CardcloneDate cloneStatus="received">
                <Year>2019</Year>
                <Month>09</Month>
                <Day>16</Day>
            </CardcloneDate>
            <CardcloneDate cloneStatus="accepted">
                <Year>2019</Year>
                <Month>12</Month>
                <Day>16</Day>
            </CardcloneDate>
            <CardcloneDate cloneStatus="entrez">
                <Year>2020</Year>
                <Month>2</Month>
                <Day>12</Day>
                <Hour>6</Hour>
                <Minute>0</Minute>
            </CardcloneDate>
            <CardcloneDate cloneStatus="Card">
                <Year>2020</Year>
                <Month>2</Month>
                <Day>12</Day>
                <Hour>6</Hour>
                <Minute>0</Minute>
            </CardcloneDate>
            <CardcloneDate cloneStatus="medline">
                <Year>2020</Year>
                <Month>2</Month>
                <Day>12</Day>
                <Hour>6</Hour>
                <Minute>0</Minute>
            </CardcloneDate>
        </History>
        <clonelicationStatus>aheadofprint</clonelicationStatus>
        <ArticleIdList>
            <ArticleId IdType="Card">32045906</ArticleId>
            <ArticleId IdType="pii">000505457</ArticleId>
            <ArticleId IdType="doi">10.1159/000505457</ArticleId>
        </ArticleIdList>
    </CardData>
</Caplock>


</CaplockSet>

I tried several approaches in my .py script but keep running into errors. One of the attempts is shown below:

import xml.etree.ElementTree as ET

mytree = ET.parse('data.xml')
myroot = mytree.getroot()
for x in myroot.findall('Author'):
    lastname = x.find('LastName').text
    forename = x.find('ForeName').text
    affiliation = x.find('AffiliationInfo/Affiliation').text

    print(lastname,forename,affiliation)

Error

Traceback (most recent call last):
  File "c:/Users/jeeva/Desktop/data/program.py", line 3, in <module>
    mytree = ET.parse('data/data.xml')
  File "C:\Users\jeeva\AppData\Local\Programs\Python\Python38-32\lib\xml\etree\ElementTree.py", line 1202, in parse
    tree.parse(source, parser)
  File "C:\Users\jeeva\AppData\Local\Programs\Python\Python38-32\lib\xml\etree\ElementTree.py", line 595, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: syntax error: line 2, column 21

Upvotes: 2

Views: 84

Answers (2)

balderman

Reputation: 23815

One-liner:

import xml.etree.ElementTree as ET

xml = '''<?xml version="1.0"?>
<CaplockSet>
<Caplock>
    <MedlineCitation Status="clonelisher" Owner="NLM">
        <PMID Version="1">32045906</PMID>
        <DateRevised>
            <Year>2020</Year>
            <Month>02</Month>
            <Day>11</Day>
        </DateRevised>
        <Article cloneModel="Print-Electronic">
            <Journal>
                <ISSN IssnType="Electronic">1423-0135</ISSN>
                <JournalIssue CitedMedium="Internet">
                    <cloneDate>
                        <Year>2020</Year>
                        <Month>Feb</Month>
                        <Day>11</Day>
                    </cloneDate>
                </JournalIssue>
                <Title>Journal of vascular research</Title>
                <ISOAbbreviation>J. Vasc. Res.</ISOAbbreviation>
            </Journal>
            <ArticleTitle>miR-96-5p Regulates Proliferation, Migration, and Apoptosis of Vascular Smooth Muscle Cell Induced by Angiotensin II via Targeting NFAT5.</ArticleTitle>
            <Pagination>
                <MedlinePgn>1-11</MedlinePgn>
            </Pagination>
            <ELocationID EIdType="doi" ValidYN="Y">10.1159/000505457</ELocationID>
            <Abstract>
                <AbstractText Label="BACKGROUND" NlmCategory="BACKGROUND">Aberrant proliferation, migration, and apoptosis of vascular smooth muscle cells (VSMCs) are major pathological phenomenon in hypertension. MicroRNAs (miRNAs/miRs) serve crucial roles in the progression of hypertension. We aimed to determine the role of miR-96-5p in the proliferation, migration, and apoptosis of VSMCs and its underlying mechanisms.</AbstractText>
                <AbstractText Label="METHODS" NlmCategory="METHODS">Angiotensin II (Ang II) was employed to treat VSMCs, and the expression of miR-96-5p was detected by RT-qPCR. Then, miR-96-5p mimic was transfected into VSMCs. Cell Counting Kit-8 assay, flow cytometry, transwell assay, and wound healing assay were applied to measure proliferation, cell cycle, and migration of VSMCs. The expression of proteins associated with proliferation, migration, and apoptosis was assessed. A luciferase reporter assay was applied to confirm the target binding between miR-96-5p and nuclear factors of activated T-cells 5 (NFAT5). Subsequently, siRNA was used to silence NFAT5, and cell proliferation, migration, and apoptosis were assessed.</AbstractText>
                <AbstractText Label="RESULTS" NlmCategory="RESULTS">The results revealed that the expression of miR-96-5p was downregulated in Ang II-induced VSMCs. MiR-96-5p overexpression inhibited cell proliferation and migration but promoted cell apoptosis, enhanced the percentages of cells in the G1 and G2 phases, and reduced those in the S phase, accompanied by changes in the expression associated proteins. NFAT5 was confirmed as a direct target of miR-96-5p. NFAT5 silencing had the same results with miR-96-5p overexpression on VSMC proliferation, migration, and apoptosis, whereas miR-96-5p inhibitor reversed these effects.</AbstractText>
                <AbstractText Label="CONCLUSIONS" NlmCategory="CONCLUSIONS">Our findings concluded that miR-96-5p could regulate proliferation, migration, and apoptosis of VSMCs induced by Ang II via targeting NFAT5.</AbstractText>
                <CopyrightInformation>© 2020 S. Karger AG, Basel.</CopyrightInformation>
            </Abstract>
            <AuthorList CompleteYN="Y">
                <Author ValidYN="Y">
                    <LastName>Tian</LastName>
                    <ForeName>Long</ForeName>
                    <Initials>L</Initials>
                    <AffiliationInfo>
                        <Affiliation>Department of Cardiology, Jiangdu People's Hospital, Yangzhou, China.</Affiliation>
                    </AffiliationInfo>
                </Author>
                <Author ValidYN="Y">
                    <LastName>Cai</LastName>
                    <ForeName>Dinghua</ForeName>
                    <Initials>D</Initials>
                    <AffiliationInfo>
                        <Affiliation>Department of Cardiology, Jiangdu People's Hospital, Yangzhou, China.</Affiliation>
                    </AffiliationInfo>
                </Author>
                <Author ValidYN="Y">
                    <LastName>Zhuang</LastName>
                    <ForeName>Derong</ForeName>
                    <Initials>D</Initials>
                    <AffiliationInfo>
                        <Affiliation>Department of Cardiology, Jiangdu People's Hospital, Yangzhou, China.</Affiliation>
                    </AffiliationInfo>
                </Author>
                <Author ValidYN="Y">
                    <LastName>Wang</LastName>
                    <ForeName>Wenyuan</ForeName>
                    <Initials>W</Initials>
                    <AffiliationInfo>
                        <Affiliation>Department of Cardiology, Jiangdu People's Hospital, Yangzhou, China.</Affiliation>
                    </AffiliationInfo>
                </Author>
                <Author ValidYN="Y">
                    <LastName>Wang</LastName>
                    <ForeName>Xuan</ForeName>
                    <Initials>X</Initials>
                    <AffiliationInfo>
                        <Affiliation>Department of Cardiology, Jiangdu People's Hospital, Yangzhou, China.</Affiliation>
                    </AffiliationInfo>
                </Author>
                <Author ValidYN="Y">
                    <LastName>Bian</LastName>
                    <ForeName>Xiaoli</ForeName>
                    <Initials>X</Initials>
                    <AffiliationInfo>
                        <Affiliation>Department of Cardiology, Jiangdu People's Hospital, Yangzhou, China.</Affiliation>
                    </AffiliationInfo>
                </Author>
                <Author ValidYN="Y">
                    <LastName>Xu</LastName>
                    <ForeName>Rui</ForeName>
                    <Initials>R</Initials>
                    <AffiliationInfo>
                        <Affiliation>Department of Nephrology, Jiangdu People's Hospital, Yangzhou, China.</Affiliation>
                    </AffiliationInfo>
                </Author>
                <Author ValidYN="Y">
                    <LastName>Wu</LastName>
                    <ForeName>Guanji</ForeName>
                    <Initials>G</Initials>
                    <AffiliationInfo>
                        <Affiliation>Department of Cardiology, Xi'an Central Hospital of Xi'an Jiaotong University, Xi'an, China, [email protected].</Affiliation>
                    </AffiliationInfo>
                </Author>
            </AuthorList>
            <Language>eng</Language>
            <clonelicationTypeList>
                <clonelicationType UI="D016428">Journal Article</clonelicationType>
            </clonelicationTypeList>
            <ArticleDate DateType="Electronic">
                <Year>2020</Year>
                <Month>02</Month>
                <Day>11</Day>
            </ArticleDate>
        </Article>
        <MedlineJournalInfo>
            <Country>Switzerland</Country>
            <MedlineTA>J Vasc Res</MedlineTA>
            <NlmUniqueID>9206092</NlmUniqueID>
            <ISSNLinking>1018-1172</ISSNLinking>
        </MedlineJournalInfo>
        <CitationSubset>IM</CitationSubset>
        <KeywordList Owner="NOTNLM">
            <Keyword MajorTopicYN="N">Migration</Keyword>
            <Keyword MajorTopicYN="N">NFAT5</Keyword>
            <Keyword MajorTopicYN="N">Proliferation</Keyword>
            <Keyword MajorTopicYN="N">Vascular smooth muscle cell</Keyword>
            <Keyword MajorTopicYN="N">miR-96-5p</Keyword>
        </KeywordList>
    </MedlineCitation>
    <CardData>
        <History>
            <CardcloneDate cloneStatus="received">
                <Year>2019</Year>
                <Month>09</Month>
                <Day>16</Day>
            </CardcloneDate>
            <CardcloneDate cloneStatus="accepted">
                <Year>2019</Year>
                <Month>12</Month>
                <Day>16</Day>
            </CardcloneDate>
            <CardcloneDate cloneStatus="entrez">
                <Year>2020</Year>
                <Month>2</Month>
                <Day>12</Day>
                <Hour>6</Hour>
                <Minute>0</Minute>
            </CardcloneDate>
            <CardcloneDate cloneStatus="Card">
                <Year>2020</Year>
                <Month>2</Month>
                <Day>12</Day>
                <Hour>6</Hour>
                <Minute>0</Minute>
            </CardcloneDate>
            <CardcloneDate cloneStatus="medline">
                <Year>2020</Year>
                <Month>2</Month>
                <Day>12</Day>
                <Hour>6</Hour>
                <Minute>0</Minute>
            </CardcloneDate>
        </History>
        <clonelicationStatus>aheadofprint</clonelicationStatus>
        <ArticleIdList>
            <ArticleId IdType="Card">32045906</ArticleId>
            <ArticleId IdType="pii">000505457</ArticleId>
            <ArticleId IdType="doi">10.1159/000505457</ArticleId>
        </ArticleIdList>
    </CardData>
</Caplock>
</CaplockSet>'''

root = ET.fromstring(xml)
# './/Author' matches Author elements at any depth below the root
data = [
    {
        'Affiliation': a.find('AffiliationInfo/Affiliation').text,
        'ForeName': a.find('ForeName').text,
        'LastName': a.find('LastName').text,
    }
    for a in root.findall('.//Author')
]
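
The important difference from the loop in the question is the path: `findall('Author')` only matches direct children of the root, while `.//Author` searches all descendants. Below is a minimal sketch on a trimmed-down document. The inline sample, the second author without an affiliation, and the `extract_authors` helper name are illustrative assumptions, not part of the original data. It uses `findtext` with a default so that a missing child element yields an empty string instead of raising `AttributeError`:

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for data.xml (illustrative only).
SAMPLE = """<CaplockSet>
  <Caplock>
    <AuthorList>
      <Author>
        <LastName>Tian</LastName>
        <ForeName>Long</ForeName>
        <AffiliationInfo>
          <Affiliation>Department of Cardiology</Affiliation>
        </AffiliationInfo>
      </Author>
      <Author>
        <LastName>Cai</LastName>
        <ForeName>Dinghua</ForeName>
      </Author>
    </AuthorList>
  </Caplock>
</CaplockSet>"""


def extract_authors(root):
    """Return one dict per <Author> found anywhere below root."""
    authors = []
    for author in root.findall(".//Author"):
        authors.append({
            # findtext returns the default when the child element is absent
            "LastName": author.findtext("LastName", default=""),
            "ForeName": author.findtext("ForeName", default=""),
            "Affiliation": author.findtext("AffiliationInfo/Affiliation", default=""),
        })
    return authors


root = ET.fromstring(SAMPLE)
direct_children = root.findall("Author")  # only looks at direct children of <CaplockSet>
authors = extract_authors(root)
print(len(direct_children), len(authors))
for a in authors:
    print(a["LastName"], a["ForeName"], a["Affiliation"])
```

For the real file, the same helper could be called on `ET.parse('data.xml').getroot()`, provided the file parses.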

Upvotes: 0

Mo7art

Reputation: 303

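This should work:

import xml.etree.ElementTree as ET


def find_rec(node):
    """Yield a dict of tag -> text for every Author element under node."""
    for item in node.iter():
        if item.tag == "Author":
            author_values = {}
            # item.iter() walks the Author element and all of its descendants
            for i in item.iter():
                author_values[i.tag] = i.text
            yield author_values


auth = find_rec(ET.parse('./data.xml').getroot())
for x in auth:
    print(x["LastName"], x["ForeName"], x["Affiliation"])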
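
Note that the traceback in the question is raised by `ET.parse` itself, before any search for `Author` runs, so the file on disk is probably not well-formed at the reported position. Here is a rough sketch for locating the offending line. The `show_parse_error` helper and the malformed sample are hypothetical; `ParseError.position` is the `(line, column)` tuple documented in the standard library:

```python
import xml.etree.ElementTree as ET


def show_parse_error(text):
    """Parse text; on failure return (line, column, offending line), else None."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        line, column = err.position  # 1-based line, 0-based column
        lines = text.splitlines()
        bad = lines[line - 1] if 0 < line <= len(lines) else ""
        return line, column, bad
    return None


# Deliberately malformed input (illustrative): stray text before the root element.
broken = '<?xml version="1.0"?>\nnot xml <CaplockSet></CaplockSet>'
result = show_parse_error(broken)
print(result)
```

To check the real file, its contents can be read with `open('data.xml', encoding='utf-8').read()` and passed to the helper. Whatever appears on the reported line is what needs fixing before either extraction approach can work.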

Upvotes: 2
