Isa Gallego

Reputation: 1

Extracting data from PubMed XML articles

I'm trying to write a Python script that takes articles from the PubMed database, extracts information from them, and creates a SQL database. The information I need from the articles is shown in the script below:

import pandas as pd
import xml.etree.ElementTree as ET
import sqlite3
from glob import glob

def text_of(article, path):
    """Return the full text of the first element matching path, or None."""
    node = article.find(path)
    # itertext() also collects text nested inside inline markup such as <i>
    return ''.join(node.itertext()).strip() if node is not None else None


data = []  # One dictionary per article
for xml_file in glob('*.xml'):
    # Listen only for 'end' events: on 'start' the element's text and
    # children have not been parsed yet, so elem.text may still be None.
    for _, elem in ET.iterparse(xml_file, events=('end',)):
        if elem.tag != 'PubmedArticle':
            continue

        pub = {
            "PMID": text_of(elem, 'MedlineCitation/PMID'),
            "Title": text_of(elem, './/ArticleTitle'),
            # Restricted to PubDate, since other date elements also contain a Year
            "Year": text_of(elem, './/PubDate/Year'),
            # Assuming the DOI is in the ELocationID with EIdType="doi"
            "DOI": text_of(elem, ".//ELocationID[@EIdType='doi']"),
            # ElementTree has no getparent(), so the parent goes in the path
            "JournalName": text_of(elem, './/Journal/Title'),
            # Only the first AbstractText is read here
            "Abstract": text_of(elem, './/AbstractText'),
        }

        first_author = elem.find('.//AuthorList/Author')
        if first_author is None:
            pub["FirstAuthor"] = None
        else:
            # LastName or ForeName can be missing, so skip empty parts
            names = [first_author.findtext('LastName'), first_author.findtext('ForeName')]
            pub["FirstAuthor"] = ", ".join(n for n in names if n) or None

        # The 'Content', 'Methods' and 'Results' fields are not handled here because their XML representation is ambiguous

        data.append(pub)  # Collect every article, not just the last one
        elem.clear()  # Free the finished article; clearing earlier would discard its text

# Build the DataFrame from the XML data
df = pd.DataFrame(data)

# Connect to the SQLite database, creating it if it does not exist
conn = sqlite3.connect('pubmed_articles.db')

# Write the DataFrame to the database
df.to_sql('article_data', conn, if_exists='replace', index=False)

# Close the connection
conn.close()
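
Since `if_exists='replace'` drops and recreates the table on every run, an explicit schema might be easier to work with if the articles are loaded in several batches. This is only a sketch using the standard-library `sqlite3` module: the `save_articles` helper, the column types, and the choice of `PMID` as primary key are assumptions for illustration, not something the script above already does.

```python
import sqlite3

# Column order mirrors the keys of the `pub` dictionaries built by the script.
COLUMNS = ("PMID", "Title", "Year", "DOI", "JournalName", "FirstAuthor", "Abstract")


def save_articles(conn, articles):
    """Insert article dicts into article_data, replacing rows with the same PMID.

    Hypothetical helper: the schema and the PMID primary key are assumptions.
    Returns the number of rows that were submitted.
    """
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS article_data (
            PMID TEXT PRIMARY KEY,
            Title TEXT,
            Year TEXT,
            DOI TEXT,
            JournalName TEXT,
            FirstAuthor TEXT,
            Abstract TEXT
        )
        """
    )
    placeholders = ", ".join("?" for _ in COLUMNS)
    # Missing keys become NULL instead of raising KeyError
    rows = [tuple(article.get(col) for col in COLUMNS) for article in articles]
    conn.executemany(
        f"INSERT OR REPLACE INTO article_data ({', '.join(COLUMNS)}) "
        f"VALUES ({placeholders})",
        rows,
    )
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    # In-memory database and made-up records, used only to exercise the helper.
    conn = sqlite3.connect(":memory:")
    sample = [
        {"PMID": "1", "Title": "First sample title", "Year": "2020"},
        {"PMID": "1", "Title": "Updated sample title", "Year": "2021"},
    ]
    save_articles(conn, sample)
    print(conn.execute("SELECT PMID, Title, Year FROM article_data").fetchall())
    conn.close()
```

With this layout, loading the same PMID twice keeps only the most recent row, whereas appending with pandas would store it twice.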

Does anyone have any idea on how to do this?

I've tried using BeautifulSoup and the R package easyPubMed, but nothing is working, or I guess I just don't really know how to do it.
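
On the 'Methods' and 'Results' fields mentioned in the code comment: as far as I can tell, structured abstracts in PubMed XML are split into several `AbstractText` elements, each with a `Label` attribute such as `METHODS` or `RESULTS`. Below is a minimal sketch of reading them under that assumption. The `abstract_sections` helper, the `UNLABELLED` key, and the sample record are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up record that imitates the layout of a structured PubMed abstract.
SAMPLE = """
<PubmedArticle>
  <MedlineCitation>
    <PMID>0</PMID>
    <Article>
      <Abstract>
        <AbstractText Label="METHODS">We measured <i>x</i> in samples.</AbstractText>
        <AbstractText Label="RESULTS">Values of <i>x</i> increased.</AbstractText>
      </Abstract>
    </Article>
  </MedlineCitation>
</PubmedArticle>
""".strip()


def abstract_sections(article):
    """Map each AbstractText label to its text (hypothetical helper).

    Sections without a Label attribute are stored under "UNLABELLED".
    """
    sections = {}
    for node in article.iterfind(".//Abstract/AbstractText"):
        label = node.get("Label", "UNLABELLED")
        # itertext() keeps text inside inline markup such as <i>...</i>
        text = "".join(node.itertext()).strip()
        # Join repeated labels instead of overwriting the earlier text
        sections[label] = f"{sections[label]} {text}" if label in sections else text
    return sections


if __name__ == "__main__":
    article = ET.fromstring(SAMPLE)
    sections = abstract_sections(article)
    print(sections.get("METHODS"))
    print(sections.get("RESULTS"))
```

Articles without a structured abstract would simply have no `METHODS` or `RESULTS` key, so `sections.get(...)` returns `None` for them and those columns stay empty.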

Upvotes: 0

Views: 138

Answers (0)
