I'm trying to write a Python script that takes articles from the PubMed database, extracts information, and creates a SQL database. The info I need from each article is what the script below tries to extract:
import sqlite3
import xml.etree.ElementTree as ET
from glob import glob

import pandas as pd

data = []  # One dict per article

for xml_file in glob('*.xml'):
    pub = {}  # Fields of the article currently being parsed
    # Listen only for 'end' events: an element's text and children are
    # guaranteed to be complete only once its closing tag has been read
    for event, elem in ET.iterparse(xml_file, events=('end',)):
        if elem.tag == 'PMID':
            # Keep the first PMID; later ones belong to cited articles
            pub.setdefault("PMID", elem.text)
        elif elem.tag == 'ArticleTitle':
            # itertext() also collects text inside inline markup such as <i>
            pub["Title"] = "".join(elem.itertext())
        elif elem.tag == 'PubDate':
            # Take the publication year, not the revision or indexing dates
            pub["Year"] = elem.findtext('Year')
        elif elem.tag == 'ELocationID':
            # Assuming the DOI is in this tag
            if elem.get('EIdType') == 'doi':
                pub["DOI"] = elem.text
        elif elem.tag == 'Journal':
            # ElementTree has no getparent(), so read Title from its parent
            pub["JournalName"] = elem.findtext('Title')
        elif elem.tag == 'AuthorList' and "FirstAuthor" not in pub:
            first_author = elem.find('Author')
            if first_author is not None:
                names = (first_author.findtext('LastName'),
                         first_author.findtext('ForeName'))
                pub["FirstAuthor"] = ", ".join(n for n in names if n) or None
        elif elem.tag == 'AbstractText':
            # Structured abstracts have several AbstractText parts; join them
            text = "".join(elem.itertext()).strip()
            pub["Abstract"] = " ".join(p for p in (pub.get("Abstract"), text) if p)
        # The 'Content', 'Methods' and 'Results' fields are not handled here
        # because of the ambiguity in their XML representation
        elif elem.tag == 'PubmedArticle':
            data.append(pub)  # Store the finished article
            pub = {}
            elem.clear()  # Free the memory used by this article

# Build the DataFrame from the XML data
df = pd.DataFrame(data)
# Connect to (or create) the SQLite database
conn = sqlite3.connect('pubmed_articles.db')
# Write the DataFrame to the database
df.to_sql('article_data', conn, if_exists='replace', index=False)
# Close the connection
conn.close()
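For the 'Methods' and 'Results' fields left out of the script above, one possible approach is to use the `Label` attribute that PubMed puts on each `AbstractText` part of a structured abstract. The following is only a minimal sketch under that assumption: the `SAMPLE` record and the `split_abstract` helper are made up for illustration, and many real records have an unlabelled abstract, in which case these fields would simply be missing.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of a structured abstract; real records may have no Label at all
SAMPLE = """
<Abstract>
  <AbstractText Label="BACKGROUND" NlmCategory="BACKGROUND">Why we did it.</AbstractText>
  <AbstractText Label="METHODS" NlmCategory="METHODS">What we <i>did</i>.</AbstractText>
  <AbstractText Label="RESULTS" NlmCategory="RESULTS">What we found.</AbstractText>
</Abstract>
""".strip()


def split_abstract(abstract_elem):
    """Map each section label (or 'UNLABELLED') to the text of that section."""
    sections = {}
    for part in abstract_elem.iter('AbstractText'):
        label = (part.get('Label') or 'UNLABELLED').upper()
        # itertext() keeps text that sits inside inline markup such as <i>
        text = "".join(part.itertext()).strip()
        # If a label repeats, append to it instead of overwriting
        sections[label] = f"{sections[label]} {text}" if label in sections else text
    return sections


sections = split_abstract(ET.fromstring(SAMPLE))
methods = sections.get('METHODS')   # None when the abstract is not structured
results = sections.get('RESULTS')
```

In the main loop this could be called when the `Abstract` element ends, storing `methods` and `results` as two extra keys in `pub` so they become columns in the table.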
Does anyone have any idea on how to do this?
I've tried using BeautifulSoup and the R package easyPubMed, but nothing's working, or I guess I just don't really know how to do it.