tom

Reputation: 307

getting weird results from metapub and pubmed

I am using the code below to get any free journal PDFs from PubMed. It does download something, but when I open it, it just consists of the number 1. Any ideas on where I am going wrong? Thank you

import metapub
from urllib.request import urlretrieve
import textract
from pathlib import Path

another_path='/content/Articles/'

pmid_list=['35566889','33538053', '30848212']

for i in range(len(pmid_list)):
  query=pmid_list[i]

#for ind in pmid_df.index:
#  query= pmid_df['PMID'][ind]

 
  url = metapub.FindIt(query).url
 
  try:
       urlretrieve(url)
       file_name = query

       out_file  = another_path + file_name

       with open(out_file, "w") as textfile:
            textfile.write(textract.process(out_file,extension='pdf',method='pdftotext',encoding="utf_8",
    ))
  except:
      continue

Upvotes: 0

Views: 300

Answers (1)

furas

Reputation: 143032

I see two mistakes.

First: urlretrieve(url) saves the data in a temporary file with a random name, so you can't access it afterwards because you don't know what that name is. You should use the second parameter to save it under a filename of your own:

urlretrieve(url, file_name)
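
As a quick illustration (a standalone sketch; example.com is only a stand-in URL), compare the two calls:

from urllib.request import urlretrieve

url = 'https://www.example.com/'

# Without a second argument, urlretrieve() downloads to a temporary file
# and returns its randomly generated path together with the response headers.
tmp_path, headers = urlretrieve(url)
print(tmp_path)     # something like /tmp/tmpab12cd34

# With a filename, the download lands exactly where you expect it.
path, headers = urlretrieve(url, 'page.html')
print(path)         # page.html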

Second: you use the same out_file both to process the file (process(out_file)) and to write the result (open(out_file, 'w')). Because open(..., 'w') runs first and empties the file, textract later processes an empty file. You should process the file first and only then open it for writing:

data = textract.process(out_file, extension='pdf', method='pdftotext', encoding="utf_8")

with open(out_file, "wb") as textfile:  # save bytes
     textfile.write(data)

or you should write the result under a different name (e.g. with the extension .txt).
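
For anyone unsure why the original order leaves an empty file, here is a quick standalone demonstration (using a throwaway demo.txt) of how opening in 'w' mode truncates:

with open('demo.txt', 'w') as f:
    f.write('some content')          # demo.txt now holds text

open('demo.txt', 'w').close()        # re-opening in 'w' mode wipes the file on open

with open('demo.txt') as f:
    print(repr(f.read()))            # -> '' (nothing left to process)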


Full working example with a few other small changes:

import os
from urllib.request import urlretrieve
import metapub
import textract

#another_path = '/content/Articles/'
another_path = './'

pmid_list = ['35566889','33538053', '30848212']

for query in pmid_list:

    print('query:', query)
    
    url = metapub.FindIt(query).url
    print('url:', url)
    
    if url:
        
        try:
            out_file = os.path.join(another_path, query)
            print('out_file:', out_file)

            print('... downloading')

            urlretrieve(url, out_file + '.pdf')
    
            print('... processing')
    
            data = textract.process(out_file + '.pdf', extension='pdf', method='pdftotext', encoding="utf_8")
    
            print('... saving')
            
            with open(out_file + '.txt', "wb") as textfile:  # save bytes
                textfile.write(data)
    
            print('... OK')
            
        except Exception as ex:
            print('Exception:', ex)
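
One more note on FindIt: when no free full text is available for a PMID, .url is None, which is why the example checks if url:. In the metapub versions I have used, the FindIt object also exposes a reason attribute that explains why no URL was found - treat this short sketch as an assumption about that API:

import metapub

src = metapub.FindIt('35566889')     # one of the PMIDs from the question
if src.url:
    print('free PDF at:', src.url)
else:
    print('no free PDF:', src.reason)    # short text explaining why no URL was found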

Upvotes: 1
