jss3000

Reputation: 23

How do you skip over files with no extension when downloading them?

My code correctly scours a directory of PDFs, downloads the web links embedded in those PDFs, and names the downloads sequentially with the appropriate file extension.

That said, a few of the files it downloads DON'T have an extension. In quality checks I have confirmed that all the attachments that matter are present, so these extra files are truly garbage.

Is there a way to not download them or build in a check in the code so that I don't end up with these phantom files?

#!/usr/bin/env python3
import os
import glob
import pdfx
import wget
import urllib.parse
import requests

## Accessing and Creating Six Digit File Code
pdf_dir = "./"
pdf_files = glob.glob("%s/*.pdf" % pdf_dir)

for file in pdf_files:
    ## Identify File Name and Limit to Digits
    filename = os.path.basename(file)
    newname = filename[0:6]
    
    ## Run PDFX to identify and download links
    pdf = pdfx.PDFx(filename)
    url_list = pdf.get_references_as_dict()
    attachment_counter = 1

    for x in url_list["url"]:
        if x[0:4] == "http":
            parsed_url = urllib.parse.quote(x)
            extension = os.path.splitext(x)[1]
            r = requests.get(x)
            with open('temporary', 'wb') as f:
                f.write(r.content)

            ##Concatenate File Name Once Downloaded
            os.rename('./temporary', str(newname) + '_attach' + str(attachment_counter) + str(extension))
            
            ##Increase Attachment Count
            attachment_counter += 1
    
    for x in url_list["pdf"]:
        parsed_url = urllib.parse.quote(x)
        extension = os.path.splitext(x)[1]
        r = requests.get(x)
        with open('temporary', 'wb') as f:
            f.write(r.content)

        ##Concatenate File Name Once Downloaded
        os.rename('./temporary', str(newname) + '_attach' + str(attachment_counter) + str(extension))

        ##Increase Attachment Count
        attachment_counter += 1

Upvotes: 1

Views: 133

Answers (1)

tripleee

Reputation: 189830

It's not clear which part of your code produces these "phantom" files, but anywhere you want to avoid downloading a file that has no extension, you can make the download conditional: if the component after the last slash doesn't contain a dot, do nothing.

        if '.' in x.split('/')[-1]:
            ...  # download(x), etc.
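
A slightly more defensive variant of the same check, as a sketch rather than a definitive fix: it parses the URL first, so a dot in the host name (as in `http://example.com`) or in a query string is not mistaken for a file extension. The helper names `url_extension` and `urls_with_extension` are made up for this illustration, and it assumes that "no extension" should be judged from the URL path alone.

```python
import os
from urllib.parse import urlparse


def url_extension(url):
    """Return the extension of the last path component of url, or ''."""
    # urlparse(...).path excludes the host, the ?query and the #fragment.
    path = urlparse(url).path
    return os.path.splitext(os.path.basename(path))[1]


def urls_with_extension(urls):
    """Yield (url, extension) only for URLs whose path has an extension."""
    for url in urls:
        ext = url_extension(url)
        if ext:  # skip anything that would be saved without an extension
            yield url, ext
```

In the loops from the question, this would mean iterating over `urls_with_extension(url_list["url"])` instead of `url_list["url"]` and calling `requests.get` only on what it yields, so nothing is downloaded for the extensionless links.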

Upvotes: 2
