user1492226

Pattern to validate PDF and Excel file type

I have a web app that allows users to upload attachments; however, I want to limit users to certain file types only: Adobe PDF and MS Excel. The reason is that just before the user submits the document for processing and workflow, I will aggregate some of the attachments and create a single PDF report.

I did some research, and converting DOC(X), RTF, etc. would be a headache. Plus, everybody will "in theory" get better viewing portability if the attachments are all in PDF.

Currently I am checking the MIME type -

PDF - "application/pdf"

XLS(X) -

This is working well, except I've noticed that I can take, for instance, a .docx file, change its extension to .pdf, and successfully get around this check.

To remedy that, I plan to additionally check the file's actual header.

According to this library of file signatures

PDF will have the following header -

25 50 44 46

AND it will have one of the following trailers -

So far I have the skeleton code that will perform this check -

** EDITED TO REFLECT ANSWER **

public boolean confirmAttachmentAuthenticity(ProposalDevelopmentForm form, String mimeType) {
    boolean authentic = true;
    // Case:  User is attempting to upload a "PDF" document
    if (mimeType.equals(ADOBE_PDF_CONTENT_TYPE)) {
        try {
            InputStream inputStream = form.getNewNarrative().getNarrativeFile().getInputStream();
            PdfReader pdfReader = new PdfReader(inputStream);
            int numberOfPages = pdfReader.getNumberOfPages();
            if (numberOfPages > 0) {
                // Success - valid PDF
                info(form.getNewNarrative().getNarrativeFile().getFileName() + " validated authentic Adobe PDF file");
            }
        }
        catch(IOException ioe) {
            // Failure - masquerading PDF
            authentic = false;
            info(form.getNewNarrative().getNarrativeFile().getFileName() + " is not an authentic Adobe PDF file.");
            reportError("newNarrative.narrativeFile",
                KeyConstants.ERROR_ATTACHMENT_PDF_NOT_AUTHENTIC,
                form.getNewNarrative().getNarrativeFile().getFileName());
        }
        catch (Exception e) {
            // Failure - other causes
            authentic = false;
            info(form.getNewNarrative().getNarrativeFile().getFileName() + " could not be authenticated at this time.");
            e.printStackTrace();
            reportError("newNarrative.narrativeFile",
                KeyConstants.ERROR_ATTACHMENT_TYPE_CORRUPTED,
                form.getNewNarrative().getNarrativeFile().getFileName());
        }
    }
    // Case: User is attempting to upload an "EXCEL" spreadsheet
    else {
        try {
            InputStream inputStream = form.getNewNarrative().getNarrativeFile().getInputStream();
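            // Note: POIFSFileSystem/HSSFWorkbook read only the legacy binary .xls (OLE2) format;
            // an .xlsx upload fails here and would need XSSFWorkbook or WorkbookFactory instead.
            POIFSFileSystem fileSystem = new POIFSFileSystem(inputStream);
            HSSFWorkbook workBook = new HSSFWorkbook(fileSystem);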
            int numberOfSheets = workBook.getNumberOfSheets();
            if (numberOfSheets > 0) {
                // Success - valid Excel Spreadsheet
                info(form.getNewNarrative().getNarrativeFile().getFileName() + " validated authentic MS Excel file");
            }
        }
        catch(IOException ioe) {
            // Failure - masquerading XLS(X)
            authentic = false;
            info(form.getNewNarrative().getNarrativeFile().getFileName() + " is not an authentic MS Excel file.");
            reportError("newNarrative.narrativeFile",
                KeyConstants.ERROR_ATTACHMENT_XLS_NOT_AUTHENTIC,
                form.getNewNarrative().getNarrativeFile().getFileName());
        }
        catch (Exception e) {
            // Failure - other causes
            authentic = false;
            info(form.getNewNarrative().getNarrativeFile().getFileName() + " could not be authenticated at this time.");
            e.printStackTrace();
            reportError("newNarrative.narrativeFile",
                    KeyConstants.ERROR_ATTACHMENT_TYPE_CORRUPTED,
                    form.getNewNarrative().getNarrativeFile().getFileName());
        }
    }
    return authentic;
}

I'm thinking the best approach would be to use the BinarySearch method to do this. However, I've also read some posts where people suggest converting the fileData into a String and then using regular expressions.
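
For reference, the header comparison itself needs neither a binary search nor a regular expression: the first few bytes can be compared directly against the signature. Below is a minimal sketch, assuming the signature sits at offset 0. The class name FileSignatures and its methods are hypothetical, and the two Excel signatures (OLE2 for .xls, ZIP for .xlsx) are not from the post above but are the commonly documented values. Note that .doc and .docx files share those same two container signatures, so for Excel this check only narrows things down and a library parse is still needed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper: compares the leading bytes of an upload against known signatures.
final class FileSignatures {
    // "%PDF" = 25 50 44 46 (the header quoted in the question)
    private static final byte[] PDF = {0x25, 0x50, 0x44, 0x46};
    // Assumption: OLE2 compound document header, used by legacy .xls (and .doc)
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Assumption: ZIP local file header, used by .xlsx (and .docx, .zip, .jar)
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};

    private FileSignatures() { }

    /** Reads up to n bytes from the start of the stream; returns fewer if the stream is shorter. */
    static byte[] readHead(InputStream in, int n) throws IOException {
        byte[] buf = new byte[n];
        int off = 0;
        while (off < n) {
            int r = in.read(buf, off, n - off);
            if (r < 0) {
                break; // end of stream reached before n bytes were available
            }
            off += r;
        }
        return Arrays.copyOf(buf, off);
    }

    /** True if data begins with every byte of sig, in order. */
    static boolean startsWith(byte[] data, byte[] sig) {
        if (data.length < sig.length) {
            return false;
        }
        for (int i = 0; i < sig.length; i++) {
            if (data[i] != sig[i]) {
                return false;
            }
        }
        return true;
    }

    static boolean looksLikePdf(byte[] head) {
        return startsWith(head, PDF);
    }

    /** Matches the container format only; it cannot tell a spreadsheet from a Word document. */
    static boolean looksLikeExcel(byte[] head) {
        return startsWith(head, OLE2) || startsWith(head, ZIP);
    }
}
```

In the upload handler this could be called as FileSignatures.looksLikePdf(FileSignatures.readHead(inputStream, 8)). Since readHead consumes bytes from the stream, a fresh stream would be needed before handing the file to a parser.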

Any thoughts would be appreciated.

Bonus points if you can help me start filling in my skeleton code for either case. My bitwise logic knowledge is rusty. That's what I get for writing mostly high-level client-side code for the past year.

Upvotes: 0

Views: 1723

Answers (1)

Mifmif

Reputation: 3190

Never trust incoming requests from clients: header values can be changed, and they may not reflect what is actually in the body of the request.

Instead, use a third-party library to check whether the file is a PDF, an Excel file, or something else.

To check whether a document is a PDF, try, for example, to open it using iText; for Excel, try to open it using Apache POI.

Upvotes: 1
