Pattern to validate PDF and Excel file type

Question

I have a web app that allows users to upload attachments; however, I want to limit uploads to certain file types: Adobe PDF and MS Excel. The reason is that just before the user submits the document for processing and workflow, I will aggregate some of the attachments and create a single PDF report.

I did some research, and converting DOC(X), RTF, etc. would be a headache. Plus, everybody will "in theory" get better viewing portability if the attachments are all in PDF.

Currently I am checking the mime type -

PDF - "application/pdf"

XLS(X) -

"application/vnd.ms-excel"
"application/msexcel"
"application/x-msexcel"
"application/x-ms-excel"
"application/x-excel"
"application/x-dos_ms_excel"
"application/xls"
"application/x-xls"

This is working well, except I've noticed that I can take, for instance, a .docx file, change its extension to .pdf, and successfully get around this check.

To remedy that, I plan to further check the actual file's header.

According to this library of file signatures

PDF will have the following header -

25 50 44 46

AND it will have one of the following trailers -

0A 25 25 45 4F 46 (.%%EOF)
0A 25 25 45 4F 46 0A (.%%EOF.)
0D 0A 25 25 45 4F 46 0D 0A (..%%EOF..)
0D 25 25 45 4F 46 0D (.%%EOF.)
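
A minimal sketch of the header/trailer comparison described above, assuming the upload has already been read into a byte array. The class name, method names, and the 1024-byte tail window are illustrative assumptions, not part of any library. Note that `Arrays.binarySearch` looks up a single element in a sorted array, so it does not apply to finding a byte sequence; a plain byte-by-byte comparison is enough. Searching for `%%EOF` near the end covers all four trailer variants listed, since they differ only in the surrounding CR/LF bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: the class and method names are illustrative only.
class PdfSignatureCheck {
    // "%PDF" is 25 50 44 46; "%%EOF" is 25 25 45 4F 46.
    private static final byte[] PDF_HEADER = "%PDF".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);
    // Assumption: only the last 1024 bytes are searched for the trailer.
    private static final int TAIL_WINDOW = 1024;

    /** Returns true when data starts with %PDF and contains %%EOF near its end. */
    static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < PDF_HEADER.length + EOF_MARKER.length) {
            return false;
        }
        if (!matchesAt(data, 0, PDF_HEADER)) {
            return false;
        }
        // Scan backwards from the end, never overlapping the header itself.
        int earliest = Math.max(PDF_HEADER.length, data.length - TAIL_WINDOW);
        for (int i = data.length - EOF_MARKER.length; i >= earliest; i--) {
            if (matchesAt(data, i, EOF_MARKER)) {
                return true;
            }
        }
        return false;
    }

    /** Compares pattern against data starting at offset, with bounds checks. */
    private static boolean matchesAt(byte[] data, int offset, byte[] pattern) {
        if (offset < 0 || offset + pattern.length > data.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (data[offset + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A call such as `PdfSignatureCheck.looksLikePdf(bytes)` only shows that the markers are present. A file can carry both markers and still be unreadable, so this works as a cheap pre-filter rather than as proof of a valid PDF.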

So far I have the skeleton code that will perform this check -

** EDITED TO REFLECT ANSWER **

public boolean confirmAttachmentAuthenticity(ProposalDevelopmentForm form, String mimeType) {
    boolean authentic = true;
    // Case:  User is attempting to upload a "PDF" document
    if (mimeType.equals(ADOBE_PDF_CONTENT_TYPE)) {
        try {
            InputStream inputStream = form.getNewNarrative().getNarrativeFile().getInputStream();
            PdfReader pdfReader = new PdfReader(inputStream);
            int numberOfPages = pdfReader.getNumberOfPages();
            if (numberOfPages > 0) {
                // Success - valid PDF
                info(form.getNewNarrative().getNarrativeFile().getFileName() + " validated authentic Adobe PDF file");
            }
        }
        catch(IOException ioe) {
            // Failure - masquerading PDF
            authentic = false;
            info(form.getNewNarrative().getNarrativeFile().getFileName() + " is not an authentic Adobe PDF file.");
            reportError("newNarrative.narrativeFile",
                KeyConstants.ERROR_ATTACHMENT_PDF_NOT_AUTHENTIC,
                form.getNewNarrative().getNarrativeFile().getFileName());
        }
        catch (Exception e) {
            // Failure - other causes
            authentic = false;
            info(form.getNewNarrative().getNarrativeFile().getFileName() + " could not be authenticated at this time.");
            e.printStackTrace();
            reportError("newNarrative.narrativeFile",
                KeyConstants.ERROR_ATTACHMENT_TYPE_CORRUPTED,
                form.getNewNarrative().getNarrativeFile().getFileName());
        }
    }
    // Case: User is attempting to upload an "EXCEL" spreadsheet
    else {
        try {
            InputStream inputStream = form.getNewNarrative().getNarrativeFile().getInputStream();
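            // Note: POIFSFileSystem/HSSFWorkbook read only the legacy binary .xls (OLE2) format.
            // An .xlsx upload is a ZIP-based OOXML file and would need XSSFWorkbook or WorkbookFactory.
            POIFSFileSystem fileSystem = new POIFSFileSystem(inputStream);
            HSSFWorkbook workBook = new HSSFWorkbook(fileSystem);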
            int numberOfSheets = workBook.getNumberOfSheets();
            if (numberOfSheets > 0) {
                // Success - valid Excel Spreadsheet
                info(form.getNewNarrative().getNarrativeFile().getFileName() + " validated authentic MS Excel file");
            }
        }
        catch(IOException ioe) {
            // Failure - masquerading XLS(X)
            authentic = false;
            info(form.getNewNarrative().getNarrativeFile().getFileName() + " is not an authentic MS Excel file.");
            reportError("newNarrative.narrativeFile",
                KeyConstants.ERROR_ATTACHMENT_XLS_NOT_AUTHENTIC,
                form.getNewNarrative().getNarrativeFile().getFileName());
        }
        catch (Exception e) {
            // Failure - other causes
            authentic = false;
            info(form.getNewNarrative().getNarrativeFile().getFileName() + " could not be authenticated at this time.");
            e.printStackTrace();
            reportError("newNarrative.narrativeFile",
                    KeyConstants.ERROR_ATTACHMENT_TYPE_CORRUPTED,
                    form.getNewNarrative().getNarrativeFile().getFileName());
        }
    }
    return authentic;
}

I'm thinking the best approach would be to use the BinarySearch method to do this. But I've also read some posts where people have suggested converting the fileData into a string and then using regular expressions.

Any thoughts would be appreciated.

Bonus points if you can help me start filling in my skeleton code for either case. My bit-wise logic knowledge is rusty. That's what I get for coding mostly high level client side code for the past year.

Mifmif · Accepted Answer

Never trust incoming requests from clients: header values can be changed and may not reflect what is actually in the body of the request.

Instead, use a third-party library to check whether the file is a PDF, an Excel file, or something else.

For example, to check whether a document is a PDF, try to open it with iText; for Excel, try to open it with Apache POI.
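
Before handing the upload to a full parser, a cheap container check can reject obvious mismatches early. The sketch below uses only the JDK and is not a replacement for opening the file with POI. It assumes the upload is available as a byte array, and the class and method names are hypothetical. Legacy .xls files use the OLE2 compound file signature `D0 CF 11 E0 A1 B1 1A E1` (shared with .doc and .ppt, so it only narrows things down), while .xlsx files are ZIP archives that by convention contain an `xl/workbook.xml` entry, which a renamed .docx would not have.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: the class and method names are illustrative only.
class ExcelContainerCheck {
    // OLE2 compound file signature, used by legacy .xls (and also .doc, .ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Returns true when data begins with the OLE2 signature. */
    static boolean hasOle2Signature(byte[] data) {
        if (data == null || data.length < OLE2_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < OLE2_MAGIC.length; i++) {
            if (data[i] != OLE2_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Returns true when data is a readable ZIP archive containing xl/workbook.xml.
     * Assumption: the workbook part sits at its conventional path.
     */
    static boolean looksLikeXlsx(byte[] data) {
        if (data == null) {
            return false;
        }
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("xl/workbook.xml".equals(entry.getName())) {
                    return true;
                }
            }
        } catch (IOException e) {
            // Corrupt or truncated archive: treat as not an .xlsx file.
            return false;
        }
        return false;
    }
}
```

If either check passes, the file would still go through the library-based validation above; if both fail, the upload can be rejected without parsing it at all.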
