I have a web app that allows users to upload attachments; however, I want to limit uploads to certain file types only: Adobe PDF and MS Excel. The reason is that just before the user submits the document for processing and workflow, I will aggregate some of the attachments and create a single PDF report.
I did some research, and converting DOC(X), RTF, etc. would be a headache. Plus, everybody will "in theory" get better viewing portability if the attachments are all in PDF.
Currently I am checking the MIME type -
PDF - "application/pdf"
XLS(X) -
"application/vnd.ms-excel"
"application/msexcel"
"application/x-msexcel"
"application/x-ms-excel"
"application/x-excel"
"application/x-dos_ms_excel"
"application/xls"
"application/x-xls"
This is working well, except I've noticed that I can take, for instance, a .docx file, change its extension to .pdf, and successfully get around this check.
To remedy that, I plan to further check the actual file's header.
According to this library of file signatures, a PDF will have the following header -
25 50 44 46 (%PDF)
AND it will have one of the following trailers -
0A 25 25 45 4F 46 (.%%EOF)
0A 25 25 45 4F 46 0A (.%%EOF.)
0D 0A 25 25 45 4F 46 0D 0A (..%%EOF..)
0D 25 25 45 4F 46 0D (.%%EOF.)
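The header and trailer check described above can be sketched with plain byte comparisons, without converting the data to a string. This is only a minimal sketch: the class name `PdfSignature`, its method names, and the 1024-byte tail window are assumptions made for illustration, and scanning the tail for `%%EOF` is a looser test than matching the four exact trailers listed (it accepts all four, plus files with extra trailing whitespace).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of the original code.
final class PdfSignature {
    // "%PDF" = 25 50 44 46
    private static final byte[] HEADER = {0x25, 0x50, 0x44, 0x46};
    // "%%EOF" = 25 25 45 4F 46
    private static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);
    // Assumption: how many trailing bytes to scan for the marker.
    private static final int TAIL_WINDOW = 1024;

    /** Returns true if the data starts with %PDF and has %%EOF near its end. */
    static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < HEADER.length + EOF_MARKER.length) {
            return false;
        }
        if (!startsWith(data, HEADER)) {
            return false;
        }
        int from = Math.max(HEADER.length, data.length - TAIL_WINDOW);
        return lastIndexOf(data, EOF_MARKER, from) >= 0;
    }

    /** Compares the first prefix.length bytes of data with prefix. */
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOfRange(data, 0, prefix.length), prefix);
    }

    /** Last position >= from at which pattern occurs in data, or -1. */
    static int lastIndexOf(byte[] data, byte[] pattern, int from) {
        for (int i = data.length - pattern.length; i >= from; i--) {
            boolean match = true;
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }
}
```

A signature match only shows that the file claims to be a PDF; a crafted file can carry both markers and still be unreadable, so this is best treated as a cheap pre-filter rather than proof.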
So far I have the skeleton code that will perform this check -
** EDITED TO REFLECT ANSWER **
public boolean confirmAttachmentAuthenticity(ProposalDevelopmentForm form, String mimeType) {
    boolean authentic = true;
    // Case: User is attempting to upload a "PDF" document
    if (mimeType.equals(ADOBE_PDF_CONTENT_TYPE)) {
        try {
            InputStream inputStream = form.getNewNarrative().getNarrativeFile().getInputStream();
            PdfReader pdfReader = new PdfReader(inputStream);
            int numberOfPages = pdfReader.getNumberOfPages();
            if (numberOfPages > 0) {
                // Success - valid PDF
                info(form.getNewNarrative().getNarrativeFile().getFileName() + " validated authentic Adobe PDF file");
            }
        }
        catch (IOException ioe) {
            // Failure - masquerading PDF
            authentic = false;
            info(form.getNewNarrative().getNarrativeFile().getFileName() + " is not an authentic Adobe PDF file.");
            reportError("newNarrative.narrativeFile",
                    KeyConstants.ERROR_ATTACHMENT_PDF_NOT_AUTHENTIC,
                    form.getNewNarrative().getNarrativeFile().getFileName());
        }
        catch (Exception e) {
            // Failure - other causes
            authentic = false;
            info(form.getNewNarrative().getNarrativeFile().getFileName() + " could not be authenticated at this time.");
            e.printStackTrace();
            reportError("newNarrative.narrativeFile",
                    KeyConstants.ERROR_ATTACHMENT_TYPE_CORRUPTED,
                    form.getNewNarrative().getNarrativeFile().getFileName());
        }
    }
    // Case: User is attempting to upload an "EXCEL" spreadsheet
    else {
        try {
            InputStream inputStream = form.getNewNarrative().getNarrativeFile().getInputStream();
            // Note: HSSFWorkbook reads only the legacy binary .xls format,
            // so an .xlsx upload is rejected by this branch.
            POIFSFileSystem fileSystem = new POIFSFileSystem(inputStream);
            HSSFWorkbook workBook = new HSSFWorkbook(fileSystem);
            int numberOfSheets = workBook.getNumberOfSheets();
            if (numberOfSheets > 0) {
                // Success - valid Excel Spreadsheet
                info(form.getNewNarrative().getNarrativeFile().getFileName() + " validated authentic MS Excel file");
            }
        }
        catch (IOException ioe) {
            // Failure - masquerading XLS(X)
            authentic = false;
            info(form.getNewNarrative().getNarrativeFile().getFileName() + " is not an authentic MS Excel file.");
            reportError("newNarrative.narrativeFile",
                    KeyConstants.ERROR_ATTACHMENT_XLS_NOT_AUTHENTIC,
                    form.getNewNarrative().getNarrativeFile().getFileName());
        }
        catch (Exception e) {
            // Failure - other causes
            authentic = false;
            info(form.getNewNarrative().getNarrativeFile().getFileName() + " could not be authenticated at this time.");
            e.printStackTrace();
            reportError("newNarrative.narrativeFile",
                    KeyConstants.ERROR_ATTACHMENT_TYPE_CORRUPTED,
                    form.getNewNarrative().getNarrativeFile().getFileName());
        }
    }
    return authentic;
}
I'm thinking the best approach would be to use the BinarySearch method to do this.
But, I've also read some posts where people have suggested converting the fileData into a string and then using regular expressions.
Any thoughts would be appreciated.
Bonus points if you can help me start filling in my skeleton code for either case. My bit-wise logic knowledge is rusty. That's what I get for coding mostly high level client side code for the past year.
Upvotes: 0
Views: 1723
Reputation: 3190
Never trust incoming requests from clients: header values can be changed and do not necessarily reflect what is in the body of the request.
Instead, use a third-party library to check whether the file is a PDF, an Excel workbook, or something else.
For example, to check whether a document is a PDF, try to open it with iText; for Excel, try to open it with Apache POI.
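As a cheap pre-check before handing an upload to a parser such as POI, the container format can be sniffed with the JDK alone. The sketch below is illustrative only: `ExcelSniffer`, `Kind`, and the method names are hypothetical, and it assumes the workbook part of an .xlsx sits at the conventional path `xl/workbook.xml`. It also shows why a signature alone is not enough: .xls shares its OLE2 header with .doc and .ppt, and .xlsx shares its ZIP header with .docx, so actually opening the file with a library remains the real test.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of the original code or of Apache POI.
final class ExcelSniffer {
    // OLE2 compound-document signature (legacy .xls, but also .doc and .ppt).
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header "PK\003\004" (.xlsx, but also .docx and any ZIP).
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};

    enum Kind { OLE2_CONTAINER, XLSX, OTHER }

    static Kind sniff(byte[] data) {
        if (hasPrefix(data, OLE2)) {
            // Could be .xls, .doc, or .ppt; only a real parser can tell.
            return Kind.OLE2_CONTAINER;
        }
        if (hasPrefix(data, ZIP) && zipContains(data, "xl/workbook.xml")) {
            return Kind.XLSX;
        }
        return Kind.OTHER;
    }

    static boolean hasPrefix(byte[] data, byte[] prefix) {
        return data != null
                && data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    /** Walks the ZIP entries and reports whether one has the given name. */
    static boolean zipContains(byte[] data, String entryName) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entryName.equals(entry.getName())) {
                    return true;
                }
            }
        } catch (IOException | IllegalArgumentException e) {
            // Malformed archive: treat it as not matching.
        }
        return false;
    }
}
```

With this sketch, a .docx renamed to .xlsx comes back as `OTHER`, because its archive holds `word/document.xml` rather than `xl/workbook.xml`.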
Upvotes: 1