Reputation: 4921
I am getting my form data from frontend and reading it using fast api as shown below:
@app.post("/file_upload")
async def upload_file(pdf: UploadFile = File(...)):
print("Content = ",pdf.content_type,pdf.filename,pdf.spool_max_size)
return {"filename": "Succcess"}
Now what I need to do is extract hyperlinks from these spool Files with the help of pypdfextractor
as shown below:
import pdfx
from os.path import exists
from config import availableUris
def getHrefsFromPDF(pdfPath:str)->dict:
if not(exists(pdfPath)):
raise FileNotFoundError("PDF File not Found")
pdf = pdfx.PDFx(pdfPath)
return pdf.get_references_as_dict().get('url',[])
But I am not sure how to convert spool file (Received from FAST API) to pdfx readable file format.
Additionally, I also tried to study the bytes that come out of the file. When I try to do this:
data = await pdf.read()
data type shows as : bytes
when I try to convert it using str
function it gives a unicoded encoded string which is totally a gibberish to me, I also tried to decode using "utf-8" which throws UnicodeDecodeError.
Upvotes: 1
Views: 202
Reputation: 5954
fastapi gives you a SpooledTemporaryFile
. You may be able to use that file object directly if there is some api in pdfx which will work on a File()
object rather than a str representing a path (!). Otherwise make a new temporary file on disk and work with that:
from tempfile import TemporaryDirectory
from pathlib import Path
import pdfx
@app.post("/file_upload")
async def upload_file(pdf: UploadFile = File(...)):
with TemporaryDirectory() as d: #Adding the file into a temporary storage for re-reading purposes
tmpf = Path(d) / "pdf.pdf"
with tmpf.open("wb") as f:
f.write(pdf.read())
p = pdfx.PDFX(str(tmpf))
...
It may be that pdfx.PDFX
will take a Path
object. I'll update this answer if so. I've kept the read-write loop synchronous for ease, but you can make it asynchronous if there is a reason to do so.
Note that it would be better to find a way of doing this with the SpooledTemporaryFile
.
As to your data showing as bytes: well, pdfs are (basically) binary files: what did you expect?
Upvotes: 1