Reputation: 53606
The need is to upload a file to a FastAPI endpoint, convert it to Markdown, and save the text to Redis (files are up to 4MB in size).
The only logic I have found so far is to upload the file as UploadFile, read the contents, save them to disk with the right extension, pass that path to the MarkItDown library, read that markdown file again, and then pass it to Redis. Way too much I/O. Is there a way to do all of this in memory?
(For the sake of code simplicity, I removed all error handling and I assume only text files)
from fastapi import UploadFile, File
from tempfile import NamedTemporaryFile
from markitdown import MarkItDown
import os

# (`router` and the `redis` client are defined elsewhere)

@router.post("/upload")
async def uploadPost(filepond: UploadFile = File()):
    """
    Convert a textual file to markdown.
    Store in Redis.
    """
    # Create a temporary file to save the uploaded content
    # (for the sake of simplicity I use .txt for everything)
    with NamedTemporaryFile(delete=False, suffix=".txt") as temp_file:
        temp_file_path = temp_file.name
        content = await filepond.read()
        temp_file.write(content)

    md = MarkItDown()
    result = md.convert(temp_file_path)
    redis.setex("some key", 3600, result.text_content)
    os.remove(temp_file_path)
Upvotes: 1
Views: 123
Reputation: 1
from fastapi import APIRouter, UploadFile, File
from markitdown import MarkItDown
import aioredis
import io

router = APIRouter()

# Initialize Redis client
redis = aioredis.from_url("redis://localhost", decode_responses=True)

@router.post("/upload")
async def upload_post(file: UploadFile = File(...)):
    """
    Converts an uploaded text file to Markdown and stores the result in Redis.
    """
    content = await file.read()  # Read file content into memory
    md = MarkItDown()
    # convert() expects a path or URL, not raw text, so wrap the bytes in a
    # BytesIO and use convert_stream() instead (the file_extension hint and
    # exact signature may vary across markitdown versions)
    result = md.convert_stream(io.BytesIO(content), file_extension=".txt")
    await redis.setex("markdown_content", 3600, result.text_content)  # Store in Redis with expiration
    return {"message": "File successfully processed and stored in Redis"}
Upvotes: -2
Reputation: 34551
It seems that you are limited by the library you are currently using, not FastAPI, which offers a way to get the request body in chunks as they arrive (using request.stream() instead of UploadFile); see this answer and this answer.
The library you are using includes a convert_stream() method, but it doesn't seem to do what the name implies. The stream parameter (which doesn't have a definite type) is used to read the entire contents at once and simply store them in a temporary file (essentially, similar to your current approach).
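To make that concrete, here is a minimal sketch of calling convert_stream() directly on the uploaded bytes. Note that the file_extension hint and the exact signature may vary across markitdown versions, and, as described above, the contents still end up in a temporary file internally, so this is not truly "in memory":

import io
from markitdown import MarkItDown

md = MarkItDown()
content = b"some uploaded bytes"  # e.g., the result of `await file.read()`
# convert_stream() accepts a file-like object, but simply reads it whole
# and buffers it to a temporary file before converting
result = md.convert_stream(io.BytesIO(content), file_extension=".txt")
print(result.text_content)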
Given the limitations of the library, you might still benefit from using request.stream() to write the chunks as they arrive to a NamedTemporaryFile directly (even though, with files of up to 4MB in size as you mentioned, the gain might not be that noticeable), compared to using UploadFile, which would store files larger than 1MB in a SpooledTemporaryFile whose contents you then need to read back, as explained in this answer. Hence, you would at least avoid writing to and reading from two temporary files unnecessarily, as shown in the example provided in your question. Similar examples can be found here, as well as here and here.
from fastapi import FastAPI, Request, HTTPException
from fastapi.concurrency import run_in_threadpool
from markitdown import MarkItDown
import aiofiles
import os

app = FastAPI()

@app.post('/upload')
async def upload(request: Request):
    temp_path = None
    try:
        async with aiofiles.tempfile.NamedTemporaryFile("wb", delete=False, suffix=".txt") as temp:
            temp_path = temp.name
            # Write the request body to the temporary file chunk by chunk,
            # as it arrives
            async for chunk in request.stream():
                await temp.write(chunk)
        # Have the blocking `convert` function run in an external
        # ThreadPool/ProcessPool, in order to avoid blocking the event loop
        md = MarkItDown()
        res = await run_in_threadpool(md.convert, temp_path)
        return {'markdown': res.text_content}
    except Exception:
        raise HTTPException(status_code=500, detail='Something went wrong')
    finally:
        if temp_path is not None:
            os.remove(temp_path)
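Note that, since request.stream() yields the raw request body, the endpoint above assumes the client sends the file bytes directly, rather than as multipart/form-data. A hypothetical client call (the URL and filename are placeholders):

import requests

# POST the raw file bytes (no multipart encoding), matching what the
# endpoint expects when iterating over request.stream()
with open("example.txt", "rb") as f:
    r = requests.post("http://localhost:8000/upload", data=f)
print(r.json())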
Upvotes: 1
Reputation: 45
What do you think about using tmpfs to store a temporary file in memory?
I think libraries like "memory-tempfile" are worth considering.
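For instance, a minimal sketch assuming a Linux host where /dev/shm is a tmpfs (RAM-backed) mount; the only change from the question's code is the dir argument, so MarkItDown still gets a real path to read from while the bytes never touch the disk:

from tempfile import NamedTemporaryFile
from markitdown import MarkItDown

# /dev/shm is a tmpfs mount on most Linux systems, so the temporary
# "file" lives in memory rather than on disk
with NamedTemporaryFile(suffix=".txt", dir="/dev/shm") as temp_file:
    temp_file.write(b"uploaded content goes here")  # e.g., `await filepond.read()`
    temp_file.flush()
    result = MarkItDown().convert(temp_file.name)
print(result.text_content)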
Upvotes: 0