Deploying GLM-OCR as a Serverless GPU API on Modal

I put together a small repo that deploys GLM-OCR to Modal as a PDF-to-Markdown API.

The goal was simple. I wanted something open, reasonably compact, and easy to drop into an LLM ingestion pipeline without dragging in a lot of extra infrastructure. GLM-OCR turned out to be a nice fit for that, and Modal makes it straightforward to expose as a public GPU-backed endpoint.

Here for the code?

View the repo here

What's happening

There are plenty of OCR options around, but I wanted something simple and hackable. GLM-OCR is a compact document OCR model that works through transformers and can emit structured text from page images without too much ceremony.

The repo keeps the deployment side equally small. It is just a FastAPI app and a Modal GPU class. The endpoint accepts a PDF, the worker renders each page to an image, runs OCR, and returns the result as Markdown.

That is really the whole thing.

The shape of the code

The request flow is straightforward:

  1. POST a PDF to /parse
  2. render the PDF pages with pypdfium2
  3. send each page image to GLM-OCR
  4. return the result as Markdown
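Step 2 is small enough to sketch. Here is roughly what the rendering helper could look like with pypdfium2, deferring the import inside the function the same way the repo defers transformers in the worker (the function name and `scale` value are my own choices, not necessarily what the repo uses):

```python
def render_pdf_pages(pdf_bytes: bytes, scale: float = 2.0):
    """Render each page of a PDF to a PIL image at the given scale."""
    import pypdfium2 as pdfium  # deferred, like the heavy imports in the worker

    pdf = pdfium.PdfDocument(pdf_bytes)
    try:
        for page in pdf:
            bitmap = page.render(scale=scale)
            yield bitmap.to_pil()
    finally:
        pdf.close()
```

A higher `scale` gives the OCR model more pixels to work with at the cost of slower rendering and larger inputs.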

You can hit it with curl like this:

curl -X POST "https://<your-modal-url>/parse" \
  -F "pdf=@/absolute/path/to/file.pdf"

The nice part is how little code you need around the model itself. The worker loads GLM-OCR once, and the API calls it remotely:

@app.cls(
    image=inference_image,
    gpu="L40S",
    volumes={MODEL_CACHE_PATH: hf_cache_volume},
)
class GlmOcrService:
    @modal.enter(snap=True)
    def load(self) -> None:
        from transformers import AutoModelForImageTextToText, AutoProcessor

        self.processor = AutoProcessor.from_pretrained(MODEL_NAME)
        self.model = AutoModelForImageTextToText.from_pretrained(
            pretrained_model_name_or_path=MODEL_NAME,
            torch_dtype="auto",
            device_map="auto",
        )

You can keep thinking in normal Python functions and classes, but still end up with a public serverless GPU service.
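The per-page call then follows the usual transformers chat-template flow for image-text-to-text models. A hedged sketch of what that might look like; the prompt text, `max_new_tokens`, and the function shape are illustrative, not necessarily what the repo does:

```python
def ocr_page(processor, model, page_image, prompt="Convert this page to Markdown."):
    """Run one page image through the model and return the decoded text."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": page_image},
            {"type": "text", "text": prompt},
        ],
    }]
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=2048)
    # Drop the prompt tokens so only the generated Markdown is decoded
    new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
```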

Modal is a good fit for bursty workloads like PDF processing that do not need to sit online 24/7.

In this setup I am using:

  • a FastAPI endpoint for uploads
  • an L40S GPU worker
  • a persisted Hugging Face cache so model weights do not need to be pulled every time
  • scale-to-zero after about 1 minute of inactivity

That gives you a proper hosted OCR API without having to manage a VM, container orchestration, or a permanently warm GPU.
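Most of that list maps onto a handful of decorator arguments. A config sketch of how the pieces above might fit together on the class; I am assuming Modal's `scaledown_window` is what drives the one-minute scale-to-zero here, and the values are illustrative:

```python
# Sketch of the Modal knobs behind that setup; values are illustrative.
@app.cls(
    image=inference_image,
    gpu="L40S",                                    # the GPU worker
    volumes={MODEL_CACHE_PATH: hf_cache_volume},   # persisted Hugging Face cache
    scaledown_window=60,                           # scale to zero after ~1 min idle
)
class GlmOcrService:
    ...
```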

Observed performance

The important bit is real-world behaviour, not just model-card benchmarks.

In my quick tests with a 4-page PDF:

  • cold start: 26.08s
  • warm start: 17.98s
  • warm start: 18.17s

So in practice this looks like:

  • roughly 8 seconds of cold-start overhead
  • then about 4.5 seconds per page once the worker is warm
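Those two numbers fall straight out of the timings above; a quick back-of-envelope check:

```python
cold = 26.08
warm = (17.98 + 18.17) / 2      # average of the two warm runs, 4 pages each

cold_overhead = cold - warm     # extra cost of spinning up a cold container
per_page = warm / 4             # warm per-page latency

print(round(cold_overhead, 1))  # ≈ 8.0 seconds of cold-start overhead
print(round(per_page, 2))       # ≈ 4.52 seconds per page when warm
```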

That is obviously much slower than raw model benchmark figures, but that is the nature of end-to-end API timings. You are measuring not just model inference but the whole pipeline: request handling, PDF rasterisation, image preparation, OCR, and Markdown assembly.

You can do this for almost free

As of 20 March 2026, Modal's Starter plan includes $30/month in free compute credits.

This repo uses an L40S, which is currently priced at $0.000542/sec, or about $1.95/hour. That works out to roughly 15.4 hours of GPU runtime per month before you pay anything beyond the included credit.
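The credit maths checks out in a couple of lines:

```python
price_per_sec = 0.000542          # L40S, USD per second at time of writing
credit = 30.0                     # monthly Starter-plan credit, USD

price_per_hour = price_per_sec * 3600
free_hours = credit / price_per_hour

print(round(price_per_hour, 2))   # ≈ 1.95 USD/hour
print(round(free_hours, 1))       # ≈ 15.4 GPU-hours covered by the credit
```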

You can deploy a real OCR API backed by a proper GPU, keep the code small, and experiment with it without the cost becoming a project in itself.

Cover photo by Brett Jordan on Unsplash