BentoML Has a Free API: Deploy ML Models to Production in 5 Minutes

  • MyrinNew
    Senior Member
    • Feb 2024
    • 5175

    #1


    What is BentoML?

    BentoML is an open-source framework for serving machine learning models. It turns any Python ML model into a production-ready API with batching, GPU support, and Docker packaging — without writing any infrastructure code.


    Why BentoML?

    • Free and open-source — Apache 2.0 license
    • Any framework — PyTorch, TensorFlow, scikit-learn, HuggingFace, XGBoost
    • Adaptive batching — automatically batch requests for GPU efficiency
    • Docker-ready — one command to containerize
    • BentoCloud — managed deployment with free tier
    • OpenLLM — specialized serving for large language models


    Quick Start





    pip install bentoml











    # service.py
    import bentoml
    from transformers import pipeline

    @bentoml.service(
        resources={"gpu": 1, "memory": "4Gi"},
        traffic={"timeout": 60},
    )
    class SentimentAnalysis:
        def __init__(self):
            self.classifier = pipeline(
                "sentiment-analysis",
                model="distilbert-base-uncased-finetuned-sst-2-english",
                device=0,  # GPU
            )

        @bentoml.api
        def classify(self, text: str) -> dict:
            result = self.classifier(text)[0]
            return {"label": result["label"], "score": round(result["score"], 4)}

        @bentoml.api
        def batch_classify(self, texts: list[str]) -> list[dict]:
            results = self.classifier(texts)
            return [{"label": r["label"], "score": round(r["score"], 4)} for r in results]











    # Run locally
    bentoml serve service:SentimentAnalysis

    # Test
    curl -X POST http://localhost:3000/classify \
        -H 'Content-Type: application/json' \
        -d '{"text": "This product is amazing!"}'
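For callers that prefer Python to curl, the same request can be sketched with the standard library alone. This is a minimal sketch, assuming the service from service.py is listening on localhost:3000; the helper names build_classify_request and classify are hypothetical and not part of BentoML.

```python
import json
import urllib.request


def build_classify_request(text, base_url="http://localhost:3000"):
    """Build the same POST request that the curl example sends to /classify."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/classify",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(text, base_url="http://localhost:3000"):
    """Send the request and decode the JSON reply (requires a running service)."""
    request = build_classify_request(text, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, classify("This product is amazing!") should return the label/score dictionary produced by the classify endpoint.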







    Serve LLMs with OpenLLM





    pip install openllm

    # Serve a Hugging Face model (command syntax and model IDs vary by OpenLLM version; check `openllm --help`)
    openllm start meta-llama/Llama-3-8b-chat-hf

    # OpenAI-compatible endpoint at localhost:3000
    curl http://localhost:3000/v1/chat/completions \
        -H 'Content-Type: application/json' \
        -d '{"model": "meta-llama/Llama-3-8b-chat-hf", "messages": [{"role": "user", "content": "Hello!"}]}'
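Because the endpoint follows the OpenAI chat-completions format, the request body and the reply can be handled with plain dictionaries. The sketch below is an illustration under that assumption; build_chat_request and first_reply are hypothetical helper names, and the reply layout (choices, then message, then content) is the standard OpenAI chat-completion shape rather than anything specific to OpenLLM.

```python
import json
import urllib.request


def build_chat_request(model, user_message, base_url="http://localhost:3000"):
    """Build a POST to /v1/chat/completions carrying a single user message."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_reply(response_json):
    """Pull the assistant's text out of an OpenAI-style chat completion reply."""
    return response_json["choices"][0]["message"]["content"]
```

To send the request, pass the result of build_chat_request to urllib.request.urlopen, decode the body with json.loads, and hand the resulting dictionary to first_reply.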







    Adaptive Batching





    import bentoml
    import numpy as np

    @bentoml.service(traffic={"timeout": 60})
    class ImageClassifier:
        # self.model is assumed to be loaded in __init__ (not shown here)

        # In BentoML 1.2+, batching options are set on the API decorator
        @bentoml.api(
            batchable=True,
            max_batch_size=32,
            max_latency_ms=500,  # Latency budget for a batched request (500 ms)
        )
        def predict(self, images: list[np.ndarray]) -> list[str]:
            # BentoML automatically batches individual requests
            # 100 individual API calls can become as few as 4 batched GPU operations
            return self.model.predict(np.stack(images))
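To see where the "about 4 batches" figure comes from, here is a toy model of the grouping logic. It is a deliberately simplified sketch and not BentoML's actual scheduler: the function group_into_batches and its max_wait parameter are hypothetical, and a batch here simply closes when it is full or when a request arrives more than max_wait seconds after the batch's first request.

```python
def group_into_batches(arrival_times, max_batch_size, max_wait):
    """Greedily group request arrival times (in seconds) into batches.

    A batch opens with its first request and is closed as soon as it is
    full or the next request arrives more than max_wait seconds later.
    """
    batches = []
    current = []
    for t in sorted(arrival_times):
        batch_full = len(current) >= max_batch_size
        waited_too_long = bool(current) and (t - current[0] > max_wait)
        if current and (batch_full or waited_too_long):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


# 100 requests arriving 1 ms apart, batch size 32, 500 ms wait window
arrivals = [i * 0.001 for i in range(100)]
sizes = [len(batch) for batch in group_into_batches(arrivals, 32, 0.5)]
print(sizes)
```

With a burst of 100 closely spaced requests the batch-size cap is what closes each batch, so the count is ceil(100 / 32) = 4. When requests are spread further apart than the wait window, each one ends up in its own batch and batching brings no benefit.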







    Containerize and Deploy





    # Build a Bento (production package)
    bentoml build

    # Containerize
    bentoml containerize sentiment_analysis:latest

    # Run with Docker
    docker run -p 3000:3000 sentiment_analysis:latest

    # Or deploy to BentoCloud
    bentoml deploy .







    BentoML vs Alternatives

    Feature             BentoML        (not named)   (not named)    (not named)
    ML-specific         Yes            General       PyTorch only   Multi-framework
    Adaptive batching   Built-in       Manual        Built-in       Built-in
    Docker packaging    One command    Manual        Manual         Manual
    GPU management      Automatic      Manual        Automatic      Automatic
    OpenAI-compatible   Via OpenLLM    Manual        No             No
    Learning curve      Low            Low           High           Very high


    Real-World Impact

    An e-commerce company served image classification models via Flask. At 1,000 requests/sec, each image was processed individually and GPU utilization was 15%. After migrating to BentoML with adaptive batching, the same hardware handled 10,000 requests/sec at 85% GPU utilization. The company cancelled its GPU scale-out plan and saved $15K/month.





    Deploying ML models to production? I help teams build efficient serving infrastructure. Contact spinov001@gmail.com or explore my data tools on Apify.



