Rename VLLM to VLLMOffline and create server-based VLLM/AsyncVLLM #1549
base: v1.0
Conversation
Do you have an example of use here?
Force-pushed from 702975a to ec739a2
I was using the script below to test it @cpfiffer

```python
import asyncio
from openai import AsyncOpenAI, OpenAI
from outlines.models import from_vllm_server
from outlines.generator import Generator
from pydantic import BaseModel


class Message(BaseModel):
    role: str
    content: str


def sync_f():
    client = OpenAI(
        base_url="http://0.0.0.0:8000/v1",
    )
    model = from_vllm_server(client)
    generator = Generator(model, output_type=Message)

    result = generator(
        "Hello, world!",
        model="facebook/opt-125m",
        extra_body={"chat_template": "{% for message in messages %}{% if message['role'] == 'user' %}{{ message['content'] }}{% endif %}{% endfor %}"},
        max_tokens=10,
    )
    print("--------------- sync -----------------")
    print(result)

    print("--------------- sync streaming -----------------")
    for result in generator.stream(
        "Hello, world!",
        model="facebook/opt-125m",
        extra_body={"chat_template": "{% for message in messages %}{% if message['role'] == 'user' %}{{ message['content'] }}{% endif %}{% endfor %}"},
        max_tokens=10,
    ):
        print(result)


def async_f():
    client = AsyncOpenAI(
        base_url="http://0.0.0.0:8000/v1",
    )
    model = from_vllm_server(client)
    generator = Generator(model, output_type=Message)

    async def generate():
        result = await generator(
            "Hello, world!",
            model="facebook/opt-125m",
            extra_body={"chat_template": "{% for message in messages %}{% if message['role'] == 'user' %}{{ message['content'] }}{% endif %}{% endfor %}"},
            max_tokens=10,
        )
        print(result)

    print("--------------- async -----------------")
    asyncio.run(generate())

    async def generate_stream():
        # Use async for directly on the generator.stream() method
        async for result in generator.stream(
            "Hello, world!",
            model="facebook/opt-125m",
            extra_body={"chat_template": "{% for message in messages %}{% if message['role'] == 'user' %}{{ message['content'] }}{% endif %}{% endfor %}"},
            max_tokens=5,
        ):
            await asyncio.sleep(0.1)
            print(result)

    print("--------------- async streaming -----------------")
    asyncio.run(generate_stream())


if __name__ == "__main__":
    sync_f()
    async_f()
```
The integration of sync/async is very elegant. The only (minor) thing that bothers me is that, because we also support offline inference, we have to call the function `from_vllm_server`.
I agree that it's annoying to have to call it that.
Let's do that.
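For reference, the two entry points under discussion look roughly like this; this is a minimal sketch pieced together from the snippets in this thread (the server URL and model names are just the ones used above):

```python
import outlines
from openai import OpenAI
from vllm import LLM

from outlines.models import from_vllm_server

# Server-based model: wrap an OpenAI-compatible client pointed at a running vLLM server.
server_model = from_vllm_server(OpenAI(base_url="http://0.0.0.0:8000/v1"))

# Offline model: wrap an in-process vllm.LLM engine.
offline_model = outlines.from_vllm_offline(LLM("microsoft/Phi-3-mini-4k-instruct"))
```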
Force-pushed from ec739a2 to ea08e7f
```python
import outlines
from vllm import LLM

model = outlines.from_vllm_offline(LLM("microsoft/Phi-3-mini-4k-instruct"))
```
Unrelated to this PR, but how does `LLM` work here?
Force-pushed from ea08e7f to 0b9f76e
The PR looks good, but we have a new unrelated error in the test workflow.
Addresses #1548
For the moment, this PR only creates the vLLM server models and modifies the generator.
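As a rough illustration of how the sync/async split could be wired up, the server constructor could dispatch on the client type. This is a hypothetical sketch based only on the class names in the PR title, not the code actually added by this PR:

```python
# Hypothetical sketch only: the class and function bodies below are placeholders,
# not the implementation introduced by this PR.
from dataclasses import dataclass

from openai import AsyncOpenAI, OpenAI


@dataclass
class VLLM:  # server-based, synchronous (name taken from the PR title)
    client: OpenAI


@dataclass
class AsyncVLLM:  # server-based, asynchronous (name taken from the PR title)
    client: AsyncOpenAI


def from_vllm_server(client):
    """Return a sync or async server model depending on the client passed in."""
    if isinstance(client, AsyncOpenAI):
        return AsyncVLLM(client)
    if isinstance(client, OpenAI):
        return VLLM(client)
    raise TypeError("Expected an openai.OpenAI or openai.AsyncOpenAI client")
```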