
Rename VLLM to VLLMOffline and create server-based VLLM/AsyncVLLM #1549


Open
wants to merge 1 commit into base v1.0 from create_vllm_server_model

Conversation

RobinPicard (Contributor)

Addresses #1548

For the moment, this PR only creates the vLLM server models and modifies the generator.

@RobinPicard RobinPicard requested review from rlouf and cpfiffer April 22, 2025 17:23
cpfiffer (Contributor)

Do you have an example of use here?

@RobinPicard RobinPicard force-pushed the create_vllm_server_model branch from 702975a to ec739a2 Compare April 23, 2025 08:13
RobinPicard (Contributor, Author) commented Apr 23, 2025

> Do you have an example of use here?

I was using the script below to test it @cpfiffer

Script
import asyncio
from openai import AsyncOpenAI, OpenAI
from outlines.models import from_vllm_server
from outlines.generator import Generator

from pydantic import BaseModel


class Message(BaseModel):
    role: str
    content: str


def sync_f():
    # Synchronous usage: wrap an OpenAI client pointed at the running vLLM server.
    client = OpenAI(
        base_url="http://0.0.0.0:8000/v1",
    )
    model = from_vllm_server(client)

    # Constrain generation to the Message schema defined above.
    generator = Generator(model, output_type=Message)

    result = generator(
        "Hello, world!",
        model="facebook/opt-125m",
        extra_body={"chat_template": "{% for message in messages %}{% if message['role'] == 'user' %}{{ message['content'] }}{% endif %}{% endfor %}"},
        max_tokens=10,
    )
    print("--------------- sync -----------------")
    print(result)
    print("--------------- sync streaming -----------------")
    for result in generator.stream(
        "Hello, world!",
        model="facebook/opt-125m",
        extra_body={"chat_template": "{% for message in messages %}{% if message['role'] == 'user' %}{{ message['content'] }}{% endif %}{% endfor %}"},
        max_tokens=10,
    ):
        print(result)


def async_f():
    # Asynchronous usage: the same interface, backed by an AsyncOpenAI client.
    client = AsyncOpenAI(
        base_url="http://0.0.0.0:8000/v1",
    )
    model = from_vllm_server(client)

    generator = Generator(model, output_type=Message)

    async def generate():
        result = await generator(
            "Hello, world!",
            model="facebook/opt-125m",
            extra_body={"chat_template": "{% for message in messages %}{% if message['role'] == 'user' %}{{ message['content'] }}{% endif %}{% endfor %}"},
            max_tokens=10,
        )
        print(result)
    
    print("--------------- async -----------------")
    asyncio.run(generate())

    async def generate_stream():
        # Use async for directly on the generator.stream() method
        async for result in generator.stream(
            "Hello, world!",
            model="facebook/opt-125m",
            extra_body={"chat_template": "{% for message in messages %}{% if message['role'] == 'user' %}{{ message['content'] }}{% endif %}{% endfor %}"},
            max_tokens=5,
        ):
            await asyncio.sleep(0.1)
            print(result)

    print("--------------- async streaming -----------------")
    asyncio.run(generate_stream())


if __name__ == "__main__":
    sync_f()
    async_f()

rlouf (Member) commented Apr 23, 2025

The integration of sync/async is very elegant. The only (minor) thing that bothers me is that, because we also support offline inference, we have to call the function from_vllm_server. I'm unsure how many people will use the offline version, and I'm wondering if we should drop it, if only to avoid maintaining something people are not going to use.

@rlouf rlouf linked an issue Apr 23, 2025 that may be closed by this pull request
RobinPicard (Contributor, Author)

I agree that it's annoying to have to call it from_vllm_server. I can't tell you how popular vLLM offline is, but I think we also have the option of calling the vLLM server model just VLLM and renaming the current VLLM to VLLMOffline. That way, what is most popular and what we want to put forward has the most straightforward name. (I also find it a bit counter-intuitive to call "server" what is actually the client to a remote server.)

rlouf (Member) commented Apr 23, 2025

> but I think we also have the option of calling the vLLM server model just VLLM and renaming the current VLLM to VLLMOffline.

Let's do that.
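
For illustration, here is a minimal sketch of what the agreed naming could look like from the user's side. The from_vllm constructor name for the server-backed model is an assumption made for this example; the thread only confirms the VLLM/AsyncVLLM and VLLMOffline class names and the from_vllm_offline constructor.

import outlines
from openai import OpenAI
from vllm import LLM

# Server-backed model: wraps an OpenAI-compatible client pointed at a running
# vLLM server (the `from_vllm` name is an assumption, not confirmed in this PR).
server_model = outlines.from_vllm(OpenAI(base_url="http://0.0.0.0:8000/v1"))

# Offline model: wraps an in-process vllm.LLM engine.
offline_model = outlines.from_vllm_offline(LLM("microsoft/Phi-3-mini-4k-instruct"))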

@RobinPicard RobinPicard force-pushed the create_vllm_server_model branch from ec739a2 to ea08e7f Compare April 23, 2025 15:26
import outlines
from vllm import LLM

model = outlines.from_vllm_offline(LLM("microsoft/Phi-3-mini-4k-instruct"))

Review comment on the snippet above (Contributor):

Unrelated to this PR, but how does LLM work here?
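
For reference, a minimal sketch of how vllm.LLM is typically used for offline inference, assuming vLLM's standard Python API: the model weights are loaded in-process and generation runs locally, with no server involved.

from vllm import LLM, SamplingParams

# Load the model weights in-process; no server is started.
llm = LLM("microsoft/Phi-3-mini-4k-instruct")

# Generate completions directly from the local engine.
outputs = llm.generate(["Hello, world!"], SamplingParams(max_tokens=10))
print(outputs[0].outputs[0].text)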

@RobinPicard RobinPicard force-pushed the create_vllm_server_model branch from ea08e7f to 0b9f76e Compare April 25, 2025 14:07
@RobinPicard RobinPicard marked this pull request as ready for review April 25, 2025 14:14
rlouf (Member) commented Apr 25, 2025

The PR looks good, but we have a new unrelated error in the test workflow.

@RobinPicard RobinPicard changed the title Create VLLMServer and AsyncVLLMServer models Rename VLLM to VLLMOffline and create server-based VLLM/AsyncVLLM Apr 25, 2025
Labels: None yet
Projects: None yet

Successfully merging this pull request may close these issues: Create a model to generate text through vLLM

3 participants