Rename VLLM to VLLMOffline and create server-based VLLM/AsyncVLLM #1549
base: v1.0
Conversation
Do you have an example of use here?
Force-pushed from 702975a to ec739a2
I was using the script below to test it @cpfiffer

```python
import asyncio
from openai import AsyncOpenAI, OpenAI
from outlines.models import from_vllm_server
from outlines.generator import Generator
from pydantic import BaseModel


class Message(BaseModel):
    role: str
    content: str


def sync_f():
    client = OpenAI(
        base_url="http://0.0.0.0:8000/v1",
    )
    model = from_vllm_server(client)
    generator = Generator(model, output_type=Message)

    result = generator(
        "Hello, world!",
        model="facebook/opt-125m",
        extra_body={"chat_template": "{% for message in messages %}{% if message['role'] == 'user' %}{{ message['content'] }}{% endif %}{% endfor %}"},
        max_tokens=10,
    )
    print("--------------- sync -----------------")
    print(result)

    print("--------------- sync streaming -----------------")
    for result in generator.stream(
        "Hello, world!",
        model="facebook/opt-125m",
        extra_body={"chat_template": "{% for message in messages %}{% if message['role'] == 'user' %}{{ message['content'] }}{% endif %}{% endfor %}"},
        max_tokens=10,
    ):
        print(result)


def async_f():
    client = AsyncOpenAI(
        base_url="http://0.0.0.0:8000/v1",
    )
    model = from_vllm_server(client)
    generator = Generator(model, output_type=Message)

    async def generate():
        result = await generator(
            "Hello, world!",
            model="facebook/opt-125m",
            extra_body={"chat_template": "{% for message in messages %}{% if message['role'] == 'user' %}{{ message['content'] }}{% endif %}{% endfor %}"},
            max_tokens=10,
        )
        print(result)

    print("--------------- async -----------------")
    asyncio.run(generate())

    async def generate_stream():
        # Use async for directly on the generator.stream() method
        async for result in generator.stream(
            "Hello, world!",
            model="facebook/opt-125m",
            extra_body={"chat_template": "{% for message in messages %}{% if message['role'] == 'user' %}{{ message['content'] }}{% endif %}{% endfor %}"},
            max_tokens=5,
        ):
            await asyncio.sleep(0.1)
            print(result)

    print("--------------- async streaming -----------------")
    asyncio.run(generate_stream())


if __name__ == "__main__":
    sync_f()
    async_f()
```
The integration of sync/async is very elegant. The only (minor) thing that bothers me is that, because we also support offline inference, we have to call the function `from_vllm_server`.
I agree that it's annoying to have to call it that.
Let's do that.
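For reference, the two entry points under discussion look roughly like this; this is a minimal sketch pieced together from the snippets in this thread (the server URL and model names are just the ones used above):

```python
import outlines
from openai import OpenAI
from vllm import LLM

from outlines.models import from_vllm_server

# Server-based model: wrap an OpenAI-compatible client pointed at a running vLLM server.
server_model = from_vllm_server(OpenAI(base_url="http://0.0.0.0:8000/v1"))

# Offline model: wrap an in-process vllm.LLM engine.
offline_model = outlines.from_vllm_offline(LLM("microsoft/Phi-3-mini-4k-instruct"))
```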
Force-pushed from ec739a2 to ea08e7f
```python
import outlines
from vllm import LLM

model = outlines.from_vllm_offline(LLM("microsoft/Phi-3-mini-4k-instruct"))
```
Unrelated to this PR, but how does `LLM` work here?
Force-pushed from ea08e7f to 0b9f76e
The PR looks good, but we have a new unrelated error in the test workflow.
Addresses #1548
For the moment, this PR only creates the vLLM server models and modifies the generator.
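As a rough illustration of how the sync/async split could be wired up, the server constructor could dispatch on the client type. This is a hypothetical sketch based only on the class names in the PR title, not the code actually added by this PR:

```python
# Hypothetical sketch only: the class and function bodies below are placeholders,
# not the implementation introduced by this PR.
from dataclasses import dataclass

from openai import AsyncOpenAI, OpenAI


@dataclass
class VLLM:  # server-based, synchronous (name taken from the PR title)
    client: OpenAI


@dataclass
class AsyncVLLM:  # server-based, asynchronous (name taken from the PR title)
    client: AsyncOpenAI


def from_vllm_server(client):
    """Return a sync or async server model depending on the client passed in."""
    if isinstance(client, AsyncOpenAI):
        return AsyncVLLM(client)
    if isinstance(client, OpenAI):
        return VLLM(client)
    raise TypeError("Expected an openai.OpenAI or openai.AsyncOpenAI client")
```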