Streaming inference
Streaming inference is supported out-of-the-box, allowing you to receive responses from the model in real time as they are generated. This is particularly useful for applications that require immediate feedback or need to process long responses incrementally.
Using OpenAI
To enable streaming with the OpenAI library, set the stream parameter to True.
from openai import OpenAI

client = OpenAI(api_key='<API_KEY>',
                base_url='<MODEL_URL>')

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Tell me a joke."}
    ],
    stream=True
)

# Each chunk carries an incremental piece of the reply in delta.content;
# the final chunk's content is None, so guard before printing.
for chunk in response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
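If you also need the complete reply once streaming finishes, you can accumulate the deltas as they arrive. A minimal sketch, written as a variant of the loop above (the stream can only be consumed once):

# Print each delta as it arrives and collect it into the full reply
full_reply = ""
for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta is not None:
        print(delta, end="", flush=True)
        full_reply += delta

print()  # final newline after the streamed output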
Using LangChain
Streaming is also supported in LangChain, which offers fine-grained control over how responses are consumed. For detailed usage, refer to the LangChain docs.
from langchain_openai import OpenAI

llm = OpenAI(openai_api_key='<API_KEY>',
             openai_api_base='<MODEL_URL>',
             model_name='meta-llama/Meta-Llama-3.1-8B-Instruct')

# stream() yields the response incrementally as string chunks
for chunk in llm.stream('Tell me a joke.'):
    print(chunk, end="", flush=True)
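LangChain runnables also expose an async counterpart, astream, which fits naturally inside async applications such as web handlers. A minimal sketch, assuming the same llm object as above:

import asyncio

async def main():
    # astream yields chunks asynchronously as they are generated
    async for chunk in llm.astream('Tell me a joke.'):
        print(chunk, end="", flush=True)

asyncio.run(main())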