Streaming inference

Streaming inference is supported out-of-the-box, allowing you to receive responses from the model in real-time as they are generated. This feature is particularly useful for applications that require immediate feedback or need to process large amounts of data incrementally.

Using OpenAI

To enable streaming with the OpenAI library, set the stream parameter to True. The call then returns an iterator of chunks instead of a single completion.

from openai import OpenAI

client = OpenAI(api_key='<API_KEY>',
                base_url='<MODEL_URL>')

response = client.chat.completions.create(
  model="meta-llama/Meta-Llama-3.1-8B-Instruct",
  messages=[
    {"role": "user", "content": "Tell me a joke."}
  ],
  stream=True
)

for chunk in response:
    content = chunk.choices[0].delta.content
    if content is not None:  # the final chunk's delta has no content
        print(content, end="", flush=True)
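In practice you often want the full reply as well as the incremental output. The pattern below collects the non-empty deltas into one string; the chunk objects here are simple stand-ins with the same shape as those the OpenAI client yields, so the sketch runs without an API key (the real stream would come from the client.chat.completions.create call above).

from types import SimpleNamespace

def collect_stream(chunks):
    """Concatenate the non-empty delta contents of a chunk stream."""
    parts = []
    for chunk in chunks:
        content = chunk.choices[0].delta.content
        if content is not None:  # the final chunk's delta.content is None
            parts.append(content)
    return "".join(parts)

# Simulated stream (assumption: shapes mirror the real chunk objects).
fake_chunks = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=c))])
    for c in ["Why did ", "the chicken ", "cross the road?", None]
]
print(collect_stream(fake_chunks))  # Why did the chicken cross the road?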

Using LangChain

Streaming is also supported in LangChain, which offers fine-grained control over streamed responses via its stream and astream methods. For detailed usage, refer to the LangChain streaming docs.

from langchain_openai import OpenAI

llm = OpenAI(openai_api_key='<API_KEY>',
             openai_api_base='<MODEL_URL>',
             model_name='meta-llama/Meta-Llama-3.1-8B-Instruct')

for chunk in llm.stream('Tell me a joke.'):
    print(chunk, end='', flush=True)
