Batch jobs

Everything All at Once

Dedicated inference is the best way to process massive workloads. It enables parallel processing without being constrained by rate limits.

Example

DedicatedLLM returns a LangChain ChatOpenAI object. Since LangChain supports batched inference through the LangChain Expression Language (LCEL), executing batch jobs is as simple as calling the batch() method.

with DedicatedLLM(client=cortecs, model_name='<MODEL_NAME>') as llm:
    chain = ... | llm
    summaries = chain.batch([{...} for doc in docs])

The full example below showcases the power of dynamic provisioning: we summarized 224.2k input tokens into 12.9k output tokens in 55 seconds.

from langchain_community.document_loaders import ArxivLoader
from langchain_core.prompts import ChatPromptTemplate

from cortecs_py.client import Cortecs
from cortecs_py.integrations.langchain import DedicatedLLM

cortecs = Cortecs()
loader = ArxivLoader(
    query="reasoning",
    load_max_docs=40,
    get_full_documents=True,
    doc_content_chars_max=25000,  # ~6.25k tokens, make sure the model supports this context length
    load_all_available_meta=False
)

prompt = ChatPromptTemplate.from_template("{text}\n\n Explain to me like I'm five:")
docs = loader.load()

with DedicatedLLM(client=cortecs, model_name='cortecs/phi-4-FP8-Dynamic') as llm:
    chain = prompt | llm

    print("Processing data batch-wise ...")
    summaries = chain.batch([{"text": doc.page_content} for doc in docs])
    for summary in summaries:
        print(summary.content + '-------\n\n\n')

The LLM is fully utilized during those 55 seconds (roughly 4.3k tokens processed per second), enabling superior cost efficiency without rate limits.
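
If you want to bound how many requests hit the dedicated instance at once, LangChain's batch() accepts a config with max_concurrency. Below is a minimal sketch of the same batch call with an explicit concurrency limit; the value 8 is an arbitrary example and should be chosen based on your instance size.

with DedicatedLLM(client=cortecs, model_name='cortecs/phi-4-FP8-Dynamic') as llm:
    chain = prompt | llm

    # Cap the number of concurrent requests sent to the dedicated instance.
    # max_concurrency is a standard LangChain RunnableConfig option; 8 is an example value.
    summaries = chain.batch(
        [{"text": doc.page_content} for doc in docs],
        config={"max_concurrency": 8},
    )

By default, batch() parallelizes across all inputs; an explicit limit is only needed if you want to reduce request pressure on the instance.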
