Batch jobs
Everything All at Once
Dedicated inference is the best way to process massive workloads. It enables parallel processing without the constraints of rate limits.
Example
DedicatedLLM returns a ChatOpenAI object from LangChain. Since LangChain supports batched inference through its LangChain Expression Language (LCEL), executing batch jobs is as simple as calling the batch() method.
with DedicatedLLM(client=cortecs, model_name='<MODEL_NAME>') as llm:
    chain = ... | llm
    summaries = chain.batch([{...} for doc in docs])
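The batch() call runs the chain over all inputs in parallel. If you want to bound the degree of parallelism, LangChain's batch() also accepts a per-call config. The sketch below reuses the placeholder model name and assumes prompt and docs are defined as in the full example further down; the concurrency cap is an arbitrary assumed value.

with DedicatedLLM(client=cortecs, model_name='<MODEL_NAME>') as llm:
    chain = prompt | llm
    # max_concurrency (assumed value) limits how many requests run at once
    summaries = chain.batch(
        [{"text": doc.page_content} for doc in docs],
        config={"max_concurrency": 8},
    )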
The full example below showcases the power of dynamic provisioning: we summarized 224.2k input tokens into 12.9k output tokens in 55 seconds.
from langchain_community.document_loaders import ArxivLoader
from langchain_core.prompts import ChatPromptTemplate

from cortecs_py.client import Cortecs
from cortecs_py.integrations.langchain import DedicatedLLM

cortecs = Cortecs()

loader = ArxivLoader(
    query="reasoning",
    load_max_docs=40,
    get_full_documents=True,
    doc_content_chars_max=25000,  # ~6.25k tokens, make sure the model supports that context length
    load_all_available_meta=False
)

prompt = ChatPromptTemplate.from_template("{text}\n\n Explain to me like I'm five:")
docs = loader.load()

with DedicatedLLM(client=cortecs, model_name='cortecs/phi-4-FP8-Dynamic') as llm:
    chain = prompt | llm

    print("Processing data batch-wise ...")
    summaries = chain.batch([{"text": doc.page_content} for doc in docs])

    for summary in summaries:
        print(summary.content + '-------\n\n\n')
The LLM can be fully utilized during those 55 seconds, enabling superior cost efficiency without rate limits.
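For a rough sense of the throughput those numbers imply, here is a quick back-of-the-envelope calculation using only the figures quoted above:

# Figures from the run described above
input_tokens = 224_200
output_tokens = 12_900
duration_s = 55

# Overall processing rate across the whole batch
total_throughput = (input_tokens + output_tokens) / duration_s   # ~4.3k tokens/s
output_rate = output_tokens / duration_s                         # ~235 tokens/s generated

print(f"Total throughput: {total_throughput:,.0f} tokens/s")
print(f"Output rate:      {output_rate:,.0f} tokens/s")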