Introduction

Maximizing Speed, Minimizing Token Costs

Welcome to the cortecs docs! cortecs makes it easy to run dedicated language models at maximum performance.

Why dedicated inference?

Dedicated inference offers exclusive access to a specific model, ensuring that you are the sole user of the underlying compute resources. This makes it particularly suitable for applications that:

  • Need guaranteed latency

  • Have a heavy workload

  • Send many requests (no request limits)

  • Require high data security

What are LLM workers?

In some use cases, such as batched or scheduled jobs, it is useful to start compute resources and shut them down automatically once the job is finished. Dynamic provisioning lets you do exactly that: start an instance of the desired model, execute the job that needs LLM resources, and shut the instance down when it is no longer required. That way, you pay for exactly the resources you used, and not a minute more!

For working examples, see cortecs-py.
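The lifecycle itself is simple enough to sketch. The helper names below (`start_instance`, `stop_instance`, `dedicated_model`) are hypothetical stand-ins, not the cortecs-py API; they only illustrate the start → run → shut down pattern:

```python
# Minimal sketch of dynamic provisioning. All helper names here are
# hypothetical placeholders, NOT the cortecs-py API; see cortecs-py for
# real, working examples.
from contextlib import contextmanager

def start_instance(model_name: str) -> str:
    """Hypothetical: provision a dedicated worker and return its endpoint."""
    print(f"provisioning {model_name} ...")
    return "https://example.invalid/v1"  # placeholder endpoint

def stop_instance(endpoint: str) -> None:
    """Hypothetical: shut the worker down so billing stops."""
    print(f"shutting down {endpoint}")

@contextmanager
def dedicated_model(model_name: str):
    endpoint = start_instance(model_name)  # billing starts here ...
    try:
        yield endpoint                     # run the job against the endpoint
    finally:
        stop_instance(endpoint)            # ... and stops here, even on errors

if __name__ == "__main__":
    # The instance lives exactly as long as the batch job.
    with dedicated_model("meta-llama/Llama-3.3-70B-Instruct") as endpoint:
        print(f"running batch job against {endpoint}")
```

Wrapping the job in a context manager guarantees the instance is shut down even if the job raises an error, so you never keep paying for an orphaned worker.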

Which model should I use?

cortecs offers a variety of popular models. Visit our models page to explore the available options. Generally, more complex tasks call for larger models, while smaller models respond faster. For most use cases, we recommend models that support 🔵 Instant provisioning.

Don't see a model you want to use? Join our Discord to request or upvote it.

