
Quantization

Better, Faster, Stronger

FP8 and GPTQ are quantization methods that trade a slight reduction in accuracy for significant reductions in compute demand and energy consumption. Quantization is an effective way to maximize speed and minimize cost. We offer 8-bit quantized models using FP8 and 4-bit quantized models using GPTQ.
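
To build intuition for this trade-off, the sketch below simulates symmetric integer quantization of a weight matrix at 8 and 4 bits and reports the rounding error introduced. It is a hypothetical illustration only: FP8 is a floating-point format rather than an integer one, and production kernels are far more sophisticated, but the accuracy-versus-precision trade-off is analogous.

```python
import numpy as np

# Illustrative sketch (not cortecs code): symmetric linear quantization
# of a weight tensor to a given bit width, then dequantization to show
# the rounding error that quantization trades for speed and memory.
def quantize_dequantize(weights: np.ndarray, bits: int) -> np.ndarray:
    qmax = 2 ** (bits - 1) - 1                  # 127 for 8-bit, 7 for 4-bit
    scale = np.abs(weights).max() / qmax        # map weight range onto the grid
    q = np.clip(np.round(weights / scale), -qmax, qmax)  # snap to integer grid
    return q * scale                            # dequantized approximation

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
for bits in (8, 4):
    err = np.abs(w - quantize_dequantize(w, bits)).mean()
    print(f"{bits}-bit mean absolute error: {err:.5f}")
```

As expected, the 4-bit grid is coarser than the 8-bit grid, so its reconstruction error is larger; this is the same effect behind the recovery rates discussed below.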

FP8

FP8 is our recommended option for running large language models (LLMs) because of its balance of speed and quality. It roughly doubles inference speed while achieving a recovery rate of around 99.9%, meaning the output quality of FP8-quantized models is almost indistinguishable from that of their full-precision counterparts. FP8 also minimizes the computational and energy costs of inference, making it an ideal choice for applications that require both efficiency and high-quality results.
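
As a minimal sketch of what running an FP8 model can look like, the example below loads a publicly available FP8 checkpoint with vLLM. The model identifier and the use of vLLM are illustrative assumptions, not cortecs-specific instructions.

```python
# Hypothetical example: serving an FP8-quantized checkpoint with vLLM.
# The repository id below is illustrative, not a cortecs endpoint.
from vllm import LLM, SamplingParams

llm = LLM(model="neuralmagic/Meta-Llama-3-8B-Instruct-FP8")
params = SamplingParams(max_tokens=64)
outputs = llm.generate(["Explain FP8 quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```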

GPTQ

GPTQ 4-bit models let you run large models on cost-effective hardware, which makes them particularly useful for inexpensive prototyping and testing. The recovery rate for GPTQ models typically ranges from 96% to 99%.
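
For local experimentation, a 4-bit GPTQ checkpoint can be loaded with Hugging Face transformers as sketched below. The repository id is illustrative, and GPTQ support additionally requires the optimum and auto-gptq packages.

```python
# Hypothetical example: loading a 4-bit GPTQ checkpoint with Hugging Face
# transformers (requires the optimum and auto-gptq packages). The repo id
# is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "TheBloke/Llama-2-7B-Chat-GPTQ"  # illustrative 4-bit GPTQ repo
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")

inputs = tokenizer("Quantization lets you", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```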

GPTQ is more cost-effective but slower than FP8. For production, we recommend FP8 for its superior speed and quality.

Llama-3 GPTQ quantization compared with the original model across several datasets. The quantized model reaches 98% of the original model's performance while inference runs at roughly a quarter of the cost.