⚡ Quantization

Quantization allows requests to be routed to quantized endpoints, which may deliver faster inference and lower costs, with a slight reduction in accuracy.

Because quantization affects inference behavior, this setting gives you explicit control over the model version used. You may choose to disable this feature to retain the original model for precision-critical tasks or regulatory compliance.

💡 Note: Quantized endpoints follow the same security and data retention policies as non-quantized endpoints.

Via Web Console

  1. Go to your Project Settings → Inference Section.

  2. Toggle Allow Quantization ON ✅ to allow routing to quantized endpoints, or OFF ❌ to restrict routing to non-quantized endpoints only.

Via API

You can also control quantization directly in your requests using the allow_quantization parameter:

  • true ✅ (default) → Quantized endpoints are allowed

  • false ❌ → Only non-quantized endpoints are used
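As a minimal sketch of the per-request override: the endpoint URL, model name, and message fields below are placeholders for illustration only; the allow_quantization flag is the setting documented on this page.

```python
import os
import requests

# Hypothetical endpoint and payload shape -- adapt to your actual API reference.
API_URL = "https://api.example.com/v1/chat/completions"
API_KEY = os.environ["EXAMPLE_API_KEY"]

payload = {
    "model": "example-model",  # placeholder model name
    "messages": [{"role": "user", "content": "Summarize this ticket."}],
    "allow_quantization": False,  # route only to non-quantized endpoints for this request
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
response.raise_for_status()
print(response.json())
```

Omitting the parameter (or setting it to true) leaves the default behavior in place, so quantized endpoints remain eligible for routing.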

For project-wide enforcement, you can also configure this setting via the Project Config API, ensuring all team requests follow the same policy. 🏢
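A sketch of project-wide enforcement, assuming a REST-style Project Config API: the URL, project identifier, and exact config field name are assumptions for illustration; check the Project Config API reference for the actual schema.

```python
import os
import requests

# Hypothetical project-config endpoint; replace with the URL and project ID
# from your own Project Config API reference.
PROJECT_CONFIG_URL = "https://api.example.com/v1/projects/my-project/config"
API_KEY = os.environ["EXAMPLE_API_KEY"]

response = requests.patch(
    PROJECT_CONFIG_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"allow_quantization": False},  # enforce non-quantized endpoints for all project requests
    timeout=30,
)
response.raise_for_status()
```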

When to Use Quantization

  • Reduce latency for real-time or high-throughput workloads

  • Lower inference costs at scale

  • Accept small accuracy trade-offs in exchange for performance gains

Compliance Focus

  • Quantized and non-quantized endpoints are GDPR compliant

  • Data is never used for model training

  • Quantization alters inference behavior, which may impact EU AI Act conformity assessments. Disable this setting if strict adherence to the original model behavior is required.
