⚡ Quantization
Quantization allows requests to be routed to quantized endpoints, which may deliver faster inference and lower costs, with a slight reduction in accuracy.
Because quantization affects inference behavior, this setting gives you explicit control over the model version used. You may choose to disable this feature to retain the original model for precision-critical tasks or regulatory compliance.
💡 Note: Quantized endpoints follow the same security and data retention policies as non-quantized endpoints.
Via Web Console
Go to your Project Settings → Inference Section.
Toggle Allow Quantization ON ✅ to enable routing to quantized endpoints.

Toggle OFF ❌ to restrict routing to non-quantized endpoints only.

Via API
You can also control quantization directly in your requests using the allow_quantization parameter:
true ✅ (default) → Quantized endpoints are allowed
false ❌ → Only non-quantized endpoints are used
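For example, a minimal sketch of a per-request override might look like the following. The base URL, model name, and payload shape are illustrative assumptions; only allow_quantization is the documented parameter.

```python
import requests

# Hypothetical request that opts out of quantized endpoints for a single call.
# The endpoint URL, model identifier, and message format are assumptions;
# `allow_quantization` is the documented parameter.
API_URL = "https://api.example.com/v1/chat/completions"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "your-model",  # hypothetical model identifier
        "messages": [{"role": "user", "content": "Hello!"}],
        "allow_quantization": False,  # route only to non-quantized endpoints
    },
    timeout=30,
)
response.raise_for_status()
print(response.json())
```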
For project-wide enforcement, you can also configure this setting via the Project Config API, ensuring all team requests follow the same policy. 🏢
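As a sketch of project-wide enforcement, the request below assumes a hypothetical Project Config API path and payload; only the allow_quantization setting name comes from this page.

```python
import requests

# Hypothetical project-wide enforcement via the Project Config API.
# The URL path, HTTP method, and project ID placeholder are assumptions;
# `allow_quantization` is the documented setting.
PROJECT_CONFIG_URL = "https://api.example.com/v1/projects/YOUR_PROJECT_ID/config"
API_KEY = "YOUR_API_KEY"

response = requests.patch(
    PROJECT_CONFIG_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"allow_quantization": False},  # restrict all project requests to non-quantized endpoints
    timeout=30,
)
response.raise_for_status()
print(response.json())
```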
When to Use Quantization
Reduce latency for real-time or high-throughput workloads
Lower inference costs at scale
Accept small accuracy trade-offs in exchange for performance gains
Compliance Focus
Both quantized and non-quantized endpoints are GDPR compliant
Data is never used for model training
Quantization alters inference behavior, which may impact EU AI Act conformity assessments. Disable this setting if strict adherence to the original model behavior is required.