⚡ Quantization

Quantization allows requests to be routed to quantized endpoints, which may deliver faster inference and lower costs, with a slight reduction in accuracy.

Because quantization affects inference behavior, this setting gives you explicit control over the model version used. You may choose to disable this feature to retain the original model for precision-critical tasks or regulatory compliance.

💡 Note: Quantized endpoints follow the same security and data retention policies as non-quantized endpoints.

Via Web Console

  1. Go to your Project Settings → Inference Section.

  2. Toggle Allow Quantization ON ✅ to allow routing to quantized endpoints, or OFF ❌ to restrict routing to non-quantized endpoints only.

Via API

You can also control quantization directly in your requests using the allow_quantization parameter:

  • true ✅ (default) → Quantized endpoints are allowed

  • false ❌ → Only non-quantized endpoints are used
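As a minimal sketch of the per-request override: the endpoint URL, model name, and message fields below are placeholders for illustration only; the allow_quantization flag is the setting documented on this page.

```python
import os
import requests

# Hypothetical endpoint and payload shape -- adapt to your actual API reference.
API_URL = "https://api.example.com/v1/chat/completions"
API_KEY = os.environ["EXAMPLE_API_KEY"]

payload = {
    "model": "example-model",  # placeholder model name
    "messages": [{"role": "user", "content": "Summarize this ticket."}],
    "allow_quantization": False,  # route only to non-quantized endpoints for this request
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
response.raise_for_status()
print(response.json())
```

Omitting the parameter (or setting it to true) leaves the default behavior in place, so quantized endpoints remain eligible for routing.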

For project-wide enforcement, you can also configure this setting via the Project Config API, ensuring all team requests follow the same policy. 🏢
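A sketch of project-wide enforcement, assuming a REST-style Project Config API: the URL, project identifier, and exact config field name are assumptions for illustration; check the Project Config API reference for the actual schema.

```python
import os
import requests

# Hypothetical project-config endpoint; replace with the URL and project ID
# from your own Project Config API reference.
PROJECT_CONFIG_URL = "https://api.example.com/v1/projects/my-project/config"
API_KEY = os.environ["EXAMPLE_API_KEY"]

response = requests.patch(
    PROJECT_CONFIG_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"allow_quantization": False},  # enforce non-quantized endpoints for all project requests
    timeout=30,
)
response.raise_for_status()
```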

When to Use Quantization

  • Reduce latency for real-time or high-throughput workloads

  • Lower inference costs at scale

  • Accept small accuracy trade-offs in exchange for performance gains

Compliance Focus

  • Quantized and non-quantized endpoints are GDPR compliant

  • Data is never used for model training

  • Quantization alters inference behavior, which may impact EU AI Act conformity assessments. Disable this setting if strict adherence to the original model behavior is required.
