# Optimizing Inference

Neural networks pose unique challenges with regard to encrypted inference. Each neuron in a network applies an activation function that requires a programmable bootstrapping (PBS) operation. The latency of a single PBS depends on the bit-width of its input.

Several approaches can be used to reduce the overall latency of a neural network.

## Circuit bit-width optimization

[Quantization Aware Training](https://docs.zama.org/concrete-ml/1.1/advanced-topics/quantization) and [pruning](https://docs.zama.org/concrete-ml/1.1/advanced-topics/pruning) introduce specific hyper-parameters that influence the accumulator sizes. It is possible to choose quantization and pruning configurations that reduce the accumulator size. Varying these hyper-parameters trades latency against accuracy, as described in the [deep learning design guide](https://docs.zama.org/concrete-ml/1.1/torch_support#configuring-quantization-parameters).
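To make the link between these hyper-parameters and the accumulator size concrete, here is a minimal sketch that computes a worst-case bound on the accumulator bit-width of one neuron. The function name `accumulator_bits` and its parameters are hypothetical and not part of Concrete ML. The sketch assumes signed weights and unsigned inputs; the range actually measured on calibration data is usually smaller than this bound.

```python
def accumulator_bits(n_active_weights: int, weight_bits: int, input_bits: int) -> int:
    """Worst-case bit-width of a signed dot-product accumulator (illustrative bound)."""
    # Largest magnitude of a signed weight and of an unsigned input.
    max_weight = 2 ** (weight_bits - 1)
    max_input = 2 ** input_bits - 1
    # Worst case: every active product has maximal magnitude and the same sign.
    max_abs_sum = n_active_weights * max_weight * max_input
    # Bits needed for the magnitude, plus one sign bit.
    return max_abs_sum.bit_length() + 1


# 100 active connections, 3-bit weights and inputs.
print(accumulator_bits(100, 3, 3))  # 13
# Pruning to 20 active connections per neuron lowers the bound.
print(accumulator_bits(20, 3, 3))   # 11
# Quantizing weights and inputs to 2 bits lowers it further.
print(accumulator_bits(20, 2, 2))   # 8
```

Both levers act on the same quantity: pruning reduces the number of terms in the sum, while quantization reduces the size of each term.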

## Structured pruning

While unstructured pruning is used to keep the accumulator bit-width low, [structured pruning](https://pytorch.org/docs/stable/generated/torch.nn.utils.prune.ln_structured.html) can eliminate entire neurons from the network. Many neural networks are over-parametrized (since this makes training easier), so some neurons can be removed. Structured pruning is performed on a trained network as a fine-tuning step. For built-in neural networks, use the [prune](https://docs.zama.org/concrete-ml/1.1/developer-guide/api/concrete.ml.sklearn.base#method-prune) helper function as shown in [this example](https://github.com/zama-ai/concrete-ml/blob/release/1.1.x/docs/advanced_examples/FullyConnectedNeuralNetworkOnMNIST.ipynb). To apply structured pruning to custom models, it is recommended to use the [torch-pruning](https://github.com/VainF/Torch-Pruning) package.
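The idea behind structured pruning can be sketched in plain Python: rank the neurons of a layer by the L2 norm of their incoming weights and keep only the strongest ones. This is an illustration of the principle, not the implementation used by the `prune` helper or by torch-pruning; the function `prune_neurons` and its `keep_fraction` parameter are hypothetical.

```python
import math


def prune_neurons(weights, keep_fraction):
    """Keep the neurons (rows) with the largest L2 norm.

    weights: list of rows, one row of incoming weights per neuron.
    Returns (kept_indices, pruned_weights).
    """
    n_keep = max(1, round(len(weights) * keep_fraction))
    norms = [math.sqrt(sum(w * w for w in row)) for row in weights]
    # Rank neuron indices from largest to smallest norm.
    ranked = sorted(range(len(weights)), key=lambda i: norms[i], reverse=True)
    # Restore the original ordering among the kept neurons.
    kept = sorted(ranked[:n_keep])
    return kept, [weights[i] for i in kept]


layer = [
    [0.9, -0.8, 0.7],     # strong neuron
    [0.01, 0.02, -0.01],  # near-zero neuron
    [-0.5, 0.6, 0.4],     # strong neuron
    [0.03, -0.02, 0.0],   # near-zero neuron
]
kept, pruned = prune_neurons(layer, keep_fraction=0.5)
print(kept)  # [0, 2]
```

Each removed neuron saves one activation, and therefore one PBS, per inference. When a neuron is physically removed, the matching input column of the next layer has to be dropped as well.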

## Rounded activations and quantizers

Reducing the bit-width of the inputs to the Table Lookup (TLU) operations is a major source of latency improvements. Post-training, it is possible to leverage some properties of the fused activation and quantization functions expressed in the TLUs to further reduce the accumulator bit-width. This is achieved through the *rounded PBS* feature as described in the [rounded activations and quantizers reference](https://docs.zama.org/concrete-ml/1.1/advanced-topics/advanced_features#rounded-activations-and-quantizers). Adjusting the rounding amount, relative to the initial accumulator size, can bring large improvements in latency while maintaining accuracy.
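The effect of rounding on a single accumulator value can be sketched as follows. This is plain integer arithmetic on clear values, meant only to illustrate the principle of keeping the most significant bits; the function `round_to_msbs` and its parameters are hypothetical, and the actual feature is configured as described in the linked reference.

```python
def round_to_msbs(value: int, acc_bits: int, msbs_to_keep: int) -> int:
    """Round `value` to the nearest multiple of 2**(acc_bits - msbs_to_keep)."""
    lsbs_to_remove = max(0, acc_bits - msbs_to_keep)
    if lsbs_to_remove == 0:
        return value  # nothing to round away
    half = 1 << (lsbs_to_remove - 1)
    # Add half a step, then clear the low bits: round-half-up on the removed bits.
    return ((value + half) >> lsbs_to_remove) << lsbs_to_remove


# A 12-bit accumulator rounded to its 6 most significant bits:
print(round_to_msbs(1234, acc_bits=12, msbs_to_keep=6))  # 1216
print(round_to_msbs(1250, acc_bits=12, msbs_to_keep=6))  # 1280
```

After rounding, the table lookup only needs to distinguish roughly `2**6` input values instead of `2**12`, which is why it can run on a much smaller bit-width. The price is that nearby accumulator values become indistinguishable, so the rounding amount should be validated against accuracy.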

## TLU error probability adjustment

Finally, the TFHE scheme exposes a TLU error probability parameter that affects the crypto-system parameters, which in turn influence latency. A higher probability of TLU error results in faster computations but may reduce accuracy. One can think of the error in obtaining $$T[x]$$ as a Gaussian distribution centered on $$x$$: $$T[x]$$ is obtained with probability `1 - p_error`, while $$T[x-1]$$ and $$T[x+1]$$ are obtained with much lower probability, and so on for more distant entries. In deep neural networks, these types of errors can be tolerated up to a point. See the [`p_error` documentation for details](https://docs.zama.org/concrete-ml/1.1/advanced-topics/advanced_features#approximate-computations) and more specifically the usage example of [the API for finding the best `p_error`](https://docs.zama.org/concrete-ml/1.1/advanced-topics/advanced_features#searching-for-the-best-error-probability).
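The error model described above can be explored with a small simulation on clear values. This is a deliberately simplified sketch: an erroneous lookup lands on one of the two neighbouring entries with equal probability, whereas the real distribution is closer to a Gaussian. The functions `noisy_lookup` and `error_rate` are hypothetical and not part of any library.

```python
import random


def noisy_lookup(table, x, p_error, rng):
    """Return table[x] with probability 1 - p_error, else a neighbouring entry."""
    if rng.random() < p_error:
        x = x + rng.choice((-1, 1))
        x = min(max(x, 0), len(table) - 1)  # clamp to the valid index range
    return table[x]


def error_rate(table, p_error, n_trials=10_000, seed=0):
    """Fraction of lookups whose output differs from the exact table value."""
    rng = random.Random(seed)
    wrong = 0
    for _ in range(n_trials):
        x = rng.randrange(len(table))
        if noisy_lookup(table, x, p_error, rng) != table[x]:
            wrong += 1
    return wrong / n_trials


# A ReLU-like table over 16 quantized inputs (zero for the lower half).
relu_table = [max(0, i - 8) for i in range(16)]
print(error_rate(relu_table, p_error=0.0))  # 0.0
print(error_rate(relu_table, p_error=0.1))
```

In this toy model, the observed output error rate is lower than `p_error`, because a shifted index often maps to the same table value (for example in the flat region of the ReLU). This is one intuition for why networks can tolerate a non-trivial `p_error`, though the acceptable value should always be checked on the actual model and data.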
