GPU acceleration

This guide explains how to update an existing program to leverage GPU acceleration, or how to start a new program that uses the GPU.

TFHE-rs now supports a GPU backend implemented in CUDA, enabling integer arithmetic operations on encrypted data.

Prerequisites

Importing to your project

To use the TFHE-rs GPU backend in your project, add the following dependency to your Cargo.toml.

If you are using an x86 machine:

tfhe = { version = "0.6.4", features = [ "boolean", "shortint", "integer", "x86_64-unix", "gpu" ] }

If you are using an ARM machine:

tfhe = { version = "0.6.4", features = [ "boolean", "shortint", "integer", "aarch64-unix", "gpu" ] }
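The two dependency lines above differ only in the architecture feature. As a small illustration, the following sketch maps an architecture name to the matching feature and rebuilds the dependency line. The helper names tfhe_arch_feature and tfhe_dependency_line are hypothetical and not part of TFHE-rs; the architecture strings are assumed to be those reported by std::env::consts::ARCH.

```rust
/// Maps a CPU architecture name (as reported by `std::env::consts::ARCH`)
/// to the TFHE-rs architecture feature listed in this guide.
/// Hypothetical helper, not part of the TFHE-rs API.
pub fn tfhe_arch_feature(arch: &str) -> Option<&'static str> {
    match arch {
        "x86_64" => Some("x86_64-unix"),
        "aarch64" => Some("aarch64-unix"),
        // Any other architecture is not covered by this guide.
        _ => None,
    }
}

/// Rebuilds the Cargo.toml dependency line shown above for the given
/// architecture, or returns `None` when the architecture is not listed.
pub fn tfhe_dependency_line(arch: &str) -> Option<String> {
    let feature = tfhe_arch_feature(arch)?;
    Some(format!(
        "tfhe = {{ version = \"0.6.4\", features = [ \"boolean\", \"shortint\", \"integer\", \"{}\", \"gpu\" ] }}",
        feature
    ))
}
```

For instance, tfhe_dependency_line(std::env::consts::ARCH) returns the line for the machine running the code, or None on an architecture this guide does not list.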

Supported platforms

TFHE-rs GPU backend is supported on Linux (x86, aarch64).

OS       x86          aarch64
Linux    x86_64-unix  aarch64-unix*
macOS    Unsupported  Unsupported*
Windows  Unsupported  Unsupported

A first example

Configuring and creating keys

Compared to the CPU example, the GPU setup differs in key creation, as detailed in the Setting the keys section below.

Here is a full example (combining the client and server parts):

Setting the keys

Key configuration differs from the CPU case. Both the client key and the server key are still generated by the client (which is assumed to run on a CPU), but the server must then decompress the server key to convert it into the format used on the GPU. To do so, the server calls decompress_to_gpu() on the compressed server key.

Once the key is decompressed, operations are written identically for CPU and GPU.

Encryption

On the client side, data is encrypted exactly as on the CPU, as shown in the following example:

Computation

The server first needs to set its key with set_server_key(gpu_key).

Then, homomorphic computations are performed using the same approach as the CPU operations.

Decryption

Finally, the client decrypts the results using:

Improving performance

TFHE-rs lets you take advantage of the large number of threads available on a GPU. To maximize the number of GPU threads used, update your configuration accordingly:

Here's the complete example:

List of available operations

The GPU backend includes the following operations:

Name                   Symbol        Enc/Enc  Enc/Int
Neg                    -             ✔️       N/A
Add                    +             ✔️       ✔️
Sub                    -             ✔️       ✔️
Mul                    *             ✔️       ✔️
Div                    /             ✖️       ✖️
Rem                    %             ✖️       ✖️
Not                    !             ✔️       N/A
BitAnd                 &             ✔️       ✔️
BitOr                  |             ✔️       ✔️
BitXor                 ^             ✔️       ✔️
Shr                    >>            ✔️       ✔️
Shl                    <<            ✔️       ✔️
Rotate right           rotate_right  ✔️       ✔️
Rotate left            rotate_left   ✔️       ✔️
Min                    min           ✔️       ✔️
Max                    max           ✔️       ✔️
Greater than           gt            ✔️       ✔️
Greater than or equal  ge            ✔️       ✔️
Less than              lt            ✔️       ✔️
Less than or equal     le            ✔️       ✔️
Equal                  eq            ✔️       ✔️
Cast (into dest type)  cast_into     ✖️       N/A
Cast (from src type)   cast_from     ✖️       N/A
Ternary operator       if_then_else  ✔️       ✖️

The equivalent signed operations are also available.

All operations follow the same syntax as their CPU counterparts.
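When checking decrypted results, it can help to have clear-text counterparts of the operations listed above. The sketch below gives such references for 8-bit values using only the Rust standard library. It assumes that encrypted integers wrap around on overflow like Rust integers of the same width; the ref_* function names are hypothetical and not part of TFHE-rs.

```rust
// Clear-text reference versions of some operations from the table, on u8.
// ASSUMPTION: the encrypted types wrap on overflow like same-width Rust
// integers. The `ref_*` names are illustrative only, not TFHE-rs functions.

/// Negation modulo 2^8.
pub fn ref_neg(a: u8) -> u8 {
    a.wrapping_neg()
}

/// Addition modulo 2^8.
pub fn ref_add(a: u8, b: u8) -> u8 {
    a.wrapping_add(b)
}

/// Subtraction modulo 2^8.
pub fn ref_sub(a: u8, b: u8) -> u8 {
    a.wrapping_sub(b)
}

/// Multiplication modulo 2^8.
pub fn ref_mul(a: u8, b: u8) -> u8 {
    a.wrapping_mul(b)
}

/// Bit rotation to the left by `n` positions.
pub fn ref_rotate_left(a: u8, n: u32) -> u8 {
    a.rotate_left(n)
}

/// Bit rotation to the right by `n` positions.
pub fn ref_rotate_right(a: u8, n: u32) -> u8 {
    a.rotate_right(n)
}

/// Ternary operator: returns `a` when `cond` is true, `b` otherwise.
pub fn ref_if_then_else(cond: bool, a: u8, b: u8) -> u8 {
    if cond { a } else { b }
}
```

A decrypted result can then be compared against the matching reference, for example ref_add(clear_a, clear_b) for an encrypted addition of the same inputs.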

Benchmarks

All GPU benchmarks presented here were obtained on a single H100 GPU, and rely on the multithreaded PBS algorithm. The cryptographic parameters PARAM_GPU_MULTI_BIT_MESSAGE_2_CARRY_2_GROUP_3_KS_PBS were used.

The following table shows the performance when the inputs of the benchmarked operation are encrypted:

Operation \ Size                                    FheUint7    FheUint16   FheUint32   FheUint64   FheUint128  FheUint256
Negation (-)                                        46 ms       60 ms       75 ms       94 ms       150 ms      247 ms
Add / Sub (+, -)                                    46 ms       60 ms       75 ms       94 ms       150 ms      247 ms
Mul (*)                                             83 ms       121 ms      195 ms      456 ms      1.35 s      4.74 s
Equal / Not Equal (eq, ne)                          25 ms       26 ms       38 ms       41 ms       52 ms       78 ms
Comparisons (ge, gt, le, lt)                        46 ms       60 ms       74 ms       90 ms       109 ms      153 ms
Max / Min (max, min)                                71 ms       86 ms       101 ms      124 ms      165 ms      236 ms
Bitwise operations (&, |, ^)                        11 ms       12 ms       13 ms       15 ms       23 ms       32 ms
Left / Right Shifts (<<, >>)                        71 ms       88 ms       109 ms      180 ms      279 ms      494 ms
Left / Right Rotations (rotate_left, rotate_right)  71 ms       88 ms       109 ms      180 ms      279 ms      494 ms
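As a rough way to read the figures above, the latency of a chain of dependent operations can be estimated by summing the per-operation timings. The sketch below does this for FheUint64 with values copied from the Enc/Enc table. It is only a back-of-the-envelope model: it assumes the operations run strictly one after another and ignores data transfers and any other overhead. The constant and function names are hypothetical.

```rust
// Enc/Enc latencies in milliseconds for FheUint64, copied from the table above.
pub const MUL_64_MS: u32 = 456;
pub const ADD_64_MS: u32 = 94;
pub const CMP_64_MS: u32 = 90;

/// Estimates the latency of operations that must run one after another
/// by summing their individual latencies (in milliseconds).
/// ASSUMPTION: no overlap between operations and no extra overhead.
pub fn sequential_latency_ms(steps: &[u32]) -> u32 {
    steps.iter().sum()
}
```

Under these assumptions, evaluating (a * b + c) > d on encrypted 64-bit inputs would take about sequential_latency_ms(&[MUL_64_MS, ADD_64_MS, CMP_64_MS]), that is 640 ms.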

The following table shows the performance when the left input of the benchmarked operation is encrypted and the other is a clear scalar of the same size:

Operation \ Size                                    FheUint7    FheUint16   FheUint32   FheUint64   FheUint128  FheUint256
Add / Sub (+, -)                                    46 ms       60 ms       75 ms       94 ms       152 ms      251 ms
Mul (*)                                             67 ms       101 ms      149 ms      282 ms      727 ms      2.11 s
Equal / Not Equal (eq, ne)                          26 ms       27 ms       27 ms       41 ms       45 ms       57 ms
Comparisons (ge, gt, le, lt)                        29 ms       41 ms       54 ms       69 ms       87 ms       117 ms
Max / Min (max, min)                                53 ms       65 ms       81 ms       102 ms      142 ms      200 ms
Bitwise operations (&, |, ^)                        11 ms       13 ms       13 ms       15 ms       23 ms       32 ms
Left / Right Shifts (<<, >>)                        11 ms       12 ms       13 ms       15 ms       23 ms       32 ms
Left / Right Rotations (rotate_left, rotate_right)  11 ms       12 ms       13 ms       15 ms       23 ms       32 ms
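To compare the two tables, one can compute how much faster an operation is when its right-hand input is a clear scalar. The sketch below does this for the FheUint256 column, with timings copied from the tables above and converted to milliseconds. The names FHEUINT256_MS and scalar_speedups are hypothetical and only serve this illustration.

```rust
/// (operation, Enc/Enc latency, Enc/scalar latency) in milliseconds for
/// FheUint256, copied from the two benchmark tables.
pub const FHEUINT256_MS: [(&str, f64, f64); 4] = [
    ("add/sub", 247.0, 251.0),
    ("mul", 4740.0, 2110.0),
    ("bitwise", 32.0, 32.0),
    ("shift/rotate", 494.0, 32.0),
];

/// Returns, for each operation, the ratio of the Enc/Enc latency to the
/// Enc/scalar latency. A ratio above 1 means the scalar variant is faster.
pub fn scalar_speedups() -> Vec<(&'static str, f64)> {
    FHEUINT256_MS
        .iter()
        .map(|&(name, enc_enc, enc_scalar)| (name, enc_enc / enc_scalar))
        .collect()
}
```

With these figures, shifts and rotations benefit the most from a clear shift amount (roughly 15 times faster), multiplication is a little over twice as fast, and addition and bitwise operations are essentially unchanged.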
