**Disclaimer**: I am **NOT** a cryptographer and you won’t learn how to properly use or implement AES in this post. This post explains an optimization of an algorithm that rely on precomputed values, it just happens that this algorithm is the AES cipher.

### Context

I recently looked to implement a toy AES in order to understand a bit more in depth how it works. In the process, I stumbled upon Golang’s AES implementation, which is in the crypto/aes package. It is derived from a optimised implementation in ANSI C that dates from December, 2000 and was developed by V. Rijmen (one of the original authors of AES along with J. Daemen), A. Bosselaers and P. Barreto.

This implementation relies on 4 pre-computed tables containing 256 32-bit magic values for encryption, and 4 other tables for decryption. For information, this implementation is not only used in Golang’s library, but also in mBedTLS, OpenSSL and probably in some other places.

**Note**: After finishing this post, I saw that those tables (called **T-tables**) are described in the original proposal for AES, which can be found here, in section 5.2.

This post is organised bottom-up, with as little math as possible (bitwise manipulations only). First the encryption algorithm as defined in the AES standard is presented. Then, we’ll see why pre-computed tables can be used and how to generate them, yielding to a second implementation of the algorithm: this version will be really close to Golang’s one (though I didn’t copy). Finally, we’ll see how we can generate T-tables for the decryption algorithm.

### Presentation of AES encryption

In this post, we’ll look at the core of the AES cipher : the encryption or decryption of a 16-bytes block with a given key. The first step of the process if the extension of the key according to the key expansion algorithm standardised alongside AES. This expansion algorithm derives several round keys from a given one. Depending of the size of the key, several AES parameters are adjusted: the size of the expanded key and the number of rounds.

- AES-128: Key size: 128 bits, Extended key size: 176 bytes, Nb of rounds: 10
- AES-192: Key size: 192 bits, Extended key size: 208 bytes, Nb of rounds: 12
- AES-256: Key size: 256 bits, Extended key size: 240 bytes, Nb of rounds: 14

The AES cipher relies on 4 basic operations for encryption: **AddRoundKey**, **SubBytes**, **ShiftRows** and **MixColumns**. Given an extended key and a number of rounds, the following high-level code implements AES encryption.

```
// Encrypts a 16-bytes block src using AES encryption with
// extended key xk and for nr rounds.
// It returns the encrypted block.
func encrypt(src, xk []byte, nr int) []byte {
state := make([]byte, 16)
// Initialize the state with the plaintext block
copy(state, src)
// Add the first round key to the state
AddRoundKey(xk[0:16], state)
for i:= 1; i < nr; i++ {
SubBytes(state)
ShiftRows(state)
MixColumns(state)
AddRoundKey(xk[i*16:(i+1)*16], state)
}
SubBytes(state)
ShiftRows(state)
AddRoundKey(xk[nr*16:(nr+1)*16], state)
return state
}
```

And that’s it! The underlying operations are rather simple to understand. There is an initialization that is done at the start of the algorithm, then `nr-1`

full rounds of SubBytes, ShiftRows, MixColumns and AddRoundKey. Finally, the last round is performed without the MixColumns operaion.

The algorithm relies on a state that is updated by each operation and will hold the encrypted block at the end. This state is a 16-bytes value `b0...b15`

stored in a 4 by 4 array, arranged in column-major mode:

```
State:
+----+----+-----+-----+
| b0 | b4 | b8 | b12 |
+----+----+-----+-----+
| b1 | b5 | b9 | b13 |
+----+----+-----+-----+
| b2 | b6 | b10 | b14 |
+----+----+-----+-----+
| b3 | b7 | b11 | b15 |
+----+----+-----+-----+
```

Starting from this initial state, a full iteration of the loop will be performed to present the various operations involved.

#### SubBytes

This substitutes every byte of the state using an S-Box. The construction of AES’s S-Box is detailed in many places and much better than I could do. What’s important here is that an S-box is a lookup table which transforms each byte into a matching one. The process is fully reversible with an Inverse S-Box (used for decryption), such that `S[IS[b]] = IS[S[b]] = b`

. Here is the resulting state after going through the SubBytes operation:

```
+-------+-------+--------+--------+
| S[b0] | S[b4] | S[b8] | S[b12] |
+-------+-------+--------+--------+
| S[b1] | S[b5] | S[b9] | S[b13] |
+-------+-------+--------+--------+
| S[b2] | S[b6] | S[b10] | S[b14] |
+-------+-------+--------+--------+
| S[b3] | S[b7] | S[b11] | S[b15] |
+-------+-------+--------+--------+
```

#### ShiftRows

As its name suggests, this operation works by shifting rows. The first row stays the same, the second row is rotated one cell to the left, the third row is rotated two cells to the left and the fourth row is rotated three cells to the left. After the **ShiftRows** operation, the updated state is:

```
+--------+--------+--------+--------+
| S[b0] | S[b4] | S[b8] | S[b12] |
+--------+--------+--------+--------+
| S[b5] | S[b9] | S[b13] | S[b1] |
+--------+--------+--------+--------+
| S[b10] | S[b14] | S[b2] | S[b6] |
+--------+--------+--------+--------+
| S[b15] | S[b3] | S[b7] | S[b11] |
+--------+--------+--------+--------+
```

#### MixColumns

This one is a bit tougher because it involves multiplication operations in a finite field, which behave quite differently than those we are used to. An entire Wikipedia article is dedicated to presenting this operation and how it works.

The idea is that each byte of a column in the updated state will depend of all the bytes of the same column in the old state. It works according to the following equations:

```
New state:
+----+----+-----+-----+
| r0 | r4 | r8 | r12 |
+----+----+-----+-----+
| r1 | r5 | r9 | r13 |
+----+----+-----+-----+
| r2 | r6 | r10 | r14 |
+----+----+-----+-----+
| r3 | r7 | r11 | r15 |
+----+----+-----+-----+
r0 = 2*S[b0] ^ 3*S[b5] ^ 1*S[b10] ^ 1*S[b15]
r1 = 1*S[b0] ^ 2*S[b5] ^ 3*S[b10] ^ 1*S[b15]
r2 = 1*S[b0] ^ 1*S[b5] ^ 2*S[b10] ^ 3*S[b15]
r3 = 3*S[b0] ^ 1*S[b5] ^ 1*S[b10] ^ 2*S[b15]
r4 = 2*S[b4] ^ 3*S[b9] ^ 1*S[b14] ^ 1*S[b3]
r5 = 1*S[b4] ^ 2*S[b9] ^ 3*S[b14] ^ 1*S[b3]
r6 = 1*S[b4] ^ 1*S[b9] ^ 2*S[b14] ^ 3*S[b3]
r7 = 3*S[b4] ^ 1*S[b9] ^ 1*S[b14] ^ 2*S[b3]
r8 = 2*S[b8] ^ 3*S[b13] ^ 1*S[b2] ^ 1*S[b7]
r9 = 1*S[b8] ^ 2*S[b13] ^ 3*S[b2] ^ 1*S[b7]
r10 = 1*S[b8] ^ 1*S[b13] ^ 2*S[b2] ^ 3*S[b7]
r11 = 3*S[b8] ^ 1*S[b13] ^ 1*S[b2] ^ 2*S[b7]
r12 = 2*S[b12] ^ 3*S[b1] ^ 1*S[b6] ^ 1*S[b11]
r13 = 1*S[b12] ^ 2*S[b1] ^ 3*S[b6] ^ 1*S[b11]
r14 = 1*S[b12] ^ 1*S[b1] ^ 2*S[b6] ^ 3*S[b11]
r15 = 3*S[b12] ^ 1*S[b1] ^ 1*S[b6] ^ 2*S[b11]
```

The `^`

denotes a XOR operation, while `*`

denotes the multiplication in a finite field (GF(256)). Essentially, this multiplication is such that the result of multiplying two bytes together will produce a byte, without overflowing. For example, `2*127 = 254`

and `2*128 = 27`

. For further references, arithmetic in the finite field GF(256) is detailed here. In order to perform those multiplications, it is possible to implement a generic `gmul(x,y)`

for every x and y. However, we know that we only need to multiply by 2 and 3 so we can build two lookup tables for those values. Actually, those tables can be found here.

#### AddRoundKey

The final step is to XOR the key for the current round with the state:

```
Final state:
+----------+----------+------------+------------+
| r0^xk[0] | r4^xk[4] | r8^xk[8] | r12^xk[12] |
+----------+----------+------------+------------+
| r1^xk[1] | r5^xk[5] | r9^xk[9] | r13^xk[13] |
+----------+----------+------------+------------+
| r2^xk[2] | r6^xk[6] | r10^xk[10] | r14^xk[14] |
+----------+----------+------------+------------+
| r3^xk[3] | r7^xk[7] | r11^xk[11] | r15^xk[15] |
+----------+----------+------------+------------+
```

### Implementing AES encryption

Now that all the needed operations have been presented, a naive implementation of AES encryption is presented, that will be optimized after. This is a **toy** implementation, do **NOT** use it in production code. If you ever want to use the AES cipher, use an existing implementation in a library that has been reviewed by competent people in this matter.

The first thing to consider is the representation of a state. Rather than a mere slice of bytes, we will use an implementation that was hinted on the AES standard: using four 32-bits words, one per column. Hence, a state will be an `[]uint32`

. In order to shorten the implementation of the **AddRoundKey** operation, we will also consider that the expanded key has been packed into an `[]uint32`

. The first value holds the first 4 bytes, the second value holds the next four, and so on and so forth…

As stated before, this implementation uses several lookup table: one for the S-box, one for the multiplication by 2 and one for the multiplication by 3. Those tables can be found here.

The implementation is rather straight-forward:

```
var sbox [256]byte = [256]byte{ /* ... */ }
var gmul2 [256]byte = [256]byte{ /* ... */ }
var gmul3 [256]byte = [256]byte{ /* ... */ }
// AddRoundKey XORs the round key xk with the state and puts the result in the state
func AddRoundKey(xk, state []uint32) {
state[0] ^= xk[0]
state[1] ^= xk[1]
state[2] ^= xk[2]
state[3] ^= xk[3]
}
// SubBytes replaces each byte of the state with its substitution in the S-box.
func SubBytes(state []uint32) {
for i := 0; i < 4; i++ {
state[i] = uint32(sbox[state[i]>>24])<<24 |
uint32(sbox[state[i]>>16&0xff])<<16 |
uint32(sbox[state[i]>>8&0xff])<<8 |
uint32(sbox[state[i]&0xff])
}
}
// ShiftRows modifies the state to shift rows to the left
// First row is left unchanged
// Second row is shifted by one cell
// Third row is shifted by two cells
// Fourth row is shifted by three cells
func ShiftRows(state []uint32) {
var s0, s1, s2, s3 uint32
s0 = state[0]&(0xff<<24) | state[1]&(0xff<<16) | state[2]&(0xff<<8) | state[3]&0xff
s1 = state[1]&(0xff<<24) | state[2]&(0xff<<16) | state[3]&(0xff<<8) | state[0]&0xff
s2 = state[2]&(0xff<<24) | state[3]&(0xff<<16) | state[0]&(0xff<<8) | state[1]&0xff
s3 = state[3]&(0xff<<24) | state[0]&(0xff<<16) | state[1]&(0xff<<8) | state[2]&0xff
state[0], state[1], state[2], state[3] = s0, s1, s2, s3
}
// MixColumns updates the state with the MixColumns operation of the AES. It uses
// two pre-computed tables that implement multiplication by 2 and by 3 in GF(256).
func MixColumns(state []uint32) {
var b0, b1, b2, b3, d0, d1, d2, d3 byte
for i := 0; i < 4; i++ {
b0 = byte(state[i] >> 24)
b1 = byte(state[i] >> 16)
b2 = byte(state[i] >> 8)
b3 = byte(state[i])
d0 = gmul2[b0] ^ gmul3[b1] ^ b2 ^ b3
d1 = gmul2[b1] ^ gmul3[b2] ^ b3 ^ b0
d2 = gmul2[b2] ^ gmul3[b3] ^ b0 ^ b1
d3 = gmul2[b3] ^ gmul3[b0] ^ b1 ^ b2
state[i] = uint32(d0)<<24 | uint32(d1)<<16 | uint32(d2)<<8 | uint32(d3)
}
}
```

### Optimizing the encryption

The optimisation presented here is the one used in Golang’s (but not only) standard library. It is referred as **T-tables** or **T-boxes**. The equations are presented in the proposed paper for the AES competition. However, they are not detailed in the subsequent AES standard.

The operation that can be optimised is MixColumns. As seen above, this operation transforms each column using XORs and multiplication. In general, applying MixColumns on a column c `c0,c1,c2,c3`

returns a column s `s0,s1,s2,s3`

:

```
s0 = 2*c0 ^ 3*c1 ^ 1*c2 ^ 1*c3
s1 = 1*c0 ^ 2*c1 ^ 3*c2 ^ 1*c3
s2 = 1*c0 ^ 1*c1 ^ 2*c2 ^ 3*c3
s3 = 3*c0 ^ 1*c1 ^ 1*c2 ^ 2*c3
```

Each column is an `uint32`

, the symbol `|`

will be used to denote concatenation of two bytes. In addition, the XOR operation between two bytes and the multiplication (in GF(256)) between two bytes both yield a byte without overflowing. The following simplifications can be done:

```
s = s0 | s1 | s2 | s3
s = 2*c0 ^ 3*c1 ^ 1*c2 ^ 1*c3 | <- Replace with the values above
1*c0 ^ 2*c1 ^ 3*c2 ^ 1*c3 |
1*c0 ^ 1*c1 ^ 2*c2 ^ 3*c3 |
3*c0 ^ 1*c1 ^ 1*c2 ^ 2*c3
s = (2*c0 | 1*c0 | 1*c0 | 3*c0) ^ <- Re-order terms
(3*c1 | 2*c1 | 1*c1 | 1*c1) ^
(1*c2 | 3*c2 | 2*c2 | 1*c2) ^
(1*c3 | 1*c3 | 3*c3 | 2*c3)
```

And this is it. With this formula, we see that it is possible to compute the resulting column with four 32-bits values XORed together. Each of those values is determined by one of the bytes of the original column: `(2*c0 | 1*c0 | 1*c0 | 3*c0)`

can be computed from c0, `(3*c1 | 2*c1 | 1*c1 | 1*c1)`

can be computed from c1, `(1*c2 | 3*c2 | 2*c2 | 1*c2)`

can be computed from c2 and `(1*c3 | 1*c3 | 3*c3 | 2*c3)`

can be computed from c3. As c0, c1, c2 and c3 are bytes, it is possible to pre-compute all possible values: those form the four tables te0, te1, te2 and te3.

With the simplification above, here is the formula to compute each table:

```
te0[i] = 2*i | 1*i | 1*i | 3*i
te1[i] = 3*i | 2*i | 1*i | 1*i
te2[i] = 1*i | 3*i | 2*i | 1*i
te3[i] = 1*i | 1*i | 3*i | 2*i
s = te0[c0] ^ te1[c1] ^ te2[c2] ^ te3[c3]
```

However, it is still possible to generate better tables, that would not only account for the MixColumns but also for the SubBytes operation. To demonstrate this, we just replace the column `c0|c1|c2|c3`

from above with the actual columns that we have in the AES cipher, that are `S[b0]|S[b5]|S[b10]|S[b15]`

, `S[b4]|S[b9]|S[b14]|S[b3]`

, `S[b8]|S[b13]|S[b2]|S[b7]`

and `S[b12]|S[b1]|S[b6]|S[b11]`

. Using the output state of the MixColumns operation, we have:

```
R0 R1 R2 R3
+----+----+-----+-----+
| r0 | r4 | r8 | r12 |
+----+----+-----+-----+
| r1 | r5 | r9 | r13 |
+----+----+-----+-----+
| r2 | r6 | r10 | r14 |
+----+----+-----+-----+
| r3 | r7 | r11 | r15 |
+----+----+-----+-----+
R0 = (2*S[b0] | 1*S[b0] | 1*S[b0] | 3*S[b0]) ^
(3*S[b5] | 2*S[b5] | 1*S[b5] | 1*S[b5]) ^
(1*S[b10] | 3*S[b10] | 2*S[b10] | 1*S[b10]) ^
(1*S[b15] | 1*S[b15] | 3*S[b15] | 2*S[b15])
R1 = (2*S[b4] | 1*S[b4] | 1*S[b4] | 3*S[b4]) ^
(3*S[b9] | 2*S[b9] | 1*S[b9] | 1*S[b9]) ^
(1*S[b14] | 3*S[b14] | 2*S[b14] | 1*S[b14]) ^
(1*S[b3] | 1*S[b3] | 3*S[b3] | 2*S[b3])
R2 = (2*S[b8] | 1*S[b8] | 1*S[b8] | 3*S[b8]) ^
(3*S[b13] | 2*S[b13] | 1*S[b13] | 1*S[b13]) ^
(1*S[b2] | 3*S[b2] | 2*S[b2] | 1*S[b2]) ^
(1*S[b7] | 1*S[b7] | 3*S[b7] | 2*S[b7])
R3 = (2*S[b12] | 1*S[b12] | 1*S[b12] | 3*S[b12]) ^
(3*S[b1] | 2*S[b1] | 1*S[b1] | 1*S[b1]) ^
(1*S[b6] | 3*S[b6] | 2*S[b6] | 1*S[b6]) ^
(1*S[b11] | 1*S[b11] | 3*S[b11] | 2*S[b11])
```

This decomposition shows that it is possible to include the SubBytes step in the lookup tables. The new formula to generate those tables is:

```
te0[i] = 2*S[i] | 1*S[i] | 1*S[i] | 3*S[i]
te1[i] = 3*S[i] | 2*S[i] | 1*S[i] | 1*S[i]
te2[i] = 1*S[i] | 3*S[i] | 2*S[i] | 1*S[i]
te3[i] = 1*S[i] | 1*S[i] | 3*S[i] | 2*S[i]
R0 = te0[b0] ^ te1[b5] ^ te2[b10] ^ te3[b15]
R1 = te0[b4] ^ te1[b9] ^ te2[b14] ^ te3[b3]
R2 = te0[b8] ^ te1[b13] ^ te2[b2] ^ te3[b7]
R3 = te0[b12] ^ te1[b1] ^ te2[b6] ^ te3[b11]
```

Those are the T-tables used by Golang it its implementation. The source code for Golang’s AES encryption can be found in src/crypto/aes/aes.go, but now a similar implementation can be developed easily:

```
var sbox [256]byte = [256]byte { /* ... */ }
var te0 [256]uint32 = [256]uint32{ /* ... */ }
var te1 [256]uint32 = [256]uint32{ /* ... */ }
var te2 [256]uint32 = [256]uint32{ /* ... */ }
var te3 [256]uint32 = [256]uint32{ /* ... */ }
// Encrypts a 16-bytes block src using AES encryption with
// extended key xk and for nr rounds.
// It returns the encrypted block.
func encrypt(src []byte, xk []uint32, nr int) []byte {
var tmp0, tmp1, tmp2, tmp3 uint32
// Initialize the state, stored in s0, s1, s2 and s3
s0 := uint32(src[0])<<24 | uint32(src[1])<<16 | uint32(src[2])<<8 | uint32(src[3])
s1 := uint32(src[4])<<24 | uint32(src[5])<<16 | uint32(src[6])<<8 | uint32(src[7])
s2 := uint32(src[8])<<24 | uint32(src[9])<<16 | uint32(src[10])<<8 | uint32(src[11])
s3 := uint32(src[12])<<24 | uint32(src[13])<<16 | uint32(src[14])<<8 | uint32(src[15])
// Add the first round key to the state
s0 ^= xk[0]
s1 ^= xk[1]
s2 ^= xk[2]
s3 ^= xk[3]
for i:= 1; i < nr; i++ {
// This performs SubBytes + ShiftRows + MixColumns + AddRoundKey
tmp0 = te0[s0>>24] ^ te1[s1>>16&0xff] ^ te2[s2>>8&0xff] ^ te3[s3&0xff] ^ xk[4*i]
tmp1 = te0[s1>>24] ^ te1[s2>>16&0xff] ^ te2[s3>>8&0xff] ^ te3[s0&0xff] ^ xk[4*i+1]
tmp2 = te0[s2>>24] ^ te1[s3>>16&0xff] ^ te2[s0>>8&0xff] ^ te3[s1&0xff] ^ xk[4*i+2]
tmp3 = te0[s3>>24] ^ te1[s0>>16&0xff] ^ te2[s1>>8&0xff] ^ te3[s2&0xff] ^ xk[4*i+3]
s0, s1, s2, s3 = tmp0, tmp1, tmp2, tmp3
}
// Perform the last SubBytes + ShiftRows + AddRoundKey
tmp0 = (uint32(sbox[s0>>24])<<24 | uint32(sbox[s1>>16&0xff])<<16 | uint32(sbox[s2>>8&0xff])<<8 | uint32(sbox[s3&0xff])) ^ xk[4*nr]
tmp1 = (uint32(sbox[s1>>24])<<24 | uint32(sbox[s2>>16&0xff])<<16 | uint32(sbox[s3>>8&0xff])<<8 | uint32(sbox[s0&0xff])) ^ xk[4*nr+1]
tmp2 = (uint32(sbox[s2>>24])<<24 | uint32(sbox[s3>>16&0xff])<<16 | uint32(sbox[s0>>8&0xff])<<8 | uint32(sbox[s1&0xff])) ^ xk[4*nr+2]
tmp3 = (uint32(sbox[s3>>24])<<24 | uint32(sbox[s0>>16&0xff])<<16 | uint32(sbox[s1>>8&0xff])<<8 | uint32(sbox[s2&0xff])) ^ xk[4*nr+3]
s0, s1, s2, s3 = tmp0, tmp1, tmp2, tmp3
// Unpack the state into the destination block. For performance purpose dst could be passed to the function
// rather than being allocated each time.
dst := make([]byte, 16)
dst[0], dst[1], dst[2], dst[3] = byte(s0>>24), byte(s0>>16), byte(s0>>8), byte(s0)
dst[4], dst[5], dst[6], dst[7] = byte(s1>>24), byte(s1>>16), byte(s1>>8), byte(s1)
dst[8], dst[9], dst[10], dst[11] = byte(s2>>24), byte(s2>>16), byte(s2>>8), byte(s2)
dst[12], dst[13], dst[14], dst[15] = byte(s3>>24), byte(s3>>16), byte(s3>>8), byte(s3)
return dst
}
```

Again, please do not use this code for anything else than playing with it. My way of generating T-tables rely on the pre-computation of the multiplication by 2 and by 3 and on a pre-computed S-Box. Those values can be found on the Internet, and are available here.

```
// S-box values
var sbox [256]byte = [256]byte{ /* ... */ }
// Multiplication by 2 in GF(256)
var gmul2 [256]byte = [256]byte{ /* ... */ }
// Multiplication by 3 in GF(256)
var gmul3 [256]byte = [256]byte{ /* ... */ }
func generateEncryptionTable() [4][256]uint32 {
var w uint32
// res[0] is te0, res[1] is te1, res[2] is te2, res[3] is te3
var res [4][256]uint32
for i := 0; i < 256; i++ {
w = uint32(gmul2[sbox[i]])<<24 | uint32(sbox[i])<<16 | uint32(sbox[i])<<8 | uint32(gmul3[sbox[i]])
// te0[i], te1[i], te2[i] and te3[i] can be built from a single 32-bit value that is rotated.
res[0][i] = w
res[1][i] = w<<24 | w>>8
res[2][i] = w<<16 | w>>16
res[3][i] = w<<8 | w>>24
}
return res
}
```

So far, the high-level view of the AES encryption algorithm has been presented. Using only the FIPS-197 standard, a first implementation has been done. Then, the concept of T-tables has been presented, how to generate and how to use them. Encryption being done, we can move on to the decryption algorithm.

### Presentation of AES Decryption

Quite simply, the decryption process reverses every operation performed during encryption. The pre-requisites are the same as for encryption: we consider decrypting a 16-bytes block given an extended key and a number of rounds. The high-level decryption algorithm is:

```
// Decrypts a 16-bytes block src using AES decryption with
// extended key xk and for nr rounds.
// It returns the decrypted block.
func decrypt(src, xk []byte, nr int) []byte {
state := make([]byte, 16)
// Initialize the state with the ciphertext block
copy(state, src)
// Add the last round key to the state
AddRoundKey(xk[nr*16:(nr+1)*16], state)
for i:= nr-1; i > 0; i-- {
InvShiftRows(state)
InvSubBytes(state)
AddRoundKey(xk[i*16:(i+1)*16], state)
InvMixColumns(state)
}
InvShiftRows(state)
InvSubBytes(state)
AddRoundKey(xk[:16], state)
return state
}
```

Three of those inversion operations are easy to implement because the underlying operations are easy to reverse:

- AddRoundKey: This operations XORs the state with the round key, hence it reverses itself because
`c ^ k ^ k = c`

; - InvSubBytes: This operation substitutes every byte of the state using an inverse S-Box, such that
`IS[S[b]] = b`

; - InvShiftRows: This operation shifts rows to the right by a certain number of cells (During encryption ShiftRows rotates them to the left).

The InvMixColumns works the same as MixColumns: by left-multiplying the column to transform with a matrix. This matrix is a different one, which involves factors 9, 11, 13 and 14. Again, the multiplication is done in GF(256) and do not behave like the ‘usual’ multiplication. Pre-computed tables for those factors can be found in Wikipedia’s dedicated article. Given a column c `c0|c1|c2|c3`

, the computation of InvMixColumns yields a new column s `s0|s1|s2|s3`

:

```
s0 = 14*c0 ^ 11*c1 ^ 13*c2 ^ 9*c3
s1 = 9*c0 ^ 14*c1 ^ 11*c2 ^ 13*c3
s2 = 13*c0 ^ 9*c1 ^ 14*c2 ^ 11*c3
s3 = 11*c0 ^ 13*c1 ^ 9*c2 ^ 14*c3
s = s0 | s1 | s2 | s3
s = (14*c0 ^ 11*c1 ^ 13*c2 ^ 9*c3) | <- Replace each element with its value
(9*c0 ^ 14*c1 ^ 11*c2 ^ 13*c3) |
(13*c0 ^ 9*c1 ^ 14*c2 ^ 11*c3) |
(11*c0 ^ 13*c1 ^ 9*c2 ^ 14*c3)
s = (14*c0 | 9*c0 | 13*c0 | 11*c0) ^ <- Re-order terms
(11*c1 | 14*c1 | 9*c1 | 13*c1) ^
(13*c2 | 11*c2 | 14*c2 | 9*c2) ^
(9*c3 | 13*c3 | 11*c3 | 14*c3)
```

Using the same simplifications as for encryption, it is possible to create 4 T-tables that can be used to implement InvMixColumns using only lookups.

However, merging all those operations as was done for encryption is not directly possible. In particular, the round key must be added **before** the InvMixColumns operation, which poses a problem because the order of those operations matters. In order to go through the simplification for one column, we’ll use the column `c = d^k`

. `d`

represents a column `d0|d1|d2|d3`

of the state after InvSubBytes and InvShiftRows, while `k`

represents the round key `k0|k1|k2|k3`

that was added by AddRoundKey. We have:

```
s = (14*c0 | 9*c0 | 13*c0 | 11*c0) ^
(11*c1 | 14*c1 | 9*c1 | 13*c1) ^
(13*c2 | 11*c2 | 14*c2 | 9*c2) ^
(9*c3 | 13*c3 | 11*c3 | 14*c3)
s = (14*(d0^k0) | 9*(d0^k0) | 13*(d0^k0) | 11*(d0^k0)) ^ <- Replace terms
(11*(d1^k1) | 14*(d1^k1) | 9*(d1^k1) | 13*(d1^k1)) ^
(13*(d2^k2) | 11*(d2^k2) | 14*(d2^k2) | 9*(d2^k2)) ^
(9*(d3^k3) | 13*(d3^k3) | 11*(d3^k3) | 14*(d3^k3))
s = (14*c0 | 9*c0 | 13*c0 | 11*c0) ^ <- Separate column and key
(11*c1 | 14*c1 | 9*c1 | 13*c1) ^
(13*c2 | 11*c2 | 14*c2 | 9*c2) ^
(9*c3 | 13*c3 | 11*c3 | 14*c3) ^
(14*k0 | 9*k0 | 13*k0 | 11*k0) ^
(11*k1 | 14*k1 | 9*k1 | 13*k1) ^
(13*k2 | 11*k2 | 14*k2 | 9*k2) ^
(9*k3 | 13*k3 | 11*k3 | 14*k3)
s = InvMixColumns(d) ^ InvMixColumns(k)
```

This shows that it is possible, just like for encryption, to add the key **after** the computation of InvSubBytes, InvShiftRows and InvMixColumns, but we have to modify the round key `k`

into InvMixColumns(k). We’ll see how to deal with that key in a moment, for now here is how to include the InvSubBytes computation in tables in order to optimise the decryption algorithm. In the end, the T-tables td0, td1, td2 and td3 used for decryption are:

```
td0[i] = 14*IS[i] | 9*IS[i] | 13*IS[i] | 11*IS[i]
td1[i] = 11*IS[i] | 14*IS[i] | 9*IS[i] | 14*IS[i]
td2[i] = 13*IS[i] | 11*IS[i] | 14*IS[i] | 9*IS[i]
td3[i] = 9*IS[i] | 13*IS[i] | 11*IS[i] | 14*IS[i]
```

Using pre-computed values for the multiplication and an inverse S-box, the generation algorithm is given.

```
// Inverse S-box values
var invSbox [256]byte = [256]byte{ /* ... */ }
// Multiplication by 9 in Galois's field
var gmul9 [256]byte = [256]byte{ /* ... */ }
// Multiplication by 11 in Galois's field
var gmul11 [256]byte = [256]byte{ /* ... */ }
// Multiplication by 13 in Galois's field
var gmul13 [256]byte = [256]byte{ /* ... */ }
// Multiplication by 14 in Galois's field
var gmul14 [256]byte = [256]byte{ /* ... */ }
func generateDecryptionTable() [4][256]uint32 {
var w uint32
// res[0] is td0, res[1] is td1, res[2] is td2, res[3] is td3
var res [4][256]uint32
for i := 0; i < 256; i++ {
w = uint32(gmul14[invSbox[i]])<<24 | uint32(gmul9[invSbox[i]])<<16 | uint32(gmul13[invSbox[i]])<<8 | uint32(gmul11[invSbox[i]])
// td0[i], td1[i], td2[i] and td3[i] can be built from a single 32-bit value that is rotated.
res[0][i] = w
res[1][i] = w<<24 | w>>8
res[2][i] = w<<16 | w>>16
res[3][i] = w<<8 | w>>24
}
return res
}
```

In Golang’s AES implementation, the expansion key algorithm actually generates a pair of key: one for encryption and one suited for decryption. The extended decryption key that will be used can be derived from the extended encryption key: according to the initial decryption algorithm we need to apply InvMixColumns to the round keys `1..nr-1`

(inclusive). This is because the last round key is used for initialisation and the last round key is used on the final state, which didn’t pass through InvMixColumns.

Now that the T-tables are generated, it is possible to use them in the actual decryption algorithm, making it both shorter and faster. For simplicity sake, the algorithm to derive the extended decryption key from the extended encryption key is given in an additional function. However, Golang’s approach is nicer as it hides those different keys from the user.

```
var invSbox [256]byte = [256]byte { /* ... */ }
var sbox [256]byte = [256]byte { /* ... */ }
var td0 [256]uint32 = [256]uint32{ /* ... */ }
var td1 [256]uint32 = [256]uint32{ /* ... */ }
var td2 [256]uint32 = [256]uint32{ /* ... */ }
var td3 [256]uint32 = [256]uint32{ /* ... */ }
// createDecKey derives an extended decryption key from an expanded encryption key
func createDecKey(xk []uint32) []uint32 {
dec := make([]uint32, len(xk))
nr := len(xk) / 4
// Copy the first and last round key which do not change
for i := 0; i < 4; i++ {
dec[i] = xk[i]
dec[(nr-1)*4+i] = xk[(nr-1)*4+i]
}
// Apply InvMixColumn to all other round keys
for i := 4; i < (nr-1)*4; i++ {
// Our decryption T-tables perform InvMixColumn(IS[b]), we use
// the S-box to reverse the effects of the inverse S-box:
// InvMixColumns(S[IS[b]]) = InvMixColumns(b)
dec[i] = td0[sbox[xk[i]>>24]] ^ td1[sbox[xk[i]>>16&0xff]] ^ td2[sbox[xk[i]>>8&0xff]] ^ td3[sbox[xk[i]&0xff]]
}
return dec
}
// Decrypts a 16-bytes block src using AES decryption with
// extended key xk and for nr rounds.
// It returns the decrypted block.
func decrypt(src []uint32, xk []byte, nr int) []byte {
var tmp0, tmp1, tmp2, tmp3 uint32
// Initialize the state, stored in s0, s1, s2 and s3
s0 := uint32(src[0])<<24 | uint32(src[1])<<16 | uint32(src[2])<<8 | uint32(src[3])
s1 := uint32(src[4])<<24 | uint32(src[5])<<16 | uint32(src[6])<<8 | uint32(src[7])
s2 := uint32(src[8])<<24 | uint32(src[9])<<16 | uint32(src[10])<<8 | uint32(src[11])
s3 := uint32(src[12])<<24 | uint32(src[13])<<16 | uint32(src[14])<<8 | uint32(src[15])
// Add the last round key to the state
s0 ^= xk[nr*4]
s1 ^= xk[nr*4+1]
s2 ^= xk[nr*4+2]
s3 ^= xk[nr*4+3]
for i:= nr-1; i > 0; i-- {
// This performs InvShiftRows + InvSubBytes + AddRoundKey + InvMixColumns
tmp0 = td0[s0>>24] ^ td1[s3>>16&0xff] ^ td2[s2>>8&0xff] ^ td3[s1&0xff] ^ xk[4*i]
tmp1 = td0[s1>>24] ^ td1[s0>>16&0xff] ^ td2[s3>>8&0xff] ^ td3[s2&0xff] ^ xk[4*i+1]
tmp2 = td0[s2>>24] ^ td1[s1>>16&0xff] ^ td2[s0>>8&0xff] ^ td3[s3&0xff] ^ xk[4*i+2]
tmp3 = td0[s3>>24] ^ td1[s2>>16&0xff] ^ td2[s1>>8&0xff] ^ td3[s0&0xff] ^ xk[4*i+3]
s0, s1, s2, s3 = tmp0, tmp1, tmp2, tmp3
}
// Perform the last InvSubBytes + InvShiftRows + AddRoundKey
tmp0 = (uint32(invSbox[s0>>24])<<24 | uint32(invSbox[s3>>16&0xff])<<16 | uint32(invSbox[s2>>8&0xff])<<8 | uint32(invSbox[s1&0xff])) ^ xk[0]
tmp1 = (uint32(invSbox[s1>>24])<<24 | uint32(invSbox[s0>>16&0xff])<<16 | uint32(invSbox[s3>>8&0xff])<<8 | uint32(invSbox[s2&0xff])) ^ xk[1]
tmp2 = (uint32(invSbox[s2>>24])<<24 | uint32(invSbox[s1>>16&0xff])<<16 | uint32(invSbox[s0>>8&0xff])<<8 | uint32(invSbox[s3&0xff])) ^ xk[2]
tmp3 = (uint32(invSbox[s3>>24])<<24 | uint32(invSbox[s2>>16&0xff])<<16 | uint32(invSbox[s1>>8&0xff])<<8 | uint32(invSbox[s0&0xff])) ^ xk[3]
s0, s1, s2, s3 = tmp0, tmp1, tmp2, tmp3
// Unpack the state into the destination block. For performance purpose dst could be passed to the function
// rather than being allocated each time.
dst := make([]byte, 16)
dst[0], dst[1], dst[2], dst[3] = byte(s0>>24), byte(s0>>16), byte(s0>>8), byte(s0)
dst[4], dst[5], dst[6], dst[7] = byte(s1>>24), byte(s1>>16), byte(s1>>8), byte(s1)
dst[8], dst[9], dst[10], dst[11] = byte(s2>>24), byte(s2>>16), byte(s2>>8), byte(s2)
dst[12], dst[13], dst[14], dst[15] = byte(s3>>24), byte(s3>>16), byte(s3>>8), byte(s3)
return dst
}
```

### Conclusion

One last time, do not use any of the code presented here for anything serious. The inner working of the AES algorithm has been presented, along with an optimisation that is used in real-life implementations: T-tables. None of this is very difficult and there are several resources available on the Internet on this subject, but this optimisation is interesting nonetheless.

I initially looked at Golang’s implementation when the Matasano challenges asked to implement AES. I’ll keep working on them in order to improve my skills in cryptography, and will probably look at more implementations of crypto algorithms in the future.