Gimli: a cross-platform permutation

In a joint paper with Daniel J. Bernstein, Stefan Kölbl, Stefan Lucks, Pedro Maat Costa Massolino, Florian Mendel, Kashif Nawaz, Tobias Schneider, François-Xavier Standaert, Yosuke Todo, and Benoît Viguier, we designed and implemented Gimli, a 384-bit permutation that achieves high performance across a broad variety of platforms. For more details and the software and hardware implementations, see the Gimli website.

Kyber: a module-lattice-based KEM

In a joint paper with Joppe Bos, Léo Ducas, Eike Kiltz, Tancrède Lepoint, Vadim Lyubashevsky, John M. Schanck, and Damien Stehlé, we describe Kyber, a high-speed post-quantum-secure key-encapsulation mechanism (KEM), and (authenticated) key-exchange protocols based on module lattices. The software is available from the Kyber github repository.

A high-speed KEM from NTRU

In a joint paper with Andreas Hülsing, Joost Rijneveld, and John Schanck, we describe a high-speed post-quantum-secure key-encapsulation mechanism (KEM) based on NTRU. The software is available from Joost's website.

TESLA signature software

In a joint paper with Erdem Alkım, Nina Bindel, Johannes Buchmann, and Özgür Dagdelen, we describe a conservative, lattice-based signature scheme called TESLA. To download and build the software, run the following commands (you will have to install Python3 first):

tar xjvf tesla-20160930.tar.bz2
cd tesla-20160930/tesla-416/ref && make
cd ../avx2 && make
cd ../../tesla-768/ref && make
cd ../avx2 && make

This will build binaries called test_sign, testvectors and speed in the test/ subdirectories of each respective implemetnation. For a brief description of those binaries see the README.

Please note that versions previous to 20160930 had a subtle bug in the rounding. Please upgrade to the latest version.

gfverif: fast and easy verification of finite-field arithmetic

Together with Daniel J. Bernstein I am working on gfverif, a tool to verify the correctness of ECC software. For details please refer to the gfverif website.

"Newhope" Ring-LWE-based key exchange

In a paper with Erdem Alkim, Léo Ducas, and Thomas Pöppelmann, we describe a key-exchange protocol based on the Ring-Learning-with-Errors (Ring-LWE) problem. The paper describes two different software implementations: a C reference implementation and an optimized software implementation targeting Intel Haswell processors. To download and build the software follow these instructions:

tar xjvf newhope-20160815.tar.bz2
cd newhope-20160815/ref && make
cd ../avx2 && make
cd ../torref && make
cd ../toravx2 && make
This will build the binaries

The test_newhope binary performs various key exchanges and outputs nothing if they are successful. The test_statistical generates many keys without final hashing and prints statistics on the distribution of ones and zeroes in those keys. The testvectors binary runs many key exchanges with deterministic "random-number" generation to ensure compatibility of independent implemtenations and the speed binary performs benchmarks and prints results.

The torref and toravx2 implementations use a different, constant-time, generation of the polynomial a from a seed as described in the paper. The testvectors generated by those implementations are not compatible with the testvectors generated by ref and avx2.

Aside from the implementations of the key exchange, the software package also contains 5 Python scripts in the scripts subdirectory. For details of those scripts please refer to the paper.

A tiny hardware implementation of NaCl's crypto_box

In a joint paper with Michael Hutter, Jürgen Schilling, and Wolfgang Wieser, we describe an ASIC design for NaCl's crypto_box public-key authenticated encryption routine aiming at IoT devices like WISP nodes. The VHDL source code is in the public domain.

High-speed Curve25519 on embedded microcontrollers

In a joint paper with Michael Düll, Björn Haase, Gesine Hinterwälder, Michael Hutter, Christof Paar, Ana Helena Sánchez, and Peter Schwabe we describe high-speed Curve25519 software for AVR ATmega, TI MSP430, and ARM Cortex-M0 microcontrollers. The software is available on the μNaCl website:

SPHINCS: practical stateless hash-based signatures

In a joint paper with Daniel J. Bernstein, Daira Hopwood, Andreas Hülsing, Tanja Lange, Ruben Niederhagen, Louiza Papachristodoulou, and Zooko Wilcox-O'Hearn we describe a stateless hash-based signature system and software for a particular set of parameters that offers 128 bits of security against attacks that include attacks by quantum computers. The software is included in the SUPERCOP benchmark suite in subdirectory crypto_sign/sphincs256. For more details on SPHINCS, see also the SPHINCS website.

Fast multiprecision multiplication on AVR ATmega

In a joint paper with Michael Hutter we describe speed-record-setting multiprecision multiplication software for the AVR ATmega family of embedded microcontrollers. This software uses the Karatsuba multiplication technique and covers input sizes of 48, 64, 96, 128, 160, 192, and 256 bits. To download the software and build test routines for the ATmega2560, do the following:

tar xjvf avrmul-20150101.tar.bz2
cd avrmul-20150101/

For running the tests we use an Arduino Mega development board. There are two scripts that handle flashing test software and displaying output; both assume that the Arduino is connected at /dev/ttyACM0. The script test/ uploads the hex file test/test_mul_atmega2560.hex, which runs 1000 tests on pseudorandom inputs for each input size. The script test/ uploads the hex file test/stack_mul_atmega2560.hex, which measures and prints stack usage of the multiplication routines.

Vectorized genus-2 Diffie-Hellman

In a joint paper with Daniel J. Bernstein, Chitchanok Chuengsatiansup, and Tanja Lange we present speed records for 128-bit secure Diffie-Hellman key exchange using vectorized arithmetic on a Kummer surface. The software is included in the SUPERCOP benchmarking framework in the crypto_scalarmult/kummer subdirectory.

Formally verified Curve25519

In joint work with Yu-Fang Chen, Chang-Hong Hsu, Hsin-Hung Lin, Ming-Hsien Tsai, Bow-Yaw Wang, Bo-Yin Yang, and Shang-Yi Yang we formally verified hand-optimized assembly code performing the core arithmetic operation in Curve25519. More specifically, we verified the fully inlined Montgomery ladderstep of the two AMD64 implementations of Curve25519 presented in the Ed25519 paper. For details of the verification, see the paper.
To reproduce the verification procedure, you first need to install OCaml (version 4.x or higher), Coq, and Boolector with minisat support (note that the version of Boolector included in Debian does not have minisat support; also note that Boolector requires glibc-static, zlib-static, and libstdc++-static which are not installed by default on Fedora/Redhat):

sudo apt-get install ocaml
sudo apt-get install coq
tar xzvf boolector-1.6.0-with-sat-solvers.tar.gz
cd boolector-1.6.0-with-sat-solvers/
cd ..
Now download and unpack our verification archive and compile the qhasm-to-boolector translator:
tar xjvf verify25519-20140323.tar.bz2
cd verify25519-20140323/translator
cd ..
You see four subdirectories: To verify the correctness of an annotated qhasm file—for example the file annotated-qhasm/curve25519-5limb/fe25519_mul.q, run the following commands
./translator/qv.native -c annotated-qhasm/consts -s annotated-qhasm/curve25519-5limb/fe25519_mul.q
cd vc
for f in ./*;do ../../boolector-1.6.0-with-sat-solvers/boolector/boolector -minisat $f;done
To reproduce the coq proofs, do the following:
cd coq
coqc Modulo.v
coqtop < proposition-1.v
coqtop < proposition-2.v

If you are only interested in the btor input files and the heuristics as described in the paper, you can simply download the archive of btor files, which also includes a README with explanations.


TweetNaCl is a cryptographic library that offers the all the 25 NaCl functions used by applications and uses only 100 tweets. TweetNaCl is joint work by Daniel J. Bernstein, Wesley Janssen, Tanja Lange, and Peter Schwabe. For more information about TweetNaCl, see

NaCl for AVR microcontrollers

If you are interested in using NaCl for AVR, please refer to the most recent version of μNaCl.
The software listed below is still available as additional material accompanying the Africacrypt 2013 paper.

In the paper NaCl on 8-bit AVR microcontrollers we describe implementations of the cryptographic primitives used in the core of the Networking and Cryptography library (NaCl) for AVR microcontrollers. Specifically, the software implements the following functions from the NaCl API:

The library comes in two flavors, one optimized for speed, one optimized for small ROM usage. To build the high-speed version, do the following (for the small-ROM version everything works analogously):

tar xjvf avrnacl-20130514.tar.bz2
cd avrnacl-20130514/highspeed

This will produce the libnacl.a in the obj/ subdirectory and various tests in the test/ subdirectory for the ATmega2560 microcontroller. To run various unit tests on an ATmega2560, simply run


This will copy the testing binary to an Arduino MEGA (connected on /dev/ttyACM0) and run the tests. The results of the tests are sent through the serial device and read and evaluated automatically.
Similary, to perform speed benchmarks, run


and to run stack-usage benchmarks, run


Practical post-quantum (lattice-based) signatures

The paper Software speed records for lattice-based signatures describes an implementation of lattice-based signatures at a 100-bit security level. The software is available either as submission to the eBACS benchmarking project or as standalone package. The link to the eBACS submission is
To download and build the standalone package, do the following:

tar xjvf lattisigns512avx-20130329.tar.bz2
cd lattisigns512avx-20130329

This will produce two binaries in the test/ subdirectory: test_sign performs various tests and prints zeros if they are successful; speed will perform various performance measurements. Note that in order to build and run the software you will need a processor that supports the AVX vector-instruction set.

Fast cryptography for ARM CPUs with NEON vector instructions

The paper NEON crypto describes high-performance software of various symmetric and asymmetric cryptographic primitives for the ARM Cortex-A8 processor and other CPUs that support the NEON vector instruction set. Most of the software (all but the Ed25519 signatures) are included in the SUPERCOP benchmarking suite since version 20120918. Different version of the Salsa20 software can be found in the subdirectories crypto_stream/salsa20/armneon, crypto_stream/salsa20/armneon2, crypto_stream/salsa20/armneon3, and crypto_stream/salsa20/armneon6. The Poly1305 authentication software is in the subdirectory crypto_onetimeauth/poly1305/neon2. The Curve25519 ECDH software is in the subdirectory crypto_scalarmult/curve25519/neon2.
The Ed25519 digital-signature software is going to be included in SUPERCOP, soon. We are planning to include all these implementations in the next release of the Networking and Cryptography Library (NaCl).

ARM11 implementations of all SHA-3 finalists

The paper SHA-3 on ARM11 processors describes optimized assembly implementations of the 256-bit-output versions of all SHA-3 finalists and of SHA-256. These implementations are included in the arm11 subdirectories of the respective hash function implementations in SUPERCOP since version 20111120. For example the implementation of Blake256 is in the subdirectory crypto_hash/blake256/arm11/.

NaCl: Networking and Cryptography library

The NaCl library is designed to offer high-speed high-security cryptographic protection of network communication. Initial development of the library was done as part of the EU FP7 project Computer Aided Cryptography Engineering (CACE). Further development of the library is part of the European Network of Excellence in Cryptology II (ECRYPT II).
For information about the library, see

Ed25519: high-speed high-security signatures

The paper High-speed high-security signatures describes a public-key signature scheme with an implementation that sets new speed records for (batched) verification of elliptic-curve signatures. We put the software described in the paper into the public domain. It can be found in the SUPERCOP benchmarking suite from version 20110629 in the crypto_sign/ed25519/ subdirectory. The software will be furthermore be included in future versions of the NaCl library. For more details also see Daniel J. Bernstein's Ed25519 website.

Warning: Early versions of the software had a bug in the carry chain after multiplication and squaring in the amd64-64-24k implementation. Tests with random inputs will not catch the bug, it is triggered with with very low probability. This bug is in the software contained in SUPERCOP up to version 20130419.
The bug also affects the crypto_scalarmult/curve25519/amd64-64 implementation in SUPERCOP up to version 20130419.

RFSB509: really fast syndrome-based hashing

In the paper Really fast syndrome-based hashing we describe a family of collision-resistant compression functions that can be used as the core building block for fast and secure cryptographic hash functions. The paper also describes a software implementation that uses a particular choice of parameters targeting the 128-bit security level. This implementation of RFSB509 is included in SUPERCOP in the subdirectory crypto_hash/rfsb509.

DCLXVI: Pairing computation on AMD64 processors

In the paper New software speed records for cryptographic pairings we describe an implementation of cryptographic pairings over a Barreto-Naehrig curve that sets new speed records for cryptographic pairings on many amd64 processors. We put the software into the public domain.
To download and build the latest version of the software do the following:

tar xjvf dclxvi-20130329.tar.bz2
cd dclxvi-20130329/

This will produce 10 binaries: bilintest-check, bilintest-c, bilintest-as, speedtest-check, speedtest-c, and speedtest-as, test_curvepoint_multiscalar-as, test_curvepoint_multiscalar-check, test_twistpoint_multiscalar-as, test_twistpoint_multiscalar-check.
The binaries ending on -check perform all arithmetic with included overflow check as described in the paper. The files ending on -c are non-optimized implementations in C and the implementations ending on -as have all speed-critical functions implemented in qhasm.
The bilintest binaries perform NTESTS bilinearity (and non-degeneracy) tests, the value NTESTS is defined in the Makefile. The speedtest binaries measure the cycles counts of the most speed-critical functions.
Since version 20130329, DCLXVI also has faster (but not timing-attack protected!) functions for scalar multiplication and multi-scalar multiplication on the curve and the twist. The test_curvepoint_multiscalar and test_twistpoint_multiscalar binaries perform simple tests of this multi-scalar multiplication on curve and twist, respectively.

Caution: The software as described in early versions of the paper (versions 2010-05-28 [pdf] and 2010-04-06 [pdf]) has a bug: The parameter u=1966080 used to construct the curve we use for the pairing does not generate a Barreto-Naehrig curve; the order of the group E(Fp) is not prime. We thank Francisco Rodríguez- Henríquez and Jean-Luc Beuchat for pointing out this bug.
The software from version 20100618 now uses a different curve generated with u=1868033, the speed of this updated software is similar, actually even slightly faster.
For verification of the performance numbers the old version of the software is still available as dclxvi-notsecure.tar.bz2; do not use this software for cryptographic purposes.

Update: An even faster pairing implementation has been presented by Jean-Luc Beuchat, Jorge Enrique González Díaz, Shigeo Mitsunari, Eiji Okamoto, Francisco Rodríquez-Henríquez and Tadanori Teruya.
But our software is a bit (read: one bit) more secure.

FSBday: a parallel implementation of Wagner's generalized birthday attack

In the paper FSBday: Implementing Wagner's generalized birthday attack against the SHA-3 round-1 candidate FSB we describe an implementation of Wagner's generalized birthday attack that runs on a cluster of 8 computers to find a collision in the compression function of the toy version FSB48 of the hash function FSB. The requirements to run this code are the following:

To run the attack on this cluster perform the following steps (see also the FSBday website):
Installing MPICH-2

The implementation uses the MPI implementation MPICH-2. We followed this installation guide to set up MPICH-2 on our cluster machines running Ubuntu 9.04 (aka Jaunty Jackalope).
As a remark: We had problems with the setup of MPI when the hostname was mapped to in /etc/hosts, we just removed this entry.

Downloading and running our code

To download and compile the code run

tar xjvf fsbday-20091213.tar.bz2
cd fsbday-20091213

If the data partition for the attack is not /dev/sda1, change the definition of IODEVICE in params.h accordingly before compiling, otherwise running the attack may destroy data on /dev/sda1!

This will produce a binary called fsbday, which you have to copy to each of the cluster nodes in the same directory. For this you can use the bash script

Running Phase 1 (Determine clamping constants)

The first phase of the attack (for the description of the two phases please read the paper) is started through the bash script, before running the script you might have to change the value of the HOSTFILE (the path to the MPI host file, set to ./hosts by default) in the script. The script takes as only argument the path to a logfile, so the first phase of the attack can be started with

./fsbstart fsbday.log

The only output on stdout is the clamping constants tried and whether they lead to success, more information including the values yielding a collision can be found in the log file.

Running Phase 2 (Compute colliding matrix positions)

To determine the matrix positions from the clamping constants you have to run the program fsbfinal twice, once for the left and once for the right half-tree. The fsbfinal program needs as input the clamping constants used to generate the lists on level 0 and the value on level 3 that leads to success. The clamping constants are output on standard output, the value you will find in the logfile n a line looking like

0: Result: 54e28b37d58e75 781538a2cdf5639d Part: 1f8

when you grep for "Result". The first number (before the colon) is the number of the node.
Let's assume the clamping constants in the rightmost quarter-tree are you determined in phase 1 are 4, 4, 0, 0, and let's assume the value as in the line above, then you first have to convert 54e28b37d58e75 to decimal, yielding 23892985608769141, same for 781538a2cdf5639d yielding 8652884530953544605 and for the part 1f8 yielding 504.
You then run fsbfinal as follows

mpirun -l -machinefile hosts -np 8 ./fsbfinal --nlists=8 --startlevel=0 --startlist=0 --cmp=23892985608769141 --remaining=8652884530953544605 --part=504 --node=0 0 0 0 0 0 0 0 0

mpirun -l -machinefile hosts -np 8 ./fsbfinal --nlists=8 --startlevel=0 --startlist=8 --cmp=23892985608769141 --remaining=8652884530953544605 --part=504 --node=0 0 0 0 0 0 0 4 4

This will print the positions (relative positions in the according blocks and half-blocks) to stderr.

For information about how to change to code to other attack scenarios please refer to the FSBday website.

A much more efficient way to find collisions in the full hash function FSB48 is Pollard's rho algorithm, i.e. a generic attack that works against any hash function. To download and run the code of such a simple attack against the full FSB48 hash function do the following:

tar xjvf fsb48-pollard.tar.bz2
cd fsb48-pollard

Cleaned-up implementation of FSB-256

Based on the reference implementation for ebash, I implemented a somewhat cleaned-up, faster version of FSB-256.

Curve25519 for the Cell Broadband Engine

In joint work with Neil Costigan I implemented a version of the elliptic-curve Diffie Hellman key exchange software curve25519 for the Cell Broadband Engine. There is a paper describing the details of the implementation.
The software is available for download; it's following the eBATS API.

Cache-timing resistant AES-GCM for 64-bit Intel processors

Emilia Käsper implemented very fast and cache-timing-attack resistant AES in counter mode. I ported this code to qhasm and we jointly implemented AES-GCM based on this fast AES code. For details of the implementation please refer to the paper "Faster and Timing-Attack Resistant AES-GCM".
The code follows the eSTREAM API, the following versions are avaible:

The assembly versions of the code are available on Emilia's website. The qhasm version uses macros, so you will need maq, a preprocessor for qhasm, to rebuild the code (the resulting assembly file is included).
You will furthermore need an extended AMD64 machine-description file for qhasm, it just goes to the qhasm directory.

AES implementations

D. J. Bernstein and I implemented AES in Counter mode for different architectures, namely ppc32, x86, sparcv9 and amd64. For several microarchitectures this code is setting new speed records. Details about how these speed records are achieved can be found in our joint paper "New AES software speed records" (PDF).
The software is now available as part of the estreambench toolkit (in the subdirectory /benchmarks/aes/aes-128/schwabe). We have placed the software into the public domain; feel free to integrate it into your own AES applications!

HECTOR: fast hyperelliptic-curve-based key exchange

HECTOR is an implementation of a key exchange and a signature scheme based on a hyperelliptic curve over the binary field F2113 for the eBATS project. HECTOR is joint work with Peter Birkner. For details of the implementation please refer to the documentation in pdf format.
To download and build HECTOR, first install the GMP library, then do the following:

tar xjvf hector.tar.bz2
cd hector

This builds two binaries, hector_dh and hector_sig, both need random bytes as input, they can for example be started with

./hector_dh < /dev/urandom
./hector_sig < /dev/urandom


BN curves

A BN curve is an elliptic curve E: Y2=X3+b defined over Fp such that the group of Fp-rational points has prime order n. BN curves have embedding degree k=12 with respect to n. The BN family of pairing-friendly curves is parametrized by the following polynomials in the indeterminate u:


The trace of Frobenius over Fp is given as t(u)=6u2+1. To find a BN curve just choose integer values of the suitable size for u until both p(u) and n(u) are prime. Then test values for the coefficient b until the curve has the correct order. A generator of the group E(Fp) is any nonzero Fp-rational point, but it can be chosen to be G=(1,y) for y a square root of b+1. For more detailed information see [1]. BN curves have sextic twists and are therefore suitable for very efficient pairing implementation. It is also possible to compute pairings on BN curves in a compressed form, see [2].

[1] Paulo S. L. M. Barreto, Michael Naehrig, Pairing-Friendly Elliptic Curves of Prime Order, Selected Areas in Cryptography -- SAC'2005, Lecture Notes in Computer Science 3897, Springer-Verlag (2006), pp 319--331. Preliminary version: Cryptology ePrint Archive, Report 2005/133.
[2] Michael Naehrig, Paulo S. L. M. Barreto, Peter Schwabe: On compressible pairings and their computation,
Progress in Cryptology - AfricaCrypt 2008 , Lecture Notes in Computer Science 5023, Springer-Verlag (2008), pp 371--388.
Cryptology ePrint Archive, Report 2007/429.

Pairing computation on BN curves

bn256 is an implementation of different cryptographic pairings on a 256 bit BN curve.
In order to build it, you need the GMP library with header files and makedepend, for Debian (Lenny) systems these can be installed with

aptitude install libgmp3-dev libgmp3c2 xutils-dev

For speed measurement we include cpucycles written by D. J. Bernstein.

To build bn256, do the following:

tar xjvf bn256-20090824.tar.bz2
cd bn256-20090824

After the build process has finished you find a binary called bn256 in the bin/ directory, whose usage should be self-explanatory. From the corresponding source file src/bn256.c you can see how to call the functions for pairing computation.
We note, that the generation of random points is based on the GMP functions for random number generation and must not be used for cryptographic purposes.
Parts of this code have been developed in the context of a UMIC project at the Institute for Theoretical Information Technology at RWTH Aachen University.