It is often said that you should not create your own cryptographic algorithm: in cryptography, one should not be original. Security lies on the beaten track, and in particular on the tracks beaten by cryptographers. What is often forgotten is that using cryptography is also perilous. In this article, we will discuss how to handle large files during decryption, and how to do so securely on a Linux system. The article is illustrated with Go code.
## What is a "large file" and how is it different from "small files"?
A small file is anything that can reasonably be stored in RAM in its entirety. RAM is a sanctuary where a program can store information; it offers a relatively strong guarantee of isolation from interference by other tasks and processes.
Conversely, a large file is one that cannot be stored in its entirety in the RAM of the system that will decrypt it. Interference can therefore occur, and it comes in several forms.
### Modifications of the encrypted data
The first problem is that the encrypted data can be modified by an external program. Indeed, as soon as a file is visible in a directory, any program running with the necessary privileges can open a file descriptor on it, and through that file descriptor modify the file. This includes programs running with the same privileges as the program in charge of decryption.
This is not necessarily a problem. The state of the art in applied cryptography is to use authenticated encryption. This type of encryption verifies the cryptographic integrity of the encrypted content as it decrypts it. Some algorithms and encryption modes, such as AES-GCM or AES-OCB, perform these operations in "one pass": the authenticity check is performed as the decryption proceeds, and an authenticity verdict is given at the end of the decryption. Alas, not all encryption modes are one-pass. AES-CCM, for example, first performs an integrity check (first pass) and then performs decryption (second pass) if the data was initially determined to be authentic. Unfortunately, an attacker may be able to alter the data between the first and the second pass. The decrypted content is then declared authentic, even though it has been altered.
Consequently, when a one-pass decryption mode is not used, it is necessary to use a private copy of the data, either in RAM or by other system tricks that will be detailed later in this document.
### Early use of data whose authenticity is not yet assured
When the decrypted data is too large to be stored in RAM, it is necessary to write it to disk. Unfortunately, security problems can occur during this passage on disk.
The list of these problems cannot be exhaustive, but one can easily imagine software that consumes the data prematurely. Indeed, some software monitors the contents of a directory with [inotify(7)](https://man7.org/linux/man-pages/man7/inotify.7.html); this is the case of most file explorers.
This early reading can result in incorrect interpretations of the file. That is not the most unfortunate consequence, however. In the case of one-pass encryption, or of integrity checking performed after decryption (see the OpenPGP section of this document), the file receiving the decrypted data may contain malicious code added by an attacker who corrupted the encrypted document. It is indeed very important to understand that encryption alone is not sufficient to guarantee the integrity of data: an integrity mechanism must be used, and verified, before the decrypted file is used. If the decrypted data is used before the authenticity verdict is given, then a vulnerability can be exploited with a non-authentic document.
Therefore, it is essential that the decrypted data remains private until the authenticity verdict is received. When it is stored in RAM, this is easy, but when it must be stored on disk, in the case of large files, then it is necessary to exploit some defensive strategies, discussed later in this document.
## Size reduction by fragmentation
One possible strategy for encrypting/decrypting a large file may be to break it up into chunks that will fit in RAM.
It might be tempting to look at encryption modes used for encrypting large amounts of data, like XTS. In a nutshell, XTS performs encryption/decryption using a single key, but it uses an encryption process that is "tweaked" to each logical unit. XTS is often used for disk encryption, and in this case the logical unit is the disk "sector". Under Linux, it is possible to use XTS with dm-crypt and its LUKS overlay.
XTS presents a rather interesting advantage for disk encryption. Indeed, the tweak for each logical unit is a piece of information intrinsic to the disk: the position of the sector on the disk. This subtlety means that it is not necessary to store additional data (the tweak) for each sector. There is therefore no storage overhead for encryption-related data!
Unfortunately, XTS offers limited integrity protection. Of course, if you take an encrypted sector and move it to another sector on the disk, the tweak of the algorithm will be incorrect. Indeed, if a sector is to be decrypted as sector X, and we try to decrypt it as sector Y, then the tweak will be incorrect, and the decryption will be meaningless. However, XTS does not protect against replacing sector X with an earlier version of sector X, for example.
Moreover, by splitting the disk into encrypted sectors, each with a different tweak, there is no protection against truncation. If one takes a disk of size X and copies it to a disk of size Y, with Y < X, the cryptographic algorithm will not detect that data is missing.
From the errors or constraints of disk encryption modes, several lessons can be learned. If the data to be encrypted/decrypted is too large to be stored in RAM, and the solution is to split it into chunks, then care must be taken to ensure that the integrity of the chunks is strong.
This integrity protection must ensure that:
1. each chunk cannot be modified individually, even by replacement with a chunk of another encrypted message of comparable size;
2. the order of the chunks cannot be modified;
3. it is not possible to add or remove chunks without the entire large file being considered invalid.
Problem 1 can easily be solved by using a separate encryption/decryption key per encrypted file, in combination with an algorithm and encryption mode that provide authenticated encryption, such as AES-GCM.
Problem 2 can easily be solved by adding a counter to each encrypted chunk. This is a storage overhead that can be afforded when we are talking about file encryption rather than disk encryption.
It might be possible to create a tweakable encryption mode that is also authenticated, for example by combining XTS and an HMAC. Alas, the consequence would be that the cryptographic operations would be in two passes (XTS then HMAC), which is a potentially unnecessary computational overhead if a better solution is available (and it is; see below :)).
Furthermore, XTS + HMAC would not protect against problem 3. Indeed, to counter it, one method is to add the expected number of chunks to the metadata of the encrypted file, and to protect this count in integrity. This method is not original; it is used in the [Merkle-Damgård cryptographic construction](https://en.wikipedia.org/wiki/Merkle%E2%80%93Damg%C3%A5rd_construction), notably by the SHA hash algorithms.
Each of these additions is another opportunity to make mistakes when implementing the encryption and decryption steps. And as stated at the beginning of this article, going off the beaten track is often synonymous with vulnerability.
Therefore, it would be better not to reinvent the wheel, and to use well-known cryptographic mechanisms and libraries to solve our large file problem.
## Cryptographic libraries
In Go, there are various high-level cryptographic libraries that are frequently used. Here I will talk about OpenPGP, which is problematic, and NaCl, which is to be preferred.
### OpenPGP
OpenPGP is a fairly [old encryption standard](https://www.rfc-editor.org/rfc/rfc4880.txt). Its main implementation is GnuPG, and it remains the darling of some misguided technicians. Yes, I'm thinking in particular of you, Linux distributions.
These harsh words against this format are, however, deserved. OpenPGP is a museum of horrors, full of antiquated mechanisms and cryptographic constructs from the infancy of authenticated encryption. Also, and not least, its implementers seem to have a passion for bad API ideas. In fact, the author of this article discovered [problems in most OpenPGP implementations in 2015](https://www.ssi.gouv.fr/uploads/2015/05/format-Oracles-on-OpenPGP.pdf), and some, in 2022, are still vulnerable to these findings... including GnuPG.
In Go, unsurprisingly, the OpenPGP implementation also contains some bad ideas. The package has even been [frozen and deprecated](https://github.com/golang/go/issues/44226), with the comment that it is not desirable for Go developers to use OpenPGP, as this format is "complex, fragile, and unsafe, and using it exposes applications to a dangerous ecosystem". To make the point, we will study one of its problems.
While it is true that it is fairly universal for data sources to implement [io.Reader](https://pkg.go.dev/io#Reader), it is possible to question the relevance of this choice for an encrypted data source whose integrity can only be verified after a complete pass.
One might expect the OpenPGP container [openpgp.MessageDetails](https://pkg.go.dev/golang.org/x/crypto/openpgp#MessageDetails) to perform this check on its own when instantiated with [openpgp.ReadMessage](https://pkg.go.dev/golang.org/x/crypto/openpgp#ReadMessage). This would be quite consistent with the `compress/gzip` API, whose [NewReader](https://pkg.go.dev/compress/gzip#NewReader) function returns an error if the "magic" bytes are missing at the beginning of the stream. Alas, as said before, OpenPGP is a museum of horrors, and it is not possible to check the integrity of the encrypted document up front; the entire document must first be decrypted in order to finally recover an integrity tag. Indeed, with the OpenPGP standard, the integrity tag (a simple SHA-1 of the cleartext) is part of the encrypted data, suffixed to the cleartext. This approach is called [MAC-then-encrypt](https://en.wikipedia.org/wiki/Authenticated_encryption#MAC-then-Encrypt_(MtE)) and is decried by the cryptographic community.
Although the `io.Reader` of `openpgp.MessageDetails` is stored in the aptly named `UnverifiedBody` field, it is extremely tempting for a developer to plug it into another `io.Reader`, like a series of decorators, and forget or discover too late that the message was not genuine!
### NaCl
[NaCl](https://pkg.go.dev/golang.org/x/crypto/nacl) is an excellent cryptographic library, whose well-designed API allows only the most stubborn of idiots to make mistakes in its use. There are some command-line tools built on it or on its fork [libsodium](https://github.com/jedisct1/libsodium). One of them is the excellent [minisign](https://github.com/jedisct1/minisign) utility, by Frank Denis. The author of this article highly recommends minisign as a replacement for OpenPGP for signing documents!
There are implementations of minisign in Go, such as [go-minisign](https://github.com/jedisct1/go-minisign), which unfortunately suffers from the same problem of handling large files that we are dealing with in this article. Fortunately it is possible to use go-minisign even for large files by using the tricks presented in this article, below.
Coming back to NaCl, the [box.Seal](https://pkg.go.dev/golang.org/x/crypto@v0.0.0-20211215153901-e495a2d5b3d3/nacl/box#Seal) and [box.Open](https://pkg.go.dev/golang.org/x/crypto@v0.0.0-20211215153901-e495a2d5b3d3/nacl/box#Open) functions have the particularity of not reading from an `io.Reader` and not writing to an `io.Writer`; they work on byte slices. They thus avoid the crude trap at the bottom of which we find OpenPGP. For large files, this could look like a blocking point. This article aims precisely at proposing a solution that circumvents this particularity, while offering a correct level of security.
## System tips and tricks to the rescue
### Control the release of data
As seen at the beginning of this article, it is important to control when the decrypted data is released; until the data is complete and verified, the working copy of the data must remain private. Since we are dealing with large files, which do not fit in RAM, it is necessary to store the working copy on the file system, while ensuring that no other process or task can access it. To do this, it is possible to use anonymous files.
Anonymous files are files that are stored on the file system without any link to them. By "link", understand here an entry in a directory: a hardlink. These files are created by passing the `O_TMPFILE` option to the [open(2)](https://man7.org/linux/man-pages/man2/open.2.html) syscall. Any byte written to such a file is actually stored on the file system, via the file descriptor returned by open(2) and known only to the program that created it (and to the processes that go poking around in /proc... but they are looking for trouble ;)). It is therefore a private copy of the decrypted file. When the file is complete and its content verified, it is then possible to publish it in different ways.
One way to publish the file in a not very elegant way is simply to create a new file, without the `O_TMPFILE` option, and then to copy the contents of the decrypted file into this new file which is accessible by the other processes. The file descriptor can then be closed, and the anonymous file will be automatically freed. This method is expensive and has the drawback of doubling the disk size needed to store the decrypted file, at least temporarily, until the file descriptor of the anonymous file is closed.
A more elegant way, which takes advantage of a feature that is not always available, is to use the [FICLONE](https://man7.org/linux/man-pages/man2/ioctl_ficlone.2.html) operation of the [ioctl(2)](https://man7.org/linux/man-pages/man2/ioctl.2.html) syscall. `FICLONE` uses the copy-on-write (COW) functionality of some file systems, such as [btrfs](https://btrfs.wiki.kernel.org/index.php/Main_Page). With this operation, it is possible to create a named file (one with a hardlink) and then request that this named file be a snapshot of the anonymous file. The two files then share the same blocks of data on the file system, until one of them changes a block; but in our case, there will be no subsequent write to the anonymous file after this call to ioctl(2). So this is simply a trick to link to the contents of the anonymous file, and thus publish it. The only drawback of this approach is that you have to use a file system that is compatible with `FICLONE`, and this is not the case with ext4, which is usually the default file system of Linux distributions.
Finally, there is a third method, also elegant, which does not rely on a feature specific to some file systems. Unfortunately, it requires a system privilege: [CAP_DAC_READ_SEARCH](https://man7.org/linux/man-pages/man7/capabilities.7.html). This capability bypasses file read and directory search permission checks, which is unfortunate, but it is also the privilege required to call the [linkat(2)](https://man7.org/linux/man-pages/man2/linkat.2.html) syscall with the `AT_EMPTY_PATH` option. This syscall, together with this option, allows the creation of a link from a file descriptor: it gives a name to our anonymous file, once it is complete. It may be acceptable to grant `CAP_DAC_READ_SEARCH` to our process if it runs in a chroot in which this permission does not allow the program to gain or keep undue access to system resources. This solution is therefore probably acceptable under certain conditions, which must however be well controlled.
## Decrypting a large file in virtual memory
Not everything that is in virtual memory is necessarily physical memory. Thus, it is possible to obtain a Go slice containing the contents of a file without copying it into RAM. In the same way, it is possible to write to a slice that is not stored in RAM, thanks to a syscall: [mmap(2)](https://man7.org/linux/man-pages/man2/mmap.2.html). We can therefore call `box.Seal` and `box.Open` on such slices, and the result will be computed without the contents of the files being stored in RAM!
Unfortunately, things are never that "simple". A few additional subtleties are required when writing to a slice that maps a file placed in virtual memory with mmap(2). First, the destination file must have the right size before mmap(2) is called; this can be arranged with the [fallocate(2)](https://man7.org/linux/man-pages/man2/fallocate.2.html) syscall. Then, once the writing to the slice is done, the [msync(2)](https://man7.org/linux/man-pages/man2/msync.2.html) syscall must be called to force the transfer of the data from virtual memory to the file, before calling munmap(2) to free the mapping created by mmap(2).
Similarly, mmap(2) should be used for the encrypted file, but there are some subtleties here as well. In particular, it is better to work on a private copy of the file rather than on one that can be altered by an external source. Indeed, the behavior is unspecified if an external program truncates the file after it has been passed to mmap(2).
Finally, when passing the slice receiving the decrypted data to `box.Open`, one must pass it as `slice[:0]`, in order to keep the capacity of the slice equal to the size of the file while forcing its length to 0. By doing this, `box.Open` will not reallocate the array underlying the slice. This trick is essential to keep working in the virtual memory returned by mmap(2), and not end up accidentally working in RAM.
## Bringing it all together in a coherent program
To summarize everything that has been discussed in this article:
* when dealing with a large file, it is better to use Linux's features wisely than to risk creating vulnerabilities by cutting the file into chunks yourself;
* it is important to always work with private copies of the data, both the encrypted and decrypted content;
* it is necessary to have control over the publication of the decrypted content, in particular to ensure that the content has integrity and is complete before making it available to third-party applications.
To accomplish these goals, it is possible to use the Linux syscalls open(2) (with `O_TMPFILE`), mmap(2), fallocate(2), msync(2), and ioctl(2) or linkat(2).
[This git repository](https://github.com/X-Cli/large-file-decrypt) contains a library using all these elements to encrypt and decrypt a large file in a secure way.