13. Persistence

RDFox® is a main-memory system, which means that a full copy of the data loaded into the system is always held in main memory to support query answering and other operations. It is, however, possible to configure RDFox to incrementally save various types of information to persistent storage so that it may be restored in a subsequent or concurrent session.

There are three reasons to enable persistence. The first is when RDFox is being used as the System of Record for the data it contains. In this case, persisting to disk is essential in order to achieve acceptable levels of data durability. The second reason is to improve restart performance. Using persistence achieves this because it is generally faster for RDFox to reload data stored in its native format than it is to import it from other formats, or to copy it from another data store. The third reason is so that changes to the data store can be continuously replicated amongst several RDFox instances. Only the file-sequence persistence type described in Section 13.2.2 supports this last goal.

When persistence is enabled, care must be taken to ensure that sufficient disk space is available for RDFox to persist new data, in addition to ensuring that there is sufficient memory.

RDFox’s persistence is compliant with the ACID properties described in Section 11.

Note

Persistence does not increase the capacity of an RDFox data store beyond what can be stored in memory.

The persisted state of an RDFox instance consists of the following types of information:

  • information about all data stores in the system (also known as the server catalog),

  • information about the roles used for access control, and

  • the data loaded into each data store.

Each version of the server catalog is uniquely identified by a server version number, and each distinct state of a data store is uniquely identified by a data store version number (see Section 4). In addition, each distinct list of roles used for access control is uniquely identified by a role manager version number.

13.1. Key Concepts

Data store persistence consists of a sequence of changes, each of which is a snapshot or a delta. A snapshot contains a full copy of the entire content of the data store, whereas a delta contains only the incremental changes introduced in a single transaction. The time and disk space needed to save a snapshot are proportional to the size of the full data store content, whereas the time and disk space needed to save a delta are proportional to the size of the change since the previous version. Persisted data of a data store consists of an initial snapshot, followed by a sequence of deltas and snapshots that detail the content of further transactions applied to the data store. Snapshots and deltas are created automatically by RDFox on any data store that uses persistence as follows.

  • When a transaction is committed and auto-compaction is not required (see Section 5.2.4 and Section 5.4.2.1), a delta describing the changes in the transaction is saved.

  • When a data store is compacted (either explicitly or automatically), a new snapshot of the data store is saved. Depending on how exactly persistence is configured and how compaction is performed, this snapshot either replaces all existing persisted data, or it is saved in addition to all existing changes. Either way, when the RDFox server is restarted, data store loading will start from the most recently saved snapshot. Thus, compacting a data store has the potential to reduce both the disk space used and the time needed to start an RDFox instance.
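The relationship between snapshots and deltas can be illustrated with a minimal Python sketch. It is not RDFox's on-disk format: the Snapshot and Delta classes and the restore function are hypothetical names used only to show why loading starts from the most recently saved snapshot and then replays the deltas that follow it.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    facts: frozenset  # full copy of the data store content


@dataclass
class Delta:
    added: frozenset = frozenset()    # facts inserted by one transaction
    removed: frozenset = frozenset()  # facts deleted by one transaction


def restore(changes):
    """Rebuild the content from a persisted sequence of changes."""
    # Loading starts from the most recently saved snapshot;
    # everything before it is redundant.
    start = max(i for i, c in enumerate(changes) if isinstance(c, Snapshot))
    facts = set(changes[start].facts)
    # Only deltas can follow the most recent snapshot.
    for delta in changes[start + 1:]:
        facts -= delta.removed
        facts |= delta.added
    return facts


history = [
    Snapshot(frozenset({"a", "b"})),       # initial snapshot
    Delta(added=frozenset({"c"})),         # transaction 1
    Delta(removed=frozenset({"a"})),       # transaction 2
    Snapshot(frozenset({"b", "c"})),       # compaction
    Delta(added=frozenset({"d"})),         # transaction 3
]
print(sorted(restore(history)))  # ['b', 'c', 'd']
```

In this sketch the first three entries are never read during restore, which is why compaction can reduce both restart time and, where old entries are removed, disk usage.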

The server catalog and the list of available roles are typically small, so they are always saved as snapshots.

13.2. Configuration

In order to use any form of persistence, an RDFox server must be configured with a server directory, which will be used to save and load persisted data. Persistence is controlled by the persistence server parameter, which can be set to off (no persistence), file, or file-sequence. The file and file-sequence options are described in the following sections. Even when persistence is enabled at the server level, it may be disabled at the data store level by setting the persistence data store parameter to off.

13.2.1. file persistence

The file persistence option stores the persisted content (the data store catalog, the list of available roles, and the content of a data store) in a single file. The data store catalog and the list of available roles are always saved as a snapshot. The content of a data store is saved as a snapshot followed by zero or more deltas. When a data store is compacted, the saved data is replaced by a fresh snapshot, which is extended by deltas as further transactions are committed on the data store. The process of compacting a data store first saves the current snapshot into a new file, and then it atomically replaces the old file with the new file. Consequently, compacting a data store eventually frees the storage occupied by any deltas written after the snapshot, but it may temporarily use additional disk space in order to hold both the old and the new files.

A server directory containing this persistence type can only be used by one RDFox server at a time. RDFox ensures that this is the case by seeking an exclusive lock on the directory when the server instance is created and exiting if the lock cannot be obtained.

RDFox will update the persisted data in a way that is in most cases resilient to RDFox crashing or the system losing power during the saving process. Specifically, if saving of a delta is interrupted for any reason, RDFox will undo any changes made to the data the next time the RDFox process is restarted; in this way, RDFox provides ACID guarantees for transaction updates.

While RDFox guarantees consistency of persisted data in the event of RDFox crashes and power failure, RDFox is not immune to external damage to persisted files. RDFox will attempt to detect such corruption as follows.

  • When data is encrypted, the encryption algorithm itself offers protection against corruption: decrypting a damaged file will produce data that has a very high chance to be detected by RDFox as invalid.

  • When data is not encrypted, RDFox will use the CRC64 checksum algorithm to detect data corruption.
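The checksum-based detection described above can be sketched as follows. This is an illustration only: the source does not say which CRC64 variant RDFox uses or where the checksum is stored, so the ECMA-182 polynomial, the zero initial value, and the trailing 8-byte layout used here are all assumptions.

```python
CRC64_POLY = 0x42F0E1EBA9EA3693  # ECMA-182 polynomial (assumed variant)
MASK64 = 0xFFFFFFFFFFFFFFFF


def crc64(data: bytes) -> int:
    """Bitwise CRC-64 over data (slow but easy to follow)."""
    crc = 0
    for byte in data:
        crc ^= byte << 56
        for _ in range(8):
            if crc & (1 << 63):
                crc = ((crc << 1) ^ CRC64_POLY) & MASK64
            else:
                crc = (crc << 1) & MASK64
    return crc


def write_record(payload: bytes) -> bytes:
    """Append the checksum to the payload (hypothetical layout)."""
    return payload + crc64(payload).to_bytes(8, "big")


def read_record(record: bytes) -> bytes:
    """Return the payload, or raise if the checksum does not match."""
    payload, stored = record[:-8], int.from_bytes(record[-8:], "big")
    if crc64(payload) != stored:
        raise ValueError("checksum mismatch: persisted data is corrupt")
    return payload


record = write_record(b"persisted delta")
damaged = bytes([record[0] ^ 0x01]) + record[1:]  # flip a single bit
print(read_record(record))
try:
    read_record(damaged)
except ValueError as error:
    print(error)
```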

RDFox will refuse to start if corruption is detected in any part of the persisted data. In such cases, the only possible course of action is to restore a recent state of the RDFox database from backup. Consequently, it is highly recommended to create periodic backups of the entire server directory. It is safe to create copies of the server directory even if an RDFox instance is running, provided that the RDFox instance does not write any data during the backup period. If RDFox tries to save a transaction or change the server catalog while a backup is in progress, the backed up data may be invalid (i.e., it cannot be used in future to restore the state of an RDFox server). Consequently, backups of the server directory should only be taken during a maintenance window in which no read/write transactions are performed.

13.2.1.1. System Requirements

In order to exclusively lock the server directory, file persistence uses the flock system call with the LOCK_EX flag on Linux, and the CreateFileW system call with the dwShareMode set to 0 on Windows. In both cases, the underlying file system must faithfully and correctly support the locking semantics of those calls.
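The Linux locking behaviour can be explored with a short Python sketch that calls flock with LOCK_EX on a directory. The lock_directory function is a hypothetical helper, and the use of LOCK_NB (fail immediately rather than wait) is an assumption made so that a second attempt is visibly rejected. The sketch runs only on POSIX systems.

```python
import fcntl
import os
import tempfile


def lock_directory(path: str) -> int:
    """Take an exclusive flock on a directory; return the open descriptor.

    Raises BlockingIOError if another open file description holds the lock.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # LOCK_NB is an assumption: fail at once instead of blocking.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        raise
    return fd


server_directory = tempfile.mkdtemp()
first = lock_directory(server_directory)
try:
    lock_directory(server_directory)
except BlockingIOError:
    print("directory is already locked")
os.close(first)  # closing the descriptor releases the lock
```

Running a check like this against the target file system is one way to confirm that it honours flock semantics before relying on it.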

Correctness of file persistence relies on the following important system-level considerations.

  • To guard against sudden power failure, RDFox writes data in multiples of disk sector size. However, determining the sector size programmatically typically requires administrative privileges in modern OSes. Consequently, RDFox relies on users to configure the sector size correctly. RDFox will function correctly as long as its configured sector size is a multiple of the actual sector size; however, using a sector size that is larger than what is strictly necessary for the disk may waste a very small amount of storage per transaction. Most disks available nowadays on the market use sectors of 512 or 4096 bytes, so RDFox uses a sector size of 4096 by default as this ensures correctness on commonly used hardware. If RDFox is used on a disk with a different sector size, the correct sector size must be set explicitly using the persistence.disk-sector-size server parameter.

  • RDFox relies on system calls that ensure that the data is persisted on disk (FlushFileBuffers on Windows, fcntl with the F_BARRIERFSYNC option on macOS, and fdatasync on Linux). It is well documented that certain disks and disk drivers will “lie” to the operating system; for example, some disks will report that the data has been fully persisted even if the data has not yet been flushed from the disk controller’s cache. Modern operating systems do not provide a way of detecting such situations, and so RDFox has no choice but to “believe” the operating system. If RDFox is used with a disk that “lies” about persistence, data can be lost in case of unexpected power failure or kernel crash. Please check with your disk’s manufacturer whether their product is safe to be used in a transactional application.

  • On macOS, RDFox uses fcntl with the F_BARRIERFSYNC option to synchronize data with external storage. This system call is well known to not offer hard persistence guarantees, and in fact it was observed in practice that the data can be kept in disk buffers for a few seconds after the system call is issued. The F_FULLFSYNC option offers stronger persistence guarantees, but is known to cause considerable slowdown and can introduce considerable wear and tear with Apple’s SSDs; moreover, even that system call does not completely guarantee no data loss in case of power failure. Please refer to Apple’s documentation about these system calls and their recommendation to use F_BARRIERFSYNC. Consequently, persisted data is not 100% safe from power failure on macOS. However, in our experience, Mac computers are rarely used to run production-grade databases, and moreover Mac laptops (which are the most common form of Mac computers) are equipped with a battery that considerably reduces the chances of sudden power failure. Thus, relaxing consistency in order to improve performance and reduce wear and tear is acceptable in typical usage scenarios of RDFox on macOS. Please contact Oxford Semantic Technologies if you plan to use RDFox in production on macOS.
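The sector-size rule from the first consideration above can be made concrete with a little arithmetic. The two functions below are hypothetical helpers, not part of RDFox; they show how a write is rounded up to whole sectors and why a configured size must be a multiple of the real one.

```python
def padded_size(payload_bytes: int, sector_size: int = 4096) -> int:
    """Bytes occupied when a write is rounded up to whole sectors."""
    sectors = -(-payload_bytes // sector_size)  # ceiling division
    return sectors * sector_size


def is_safe_sector_size(configured: int, actual: int) -> bool:
    """The configured size is safe if it is a multiple of the actual size."""
    return configured % actual == 0


# A 5000-byte write with the default setting versus a 512-byte setting.
print(padded_size(5000, 4096))  # 8192
print(padded_size(5000, 512))   # 5120

# The default of 4096 is safe on disks with 512-byte or 4096-byte sectors.
print(is_safe_sector_size(4096, 512))   # True
print(is_safe_sector_size(512, 4096))   # False
```

The difference between the two padded sizes is the small per-transaction overhead of using a larger sector size than the disk strictly requires.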

13.2.2. file-sequence persistence

The file-sequence persistence option stores the persisted content (whether it is the data store catalog, roles database, or the content of a data store) in a sequence of files with one file per version. The path of each file is determined by the relevant version number (server, role manager or data store).

Unlike with file persistence, a server directory using file-sequence persistence may be shared by several RDFox servers at once. So long as the underlying file system meets the criteria described in Section 13.2.2.1, any modification successfully made via any of these instances will be replicated to the other instances eventually. This provides the basis for deploying RDFox in a leaderless, high availability (HA) configuration. Please refer to Section 13.2.2.2 for information about replication lag and Section 21 for details of how to set up HA deployments.

Also unlike the file persistence type, file-sequence server directories can be backed up during write operations, removing the need for maintenance windows in order to collect backups. This is because the files that form the file sequence never contain partial deltas.

When data stores persisted with this file type are compacted, old snapshots and deltas are not automatically deleted from the disk. This is because deleting the files that contain these records could cause other RDFox instances restoring the file sequence to diverge from the committed version history. For example, if an instance has not yet restored content from the file corresponding to version v, where v is less than the current data store version, and the file is deleted, that instance will restore versions up to v-1 and then become available to accept writes. Since the path reserved for version v is empty, a write via this instance will succeed, creating a divergence in the version history that could lead to data loss.

The problem described above cannot happen if all of the running instances that share a server directory have the globally highest version of a data store when a compaction on the store begins. To allow disk space to be freed for data stores configured to use file-sequence persistence, compact supports an optional argument indicating that redundant files should be deleted. Before using this option, operators must take measures to ensure that all running instances are consistent with respect to the data store version. This could be achieved by shutting down all but one of the instances or by blocking all inbound write requests and waiting until all instances report the same version number for the data store. In this latter case, it is advisable to keep the block on write traffic in place until the compaction operation has completed.
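The second approach (block writes, then wait for agreement) amounts to a simple check before compacting. The sketch below assumes the operator has already collected the data store version reported by each running instance; the function name and the instance labels are hypothetical, and how the versions are gathered is left out.

```python
def safe_to_delete_redundant_files(reported_versions: dict) -> bool:
    """True only if every running instance reports the same version.

    This check is meaningful only while inbound writes are blocked;
    otherwise a new version could appear right after the check.
    """
    versions = set(reported_versions.values())
    return len(reported_versions) > 0 and len(versions) == 1


# Hypothetical readings taken while write traffic is blocked.
print(safe_to_delete_redundant_files({"a": 41, "b": 42, "c": 42}))  # False
print(safe_to_delete_redundant_files({"a": 42, "b": 42, "c": 42}))  # True
```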

Warning

The option to delete redundant files during a compaction can corrupt your data store if other replicas are in the process of restoring from a file sequence.

Note

In order to use the file-sequence persistence option, the license key provided at startup must explicitly support the feature.

Note

The file-sequence persistence option is EXPERIMENTAL on Windows and macOS.

13.2.2.1. System Requirements

In order to make it safe for any of the RDFox instances sharing a server directory to persist new transactions, each instance must add files to file sequences in a way that will fail if the version they are trying to create has already been created by another instance. This requires that the file system containing the server directory supports an atomic move or link operation that returns a distinct error code when the target path is already occupied.

The Network File System (NFS) protocol meets the above requirement through the LINK operation (documented in Section 18.9 of RFC5661 <https://datatracker.ietf.org/doc/html/rfc5661#section-18.9>). Oxford Semantic Technologies has successfully tested the file-sequence persistence option under sustained write contention on the following managed NFS services:

In each case, testing was performed by provisioning an instance of the file system and mounting it to three Linux hosts in separate availability zones using the mounting procedure recommended by the service provider. An instance of RDFox was then started in shell mode on each host with the server-directory parameter pointing to the same directory on the NFS file system. Using one of the instances, a data store was created with the following command:

dstore create default

Next, the following commands were run on each host:

set output out
rwtest 500 5000

The rwtest shell command has been designed specifically to detect replication errors and is described fully in Section 15.2.39. When invoked as shown, the test attempts to commit a transaction every 2.75 s on average. Running the command on three instances simultaneously results in frequent write contention events.

After 72 hours, the rwtest command was interrupted by issuing Ctrl-C in the terminal on each host. This produces a final report that shows the total number of transactions successfully committed by each instance. The sum of these numbers was found to match the data store version minus 1 (the initial data store version) as expected. If more than one of the instances had concluded that it had successfully created any given version, the sum of these numbers would be higher. If in any iteration of the test loop on any of the three instances the content of the data store differed from the expected content, which is known for each data store version, the test would have stopped with an error.

The above procedure constitutes a minimum test for qualifying file systems (and the associated configuration options) for production use in scenarios where write contention may occur. Users planning deployments of RDFox that use the file-sequence persistence option are advised to conduct their own testing using this procedure. The degree of write contention can be varied in the test by changing the numeric parameters to the command which represent the minimum and maximum duration in milliseconds between iterations of the test loop.
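The two numeric checks used in this procedure can be written down as a short Python sketch. It assumes that the delay between iterations is spread evenly between the minimum and maximum values, which is consistent with the 2.75 s average quoted above but is not stated explicitly; the function names and the per-instance commit counts are hypothetical.

```python
def mean_interval_ms(min_ms: int, max_ms: int) -> float:
    """Average delay between test iterations, assuming an even spread."""
    return (min_ms + max_ms) / 2


def history_is_consistent(commits_per_instance, data_store_version,
                          initial_version=1):
    """Each version after the initial one must have exactly one committer."""
    return sum(commits_per_instance) == data_store_version - initial_version


print(mean_interval_ms(500, 5000))  # 2750.0

# Hypothetical final reports from three instances.
print(history_is_consistent([10, 12, 8], data_store_version=31))  # True
print(history_is_consistent([10, 12, 9], data_store_version=31))  # False
```

A sum larger than the version difference, as in the last line, would indicate that two instances both believed they had created the same version.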

It is worth noting that the atomic operation described above is only required in situations where there is a risk of write contention and that a broader range of file systems may be safe to use under the constraint that write contention will not occur. This can be achieved by ensuring (externally) that all writes will be committed via the same nominated instance. Some approaches to this are reviewed in Section 21.3. To qualify file systems for use in such setups, the rwtest command can be invoked with the read-only argument on all but one of the hosts.

Note

On UNIX operating systems, RDFox uses the link system call as the atomic commit operation, while on Windows MoveFileW is used. The EEXIST (UNIX) and ERROR_ALREADY_EXISTS (Windows) error codes are interpreted to mean that the commit has failed because another instance successfully committed a change first.
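The commit-by-link idea can be tried out with a small Python sketch for POSIX systems. It is not RDFox's implementation: the commit_version function, the temporary file handling, and the "<version>.delta" file naming are all hypothetical. The point it illustrates is that a hard link to an occupied path fails with a distinct error, so only one writer can claim a given version.

```python
import os
import tempfile


def commit_version(directory: str, version: int, payload: bytes) -> bool:
    """Try to publish payload as the file for `version`.

    Returns False if another writer has already created that version.
    """
    final_path = os.path.join(directory, f"{version}.delta")  # hypothetical name
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            tmp_file.write(payload)
            tmp_file.flush()
            os.fsync(tmp_file.fileno())  # make the content durable first
        try:
            os.link(tmp_path, final_path)  # atomic: fails if the path exists
        except FileExistsError:
            return False
        return True
    finally:
        os.unlink(tmp_path)  # the temporary name is no longer needed


shared = tempfile.mkdtemp()
print(commit_version(shared, 2, b"from instance A"))  # True
print(commit_version(shared, 2, b"from instance B"))  # False
```

Because the file is fully written before it is linked into place, a reader never observes a partially written version file in this sketch.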

13.2.2.2. Replication Performance

In order for a change to be replicated between instances, the instance writing the change must successfully commit it to the path reserved for the new version number, and the other instances must then discover it and apply the changes to their own in-memory copies of the data. The time taken for this process is called replication lag.

In all, there are three mechanisms by which a receiving instance can discover new version files. The first is through polling. The poll interval is controlled by the file-system-poll-interval server parameter which has a default of 60 s. A separate poll interval is established for each persisted component. This means that, in the case of a server with three persisted data stores, the file system will be polled five times in each interval: once for new server versions, once for new role manager versions, and once for each of the three data stores. It is generally desirable to keep polling intervals long to minimise load on the file system so, while this mechanism helps bound worst-case replication lag, it is unsuitable for achieving low average-case replication lag.
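The polling load described above follows directly from the number of persisted components. The helper functions below are a hypothetical illustration of that arithmetic, using the default interval of 60 s.

```python
def polls_per_interval(data_store_count: int) -> int:
    """One poll each for the server catalog and role manager, plus one per data store."""
    return 2 + data_store_count


def polls_per_hour(data_store_count: int, poll_interval_s: float = 60) -> float:
    """File system polls issued by one instance in an hour."""
    return polls_per_interval(data_store_count) * 3600 / poll_interval_s


print(polls_per_interval(3))  # 5
print(polls_per_hour(3))      # 300.0
```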

The second mechanism by which new version files are discovered is when a commit fails because the local version has fallen behind the highest persisted version. In this case, the instance will apply the new version and any others that have been created since as soon as the failed transaction has been rolled back. This mechanism is provided to reduce the time taken for an instance to catch up with the latest changes after a commit fails due to write contention. It will not, in most cases, be useful for achieving low average-case replication lag.

The third mechanism for new version files to be discovered is by notification over UDP. This mechanism gives the lowest possible average-case replication lag. To activate this mechanism, the notifications-address server parameter must be set to a host name and UDP port number separated by a plus (+) symbol. An instance configured this way will register itself for notifications by writing the given notifications address to the server directory itself. For the mechanism to work, the host name must be resolvable by the other instances that share the server directory and UDP packets must be able to flow freely from the other instances to the specified port. With this mechanism in place, it should be possible to achieve sub-second replication lag for instances within the same data center assuming that changes are small.
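The connectivity requirements for this mechanism can be checked with a short Python sketch. It parses an address in the host-plus-port form described above and sends a single UDP datagram to it. The function names, the example host name, and the payload are hypothetical; the actual notification message format used by RDFox is not described here.

```python
import socket


def parse_notifications_address(value: str):
    """Split 'host+port' into a (host, port) pair."""
    host, separator, port = value.rpartition("+")
    if not separator or not host or not port.isdigit():
        raise ValueError(f"expected <host>+<port>, got {value!r}")
    return host, int(port)


def send_test_datagram(address: str, payload: bytes = b"ping") -> None:
    """Send one UDP datagram to check that packets can reach the address."""
    host, port = parse_notifications_address(address)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        sender.sendto(payload, (host, port))


print(parse_notifications_address("replica-1.example.com+9000"))
```

Calling send_test_datagram from each of the other hosts, with a listener bound to the chosen port, is one way to confirm that the host name resolves and that UDP traffic is not blocked.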

13.2.3. Encryption

When using any form of persistence, an RDFox server can be configured to encrypt and decrypt the data stored in the server directory by supplying a base64-encoded key via the persistence.encryption-key server parameter. By default, the AES-256-CBC cipher, which requires a 256-bit AES key, is used, but this can be changed by setting the persistence.encryption-algorithm parameter. See the full documentation of the above parameters in Section 4.3 for more details.
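A key of the right size for the default cipher can be produced with a few lines of Python. This is a sketch: generate_encryption_key is a hypothetical helper, and the use of the standard (not URL-safe) base64 alphabet is an assumption that should be checked against the parameter documentation.

```python
import base64
import secrets


def generate_encryption_key(bits: int = 256) -> str:
    """Return a random key of the given size, base64-encoded."""
    raw_key = secrets.token_bytes(bits // 8)  # cryptographically strong bytes
    return base64.b64encode(raw_key).decode("ascii")


key = generate_encryption_key()
print(len(base64.b64decode(key)) * 8)  # 256
```

The resulting string is what would be supplied as the value of persistence.encryption-key. It should be stored securely, since data persisted with it cannot be decrypted without it.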