RDFox is a main-memory system, which means that a full copy of the data loaded into the system is always held in main memory to support query answering and other operations. It is, however, possible to configure RDFox to incrementally save various types of information to persistent storage so that it may be restored in a subsequent or concurrent session.
There are three reasons to enable persistence. The first is when RDFox is being
used as the System of Record for the data it contains. In
this case, persisting to disk is essential in order to achieve acceptable
levels of data durability. The second reason is to improve restart performance.
Using persistence achieves this because it is generally faster for RDFox to
reload data stored in its native format than it is to import it from other
formats, or to copy it from another data store. The third reason is so that
changes to the data store can be continuously replicated amongst several RDFox
instances. Only the
file-sequence persistence type described in
Section 13.2.2 supports this last goal.
When persistence is enabled, care must be taken to ensure that sufficient disk space is available for RDFox to persist new data, as well as sufficient memory.
RDFox’s persistence is compliant with the ACID properties described in Section 11.
Persistence does not increase the capacity of an RDFox data store beyond what can be stored in memory.
The persisted state of an RDFox instance consists of the following types of information:

- information about all data stores in the system (also known as the server catalog),
- information about the roles used for access control, and
- the data loaded into each data store.
Each version of the server catalog is uniquely identified by a server version number, and each distinct state of a data store is uniquely identified by a data store version number (see Section 4). In addition, each distinct list of roles used for access control is uniquely identified by a role manager version number.
13.1. Key Concepts
Data store persistence consists of a sequence of changes, each of which is a snapshot or a delta. A snapshot contains a full copy of the entire content of the data store, whereas a delta contains only the incremental changes introduced in a single transaction. The time and disk space needed to save a snapshot are proportional to the size of the full data store content, whereas the time and disk space needed to save a delta are proportional to the size of the change since the previous version. Persisted data of a data store consists of an initial snapshot, followed by a sequence of deltas and snapshots that detail the content of further transactions applied to the data store. Snapshots and deltas are created automatically by RDFox on any data store that uses persistence as follows.
When a data store is compacted (either explicitly or automatically), a new snapshot of the data store is saved. Depending on how exactly persistence is configured and how compaction is performed, this snapshot either replaces all existing persisted data, or it is saved in addition to all existing changes. Either way, when the RDFox server is restarted, data store loading will start from the most recently saved snapshot. Thus, compacting a data store has the potential to reduce both the disk space used and the time needed to start an RDFox instance.
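Restoring a data store from its persisted changes can be illustrated with a toy model: replay the most recent snapshot, then apply each subsequent delta. The function and data structures below are illustrative only, not RDFox internals.

```python
def restore(log):
    """log is a list of ("snapshot", contents) or ("delta", added, removed) entries."""
    state = set()
    start = 0
    # Loading starts from the most recently saved snapshot.
    for i, entry in enumerate(log):
        if entry[0] == "snapshot":
            start = i
    for entry in log[start:]:
        if entry[0] == "snapshot":
            state = set(entry[1])
        else:
            _, added, removed = entry
            state |= added
            state -= removed
    return state

log = [
    ("snapshot", {"a", "b"}),       # initial snapshot
    ("delta", {"c"}, set()),        # transaction 1: add c
    ("delta", {"d"}, {"a"}),        # transaction 2: add d, remove a
    ("snapshot", {"b", "c", "d"}),  # compaction writes a fresh snapshot
    ("delta", {"e"}, set()),        # transaction 3: add e
]
assert restore(log) == {"b", "c", "d", "e"}
```

Because replay starts at the last snapshot, the deltas written before the compaction are never read again, which is why compaction can reduce both disk usage and startup time.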
The server catalog and the list of available roles are typically small, so they are always saved as snapshots.
13.2. Persistence Types

In order to use any form of persistence, an RDFox server must be configured with a server directory, which will be used to save and load persisted data. Persistence is controlled by the persistence server parameter, which can be set to off (no persistence), file, or file-sequence. The file and file-sequence options are described in the following sections. Even when persistence is enabled at the server level, it may be disabled at the data store level by setting the persistence data store parameter to off.
13.2.1. File Persistence

The file persistence option stores the persisted content (the server catalog, the list of available roles, and the content of a data store) in a single file. The server catalog and the list of available roles are always saved as a snapshot. The content of a data store is saved as a snapshot followed by zero or more deltas. When a data store is compacted, the saved data is replaced by a fresh snapshot, which is extended by deltas as further transactions are committed on the data store. The process of compacting a data store first saves the current snapshot into a new file, and then it atomically replaces the old file with the new file. Consequently, compacting a data store eventually frees the storage occupied by any deltas written after the snapshot, but it may temporarily use additional disk space in order to hold both the old and the new files.
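The write-new-then-atomically-replace pattern described above can be sketched with os.replace, which is atomic on POSIX file systems. This is a simplified sketch, not RDFox's actual implementation.

```python
import os
import tempfile

def compact(path: str, snapshot: bytes) -> None:
    # Write the fresh snapshot to a temporary file in the same directory
    # (a rename is only atomic within a single file system)...
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(snapshot)
            f.flush()
            os.fsync(f.fileno())
        # ...then atomically replace the old file. Readers see either the
        # old file or the new one, never a partially written mixture.
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise

with tempfile.TemporaryDirectory() as d:
    store = os.path.join(d, "store.dat")
    with open(store, "wb") as f:
        f.write(b"old snapshot + deltas")
    compact(store, b"fresh snapshot")
    with open(store, "rb") as f:
        assert f.read() == b"fresh snapshot"
```

Note that both files exist until os.replace completes, which is why compaction may temporarily need extra disk space.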
A server directory containing this persistence type can only be used by one RDFox server at a time. RDFox ensures that this is the case by seeking an exclusive lock on the directory when the server instance is created and exiting if the lock cannot be obtained.
RDFox will update the persisted data in a way that is in most cases resilient to RDFox crashing or the system losing power during the saving process. Specifically, if saving of a delta is interrupted for any reason, RDFox will undo any changes made to the data the next time the RDFox process is restarted; in this way, RDFox provides ACID guarantees for transaction updates.
While RDFox guarantees consistency of persisted data in the event of RDFox crashes and power failures, it is not immune to external damage to persisted files. RDFox will attempt to detect such corruption as follows.
When data is encrypted, the encryption algorithm itself offers protection against corruption: decrypting a damaged file will produce data that is very likely to be detected by RDFox as invalid.
When data is not encrypted, RDFox will use the CRC64 checksum algorithm to detect data corruption.
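The way a stored checksum exposes corruption can be sketched as follows. The exact CRC64 variant RDFox uses is not specified here; this sketch uses the CRC-64/XZ parameters purely for illustration.

```python
def crc64(data: bytes) -> int:
    # Bitwise CRC-64 using the CRC-64/XZ parameters (reflected polynomial
    # 0xC96C5795D7870F42, initial value and final XOR of all ones).
    # Illustrative only -- RDFox's exact CRC64 variant may differ.
    crc = 0xFFFFFFFFFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xC96C5795D7870F42 if crc & 1 else crc >> 1
    return crc ^ 0xFFFFFFFFFFFFFFFF

record = b"some persisted delta"
stored = crc64(record)             # checksum saved alongside the data
assert crc64(record) == stored     # intact data verifies
damaged = b"some persisted delt4"  # a single corrupted byte...
assert crc64(damaged) != stored    # ...changes the checksum
```

On restart, recomputing the checksum of each persisted record and comparing it with the stored value reveals whether the file was externally modified.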
RDFox will refuse to start if corruption is detected in any part of the persisted data. In such cases, the only possible course of action is to restore a recent state of the RDFox database from backup. Consequently, it is highly recommended to create periodic backups of the entire server directory. It is safe to create copies of the server directory even if an RDFox instance is running, provided that the RDFox instance does not write any data during the backup period. If RDFox tries to save a transaction or change the server catalog while a backup is in progress, the backed up data may be invalid (i.e., it cannot be used in future to restore the state of an RDFox server). Consequently, backups of the server directory should only be taken during a maintenance window in which no read/write transactions are performed.
13.2.1.1. System Requirements
In order to exclusively lock the server directory, file persistence uses the flock system call with the LOCK_EX flag on Linux, and the CreateFileW system call with dwShareMode set to 0 on Windows. In both cases, the underlying file system must faithfully and correctly support the locking semantics of those calls.
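On Linux, the lock acquisition can be sketched using Python's fcntl module. This is a simplified sketch of the flock/LOCK_EX mechanism, not RDFox's actual code, and it is Unix-only.

```python
import fcntl
import os
import tempfile

def lock_server_directory(path: str) -> int:
    # Try to take an exclusive, non-blocking flock on the directory.
    # Raises BlockingIOError if another holder already has the lock.
    fd = os.open(path, os.O_RDONLY)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        raise
    return fd  # keep the fd open for as long as the lock must be held

with tempfile.TemporaryDirectory() as d:
    fd = lock_server_directory(d)
    try:
        lock_server_directory(d)   # a second server must fail to start
    except BlockingIOError:
        pass
    else:
        raise AssertionError("expected the directory to be locked")
    os.close(fd)                   # closing the fd releases the lock
```

RDFox exits rather than retrying when the lock cannot be obtained, which corresponds to the non-blocking failure path above.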
In addition, file persistence relies on the following important assumptions.
To guard against sudden power failure, RDFox writes data in multiples of the disk sector size. However, determining the sector size programmatically typically requires administrative privileges on modern operating systems, so RDFox relies on users to configure the sector size correctly. RDFox will function correctly as long as its configured sector size is a multiple of the actual sector size; however, using a sector size larger than strictly necessary for the disk may waste a very small amount of storage per transaction. Most disks on the market use sectors of 512 or 4096 bytes, so RDFox uses a sector size of 4096 bytes by default, as this ensures correctness on commonly used hardware. If RDFox is used on a disk with a different sector size, the correct sector size must be set explicitly using the corresponding server parameter.
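Writing in whole sectors means rounding each write up to a multiple of the sector size. A minimal sketch of this padding arithmetic, assuming the default of 4096 bytes:

```python
SECTOR_SIZE = 4096  # RDFox's default; must be a multiple of the disk's real sector size

def pad_to_sector(data: bytes, sector_size: int = SECTOR_SIZE) -> bytes:
    # Round the write up to a whole number of sectors so a torn write
    # during power failure can never leave a partially written sector
    # straddling two records.
    padding = (-len(data)) % sector_size
    return data + b"\x00" * padding

assert len(pad_to_sector(b"x" * 100)) == 4096
assert len(pad_to_sector(b"x" * 4096)) == 4096
assert len(pad_to_sector(b"x" * 4097)) == 8192
```

The wasted space per transaction is at most one sector minus one byte, which is why a conservatively large sector size costs very little.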
RDFox relies on system calls that ensure that the data is persisted on disk (the F_BARRIERFSYNC option on macOS, and fdatasync on Linux). It is well documented that certain disks and disk drivers will “lie” to the operating system; for example, some disks will report that the data has been fully persisted even if the data has not yet been flushed from the disk controller’s cache. Modern operating systems do not provide a way of detecting such situations, and so RDFox has no choice but to “believe” the operating system. If RDFox is used with a disk that “lies” about persistence, data can be lost in case of unexpected power failure or kernel crash. Please check with your disk’s manufacturer whether their product is safe to use in a transactional application.
On macOS, RDFox uses the F_BARRIERFSYNC option to synchronize data with external storage. This system call is known not to offer hard persistence guarantees, and in fact it has been observed in practice that data can remain in disk buffers for a few seconds after the call is issued. The F_FULLFSYNC option offers stronger persistence guarantees, but is known to cause considerable slowdown and wear with Apple’s SSDs; moreover, even that system call does not completely guarantee no data loss in case of power failure. Please refer to Apple’s documentation about these system calls and their recommendation to use F_BARRIERFSYNC. Consequently, persisted data is not 100% safe from power failure on macOS. However, in our experience, Mac computers are rarely used to run production-grade databases; moreover, Mac laptops (the most common form of Mac computer) are equipped with a battery that considerably reduces the chance of sudden power failure. Thus, relaxing consistency in order to improve performance and reduce drive wear is acceptable in typical usage scenarios of RDFox on macOS. Please contact Oxford Semantic Technologies if you plan to use RDFox in production on macOS.
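The write-then-synchronize pattern that these system calls support can be sketched as follows. This is a Unix-only illustration of the general technique, not RDFox's implementation; on macOS the synchronization would be requested via fcntl with F_BARRIERFSYNC or F_FULLFSYNC rather than plain fsync.

```python
import os
import tempfile

def durable_append(path: str, payload: bytes) -> None:
    # The commit is only acknowledged after the kernel reports the bytes
    # as flushed to the device.
    with open(path, "ab") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())      # os.fdatasync(f.fileno()) suffices on Linux
    # For newly created files, the directory entry must be persisted too.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "delta.log")
    durable_append(p, b"delta-1\n")
    durable_append(p, b"delta-2\n")
    with open(p, "rb") as f:
        assert f.read() == b"delta-1\ndelta-2\n"
```

As the text notes, these calls are only as trustworthy as the disk beneath them: if the hardware acknowledges the flush before the data leaves its cache, no software pattern can recover the lost bytes.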
13.2.2. File-Sequence Persistence

The file-sequence persistence option stores the persisted content (whether it is the server catalog, the roles database, or the content of a data store) in a sequence of files with one file per version. The path of each file is determined by the relevant version number (server, role manager, or data store).
Unlike file persistence, a server directory using file-sequence persistence may be shared by several RDFox servers at once. So long as the underlying file system meets the criteria described in Section 13.2.2.1, any modification successfully made via any of these instances will eventually be replicated to the other instances. This provides the basis for deploying RDFox in a leaderless, high-availability (HA) configuration. Please refer to Section 13.2.2.2 for information about replication lag and Section 21 for details of how to set up HA deployments.
Also unlike the
file persistence type,
file-sequence server directories
can be backed up during write operations, removing the need for maintenance
windows in order to collect backups. This is because the files that form the
file sequence never contain partial deltas.
When data stores persisted with this persistence type are compacted, old snapshots and deltas are not automatically deleted from the disk. This is because deleting the files that contain these records could cause other RDFox instances restoring the file sequence to diverge from the committed version history. For example, if an instance has not yet restored content from the file corresponding to version v, where v is less than the current data store version, and the file is deleted, that instance will restore versions up to v-1 and then become available to accept writes. Since the path reserved for version v is empty, a write via this instance will succeed, creating a divergence in the version history that could lead to data loss.
The problem described above cannot happen if all of the running instances that
share a server directory have the globally highest version of a data store when
a compaction on the store begins. To allow disk space to be freed for data
stores configured to use
file-sequence persistence, compact supports an
optional argument indicating that redundant files should be deleted. Before
using this option, operators must take measures to ensure that all running
instances are consistent with respect to the data store version. This could be
achieved by shutting down all but one of the instances or by blocking all
inbound write requests and waiting until all instances report the same version
number for the data store. In this latter case, it is advisable to keep the
block on write traffic in place until the compaction operation has completed.
The option to delete redundant files during a compaction can corrupt your data store if other replicas are in the process of restoring from a file sequence.
In order to use the file-sequence persistence option, the license key provided at startup must explicitly support the feature. The file-sequence persistence option is EXPERIMENTAL on Windows and macOS.
13.2.2.1. System Requirements
In order to make it safe for any of the RDFox instances sharing a server directory to persist new transactions, each instance must add files to file sequences in a way that will fail if the version they are trying to create has already been created by another instance. This requires that the file system containing the server directory supports an atomic move or link operation that returns a distinct error code when the target path is already occupied.
The Network File System (NFS) protocol meets the above requirement through the
LINK operation (documented in Section 18.9 of RFC 5661
<https://datatracker.ietf.org/doc/html/rfc5661#section-18.9>). Oxford Semantic
Technologies has successfully tested the file-sequence persistence option
under sustained write contention on several managed NFS services.
In each case, testing was performed by provisioning an instance of the file
system and mounting it to three Linux hosts in separate availability zones
using the mounting procedure recommended by the service provider. An instance
of RDFox was then started in
shell mode on each host with the
server-directory parameter pointing to the same directory on the NFS file
system. Using one of the instances, a data store was created with the following command:
dstore create default
Next, the following commands were run on each host:
set output out
rwtest 500 5000
The rwtest shell command has been designed specifically to detect
replication errors and is described fully in Section 15.2.39. When invoked as
shown, the test attempts to commit a transaction every 2.75 s on average.
Running the command on three instances simultaneously results in frequent write contention.
After 72 hours, the rwtest command was interrupted from the terminal on each
host. This produces a final report that shows the total
number of transactions successfully committed by each instance. The sum of
these numbers was found to match the data store version minus 1 (the initial
data store version) as expected. If more than one of the instances had
concluded that it had successfully created any given version, the sum of these
numbers would be higher. If in any iteration of the test loop on any of the
three instances the content of the data store differed from the expected
content, which is known for each data store version, the test would have
stopped with an error.
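The invariant checked by this report (the per-instance commit counts sum to the final data store version minus the initial one, because at most one instance can claim each version) can be reproduced with a toy model of the link-based commit protocol. File names and protocol details below are invented for illustration.

```python
import os
import random
import tempfile

def try_commit(seq_dir: str, version: int, payload: bytes) -> bool:
    # Stage the change under a unique name, then try to claim the path
    # reserved for the next version with an atomic link. FileExistsError
    # (EEXIST) means another instance won that version.
    staged = os.path.join(seq_dir, f"staged-{version}-{payload.hex()}")
    with open(staged, "wb") as f:
        f.write(payload)
    try:
        os.link(staged, os.path.join(seq_dir, f"v{version}"))
        return True
    except FileExistsError:
        return False
    finally:
        os.unlink(staged)

with tempfile.TemporaryDirectory() as d:
    committed = {"A": 0, "B": 0, "C": 0}
    next_version = 2    # version 1 is the initial data store version
    for _ in range(30):
        # All three "instances" race to create the same version...
        for name in random.sample(sorted(committed), 3):
            if try_commit(d, next_version, name.encode()):
                committed[name] += 1
        next_version += 1   # ...and exactly one claim per version succeeds.
    last_committed = next_version - 1
    # Sum of commits across instances == final version - 1 (initial version).
    assert sum(committed.values()) == last_committed - 1
```

If two instances could ever both believe they created the same version, the sum would exceed the final version number, exactly the failure the 72-hour test checks for.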
The above procedure constitutes a minimum test for qualifying file systems (and
the associated configuration options) for production use in scenarios where
write contention may occur. Users planning deployments of RDFox that use the
file-sequence persistence option are advised to conduct their own testing
using this procedure. The degree of write contention can be varied in the test
by changing the numeric parameters to the command which represent the minimum
and maximum duration in milliseconds between iterations of the test loop.
It is worth noting that the atomic operation described above is only required
in situations where there is a risk of write contention and that a broader
range of file systems may be safe to use under the constraint that write
contention will not occur. This can be achieved by ensuring (externally) that
all writes will be committed via the same nominated instance. Some approaches
to this are reviewed in Section 21.3. To
qualify file systems for use in such setups, the
rwtest command can be
invoked with the
read-only argument on all but one of the hosts.
On UNIX operating systems, RDFox uses the
link system call as the
atomic commit operation, while on Windows
MoveFileW is used. The
EEXIST (UNIX) and
ERROR_ALREADY_EXISTS (Windows) error codes are
interpreted to mean that the commit has failed because another instance
successfully committed a change first.
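The UNIX half of this can be observed directly: link(2) refuses to overwrite an existing path and fails with EEXIST. A minimal sketch:

```python
import errno
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    staged = os.path.join(d, "staged")
    target = os.path.join(d, "v42")
    with open(staged, "w") as f:
        f.write("delta for version 42")
    os.link(staged, target)             # first committer claims the version
    try:
        os.link(staged, target)         # a second claim on the same path...
    except OSError as e:
        assert e.errno == errno.EEXIST  # ...fails with EEXIST, never overwrites
    else:
        raise AssertionError("link over an existing path must fail")
```

This atomic create-or-fail behaviour is what lets the losing instance roll back its transaction and catch up instead of silently overwriting a committed version.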
13.2.2.2. Replication Performance
In order for a change to be replicated between instances, the instance writing the change must successfully commit it to the path reserved for the new version number, and the other instances must then discover it and apply the changes to their own in-memory copies of the data. The time taken for this process is called replication lag.
In all, there are three mechanisms by which a receiving instance can discover
new version files. The first is through polling. The poll interval is
controlled by the
file-system-poll-interval server parameter which has a
default of 60 s. A separate poll interval is established for each persisted
component. This means that, in the case of a server with three persisted data
stores, the file system will be polled five times in each interval: once for
new server versions, once for new role manager versions, and once for each of
the three data stores. It is generally desirable to keep polling intervals long
to minimise load on the file system so, while this mechanism helps bound
worst-case replication lag, it is unsuitable for achieving low average-case replication lag.
The second mechanism by which new version files are discovered is when a commit fails because the local version has fallen behind the highest persisted version. In this case, the instance will apply the new version and any others that have been created since as soon as the failed transaction has been rolled back. This mechanism is provided to reduce the time taken for an instance to catch up with the latest changes after a commit fails due to write contention. It will not, in most cases, be useful for achieving low average-case replication lag.
The third mechanism for new version files to be discovered is by notification
over UDP. This mechanism gives the lowest possible average-case replication
lag. To activate this mechanism, the
notifications-address server parameter
must be set to a host name and UDP port number separated by a plus (+)
symbol. An instance configured this way will register itself for notifications
by writing the given notifications address to the server directory itself. For
the mechanism to work, the host name must be resolvable by the other instances
that share the server directory and UDP packets must be able to flow freely
from the other instances to the specified port. With this mechanism in place,
it should be possible to achieve sub-second replication lag for instances
within the same data center assuming that changes are small.
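The notification mechanism itself is plain UDP. The sketch below shows a datagram flowing from a committing instance to a registered receiver on the loopback interface; the payload format is invented for illustration, as RDFox's wire format is not documented here.

```python
import socket

# Receiver: binds the port it would advertise via the server directory.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))     # port 0 -> OS picks a free port
receiver.settimeout(5.0)
host, port = receiver.getsockname()

# Sender: the instance that just committed a new version notifies peers.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"datastore/default:version=42", (host, port))

payload, _ = receiver.recvfrom(1024)
assert payload == b"datastore/default:version=42"
# On receipt, an instance would check the file sequence for the new
# version immediately instead of waiting for the next poll interval.
sender.close()
receiver.close()
```

Because UDP delivery is not guaranteed, the polling mechanism remains necessary as a backstop; notifications only shorten the common-case discovery time.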
13.3. Encryption

When using any form of persistence, an RDFox server can be configured to encrypt
and decrypt the data stored in the server directory by supplying a
base64-encoded key via the persistence.encryption-key server parameter. By
default, the AES-256-CBC cipher, which requires a 256-bit AES key, is used,
but this can be changed by setting the corresponding server parameter. See the
full documentation of the above parameters in Section 4.3 for more details.
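A 256-bit key in the required base64 encoding can be generated as follows. The persistence.encryption-key parameter name is taken from the text above; how the value is supplied to the server (command line, configuration file, or environment) is deployment-specific.

```python
import base64
import os

# AES-256 requires a 256-bit (32-byte) key; encode it as base64 for the
# persistence.encryption-key server parameter.
key = base64.b64encode(os.urandom(32)).decode("ascii")

assert len(base64.b64decode(key)) == 32   # exactly 256 bits of key material
print(key)                                # pass this value to the server
```

Using a cryptographically secure source such as os.urandom (rather than an ordinary PRNG) is essential for encryption keys.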