8. Managing Data Stores

As explained in Section 4, a data store encapsulates a unit of logically related information. Many applications store all of their related data in one data store, although some use more than one. It is important to keep in mind that a query or rule can operate on only one data store; thus, all information that should be queried or reasoned with as one unit should be loaded into one data store.

A data store also serves as a container for other kinds of objects:

tuple tables are data store components that store facts (see Section 9);
data sources can be registered with a data store to access external, non-RDF data (see Section 10);
OWL axioms and Datalog rules are used to specify rules of inference that are to be applied to the data loaded into the data store (see Section 6);
a dictionary keeps track of all RDF resources (i.e., IRIs, blank nodes, and literals) occurring in the facts in the data store; and
statistics modules summarize the data loaded into a data store in a way that helps query planning.

RDFox supports several different types of data stores, which govern various data store aspects such as data store capacity or whether a data store can be simultaneously used by different users; all available types are listed in Section 8.2. Moreover, the behavior of a data store can be customized using various parameters, which are listed in Section 8.3.

8.1. Operations on Data Stores

The following list summarizes the operations on data stores available in the shell or via one of the available APIs.

A data store can be created on a server. To create a data store, one must specify the data store type and zero or more parameters expressed as key-value pairs. When a data store is created, a tuple table corresponding to the default graph is created automatically. Additional tuple tables corresponding to named graphs can be created later on. A newly created data store will contain all supported built-in tuple tables (see Section 9.5), but it will not contain any axioms, user-defined rules, or facts, and no data sources will be registered.
A data store can be deleted on the server. RDFox allows a data store to be deleted only if there are no active connections to the data store.
A data store can be saved to and subsequently loaded from a binary file. The file obtained in this way contains all data store content; thus, when a data store is loaded from a file, it is restored to exactly the same state as before saving. RDFox supports the following binary formats.
- The ‘standard’ format stores the data in a way that is more resilient to changes in RDFox implementation. This format should be used in most cases.
- The ‘raw’ format stores the data in exactly the same way as the data is stored in RAM. This format allows one to reconstruct the state of a data store exactly and is therefore useful when reporting bugs, but it is more likely to change between RDFox releases.

8.2. Data Store Types

A data store type determines the indexing strategy that RDFox uses to store the data. The choice of the indexing strategy determines the maximum capacity of a data store (i.e., the maximum number of resources and/or facts), its memory footprint, the speed with which it can answer certain types of queries, and whether a data store can be used concurrently. The following data store types are currently supported.

seq
par-simple-nn
par-simple-nw
par-simple-ww
par-complex-nn
par-complex-nw
par-complex-ww

A data store can be either sequential (seq) or parallel (par). A sequential data store supports only single-threaded access, whereas a parallel data store is able to run tasks such as materialization in parallel on multiple threads.

The indexing scheme of a data store can be either simple or complex. The simple indexing scheme uses less memory than the complex one, but it can be less efficient at answering certain queries. In particular, to answer a triple pattern of the form a b ?X or ?X b a, the simple indexing scheme will traverse all triples where a occurs in subject or object, whereas the complex indexing scheme uses a hash index to identify all such triples with constant delay (i.e., retrieving the first and every other triple requires constant time). The simple scheme can thus be useful when queries are simple but memory consumption is a concern, but in most cases the complex scheme should be used.
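The difference between the two schemes can be illustrated with a toy model. The sketch below is not RDFox's actual data layout; the `occurrences` and `by_subject_predicate` structures and both function names are assumptions made purely for illustration. It contrasts scanning every triple that mentions a resource with a direct hash lookup on the (subject, predicate) pair.

```python
from collections import defaultdict

# Toy data: four triples, all of which mention the resource "a".
TRIPLES = [
    ("a", "b", "c1"),
    ("a", "b", "c2"),
    ("a", "p", "d"),
    ("e", "q", "a"),
]

# Simple-style structure (assumed): for each resource, every triple in which
# it occurs as subject or object.
occurrences = defaultdict(list)
for s, p, o in TRIPLES:
    occurrences[s].append((s, p, o))
    if o != s:
        occurrences[o].append((s, p, o))

# Complex-style structure (assumed): a hash index keyed by (subject, predicate).
by_subject_predicate = defaultdict(list)
for s, p, o in TRIPLES:
    by_subject_predicate[(s, p)].append(o)


def objects_by_scan(subject, predicate):
    """Answer 'subject predicate ?X' by scanning all occurrences of subject."""
    return [o for s, p, o in occurrences.get(subject, [])
            if s == subject and p == predicate]


def objects_by_hash(subject, predicate):
    """Answer the same pattern with a single hash lookup."""
    return list(by_subject_predicate.get((subject, predicate), []))
```

Both functions return the same answers for the pattern `a b ?X`, but the scan has to inspect all four triples that mention `a`, whereas the hash lookup touches only the two matching ones. The extra index is also where the additional memory cost of the complex scheme comes from in this model.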

In the suffixes nn, nw, and ww, the first character determines whether the system uses 32-bit (n for narrow) or 64-bit (w for wide) unsigned integers for representing resource IDs, and the second character determines whether the system uses 32-bit (n) or 64-bit (w) unsigned integers for representing triple IDs. Thus, an nw store can contain at most about 4 × 10⁹ resources and at most about 1.8 × 10¹⁹ triples.
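The naming convention can be decoded mechanically. The following sketch is a hypothetical helper, not part of any RDFox API: it splits a type name into its parts and derives the size of each ID space from the integer width. These figures are upper bounds on the ID space only; the usable capacity of a real store may be lower.

```python
from typing import NamedTuple, Optional

# Number of distinct values of an unsigned integer: n = 32 bits, w = 64 bits.
ID_SPACE = {"n": 2 ** 32, "w": 2 ** 64}


class StoreType(NamedTuple):
    parallel: bool
    indexing: Optional[str]          # "simple", "complex", or None for seq
    resource_id_space: Optional[int]
    triple_id_space: Optional[int]


def describe_store_type(name: str) -> StoreType:
    """Split a type name such as 'par-complex-nw' into its components."""
    if name == "seq":
        # The seq type name carries no suffix, so nothing is inferred for it.
        return StoreType(False, None, None, None)
    parts = name.split("-")
    if len(parts) != 3:
        raise ValueError(f"unrecognised data store type: {name}")
    kind, indexing, widths = parts
    if (kind != "par" or indexing not in ("simple", "complex")
            or widths not in ("nn", "nw", "ww")):
        raise ValueError(f"unrecognised data store type: {name}")
    return StoreType(True, indexing, ID_SPACE[widths[0]], ID_SPACE[widths[1]])
```

For example, `describe_store_type("par-complex-nw")` reports a resource ID space of 2³² = 4,294,967,296 and a triple ID space of 2⁶⁴ = 18,446,744,073,709,551,616, which matches the rounded figures quoted above.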

8.3. Data Store Parameters

In addition to the data store type, the behavior of a data store is also determined by a number of options encoded as key-value pairs. The options specified at data store creation time cannot be subsequently changed.

8.3.1. `equality`

The equality option determines how RDFox deals with the semantics of equality, which is encoded using the owl:sameAs property. This option has the following values.

off: There is no special handling of equality and the owl:sameAs property is treated as just another property. This is the default if the equality option is not specified.
noUNA: The owl:sameAs property is treated as equality, and the Unique Name Assumption (UNA) is not used; that is, deriving an equality between two IRIs does not result in a contradiction. This is the treatment of equality in OWL 2 DL.
UNA: The owl:sameAs property is treated as equality, but interpreted under UNA; that is, deriving an equality between two IRIs results in a contradiction, and only equalities between an IRI and a blank node, or between two blank nodes, are allowed. Thus, if a triple of the form <IRI₁, owl:sameAs, IRI₂> is derived, RDFox detects a clash and derives <IRI₁, rdf:type, owl:Nothing> and <IRI₂, rdf:type, owl:Nothing>.
chase: The owl:sameAs property is treated as equality with UNA, and furthermore no reflexivity axioms are derived. A data store initialized with this option does not support incremental reasoning. This option is intended to simulate the “chase” procedure commonly used in database research.

In all equality modes (i.e., all modes other than off), distinct RDF literals (e.g., strings, numbers, dates) are assumed to refer to distinct objects, and so deriving an equality between distinct literals results in a contradiction.
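The clash conditions for the noUNA and UNA modes can be sketched with a small union-find over owl:sameAs pairs. This is an illustrative model only, not how RDFox implements equality reasoning; the `(kind, value)` resource encoding and both function names are assumptions. A class of equal resources is reported as contradictory if it contains two distinct literals, or, under UNA, two distinct IRIs.

```python
from collections import defaultdict


def same_as_classes(pairs):
    """Group resources into equivalence classes with a small union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for left, right in pairs:
        parent[find(left)] = find(right)

    classes = defaultdict(set)
    for resource in list(parent):
        classes[find(resource)].add(resource)
    return list(classes.values())


def clashing_classes(pairs, mode):
    """Return the equivalence classes that are contradictory under mode."""
    if mode not in ("noUNA", "UNA"):
        raise ValueError("this sketch models only the noUNA and UNA modes")
    clashes = []
    for members in same_as_classes(pairs):
        literals = sum(1 for kind, _ in members if kind == "literal")
        iris = sum(1 for kind, _ in members if kind == "iri")
        # Distinct literals always clash; distinct IRIs clash only under UNA.
        if literals > 1 or (mode == "UNA" and iris > 1):
            clashes.append(members)
    return clashes
```

With the pairs `ex:a = _:x` and `_:x = ex:b`, the two IRIs end up in one class: the sketch reports no clash for `noUNA` and one clash for `UNA`. An equality between the literals `"1"` and `"2"` is reported as a clash in both modes.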

Note: RDFox rejects rules that use negation-as-failure or aggregation in all equality modes other than off.

8.3.2. `max-data-pool-size`

The value of the max-data-pool-size option is an integer that determines the maximum number of bytes that RDFox can use to store resource values (e.g., IRIs and strings). Specifying this option can reduce significantly the amount of virtual memory that RDFox uses per data store.

8.3.3. `max-resource-capacity`

The value of the max-resource-capacity option is an integer that determines the maximum number of resources that can be stored in the data store. Specifying this option can reduce significantly the amount of virtual memory that RDFox uses per data store.

8.3.4. `max-triple-capacity`

The value of the max-triple-capacity option is an integer that determines the maximum number of triples that can be stored in one named graph of a data store. Specifying this option can reduce significantly the amount of virtual memory that RDFox uses per data store.

8.3.5. `init-resource-capacity`

The value of the init-resource-capacity option is an integer that is used as a hint to the data store specifying the number of resources that the store will contain. This hint is used to initialize certain data structures to the sizes that ensure faster importation of data. The actual number of resources that a data store can contain is not limited by this option: RDFox will resize the data structures as needed if this hint is exceeded.

8.3.6. `init-triple-capacity`

The value of the init-triple-capacity option is an integer that is used as a hint to the data store specifying the number of triples that the store will contain. This hint is used to initialize certain data structures to the sizes that ensure faster importation of data. The actual number of triples that a data store can contain is not limited by this option: RDFox will resize the data structures as needed if this hint is exceeded.

8.3.7. `import.rename-user-blank-nodes`

If the import.rename-user-blank-nodes option is set to true, then user-defined blank nodes imported from distinct files are renamed apart during the importation process; hence, importing data merges blank nodes according to the RDF specification. There is no way to control the process of renaming blank nodes, which can be problematic in some applications. For this reason, the default value of this option is false, which ensures that the data is imported ‘as is’. Regardless of the state of this option, autogenerated blank nodes (i.e., blank nodes obtained by expanding [] or (...) in Turtle files) are always renamed apart.
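The effect of renaming apart can be seen in a small model. The sketch below is not RDFox's import code: the function names, the file names, and the `file0_` label prefix are all hypothetical. It treats any term starting with `_:` as a user-defined blank node and compares an ‘as is’ import with one that renames labels per file.

```python
def import_as_is(files):
    """Concatenate triples from all files without touching blank node labels."""
    return [triple for triples in files.values() for triple in triples]


def import_renamed_apart(files):
    """Prefix each blank node label with a per-file tag (hypothetical scheme)."""
    def rename(term, tag):
        return f"_:{tag}_{term[2:]}" if term.startswith("_:") else term

    result = []
    for index, triples in enumerate(files.values()):
        tag = f"file{index}"
        result.extend((rename(s, tag), p, rename(o, tag)) for s, p, o in triples)
    return result


# Two files that happen to use the same blank node label.
FILES = {
    "first.ttl": [("_:b1", "ex:name", '"Alice"')],
    "second.ttl": [("_:b1", "ex:name", '"Bob"')],
}
```

In this model, `import_as_is(FILES)` yields one blank node `_:b1` carrying both names, whereas `import_renamed_apart(FILES)` yields two distinct blank nodes with one name each.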

8.3.8. `import.invalid-literal-policy`

The import.invalid-literal-policy option governs how RDFox handles invalid literals during import.

error: Invalid literals in the input are treated as errors, and so files containing such literals cannot be imported. This is the default.
as-string: Invalid literals are converted to string literals during import. Moreover, for each invalid literal, a warning is emitted alerting the user to the fact that the value was converted.
as-string-silent: Invalid literals are converted to string literals during import, but without emitting a warning.

Note that this option applies only to data importation, and not to DELETE/INSERT updates or queries.
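The three policies can be modelled for a single datatype as follows. This is a minimal sketch rather than RDFox's parser: the function name and the returned `(datatype, value)` pairs are assumptions, and only xsd:integer is checked, using the lexical form of an optional sign followed by digits.

```python
import re
import warnings

# Lexical form of xsd:integer: an optional sign followed by decimal digits.
_INTEGER = re.compile(r"[+-]?[0-9]+")

POLICIES = ("error", "as-string", "as-string-silent")


def import_integer_literal(lexical, policy="error"):
    """Apply an invalid-literal policy to a would-be xsd:integer literal."""
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    if _INTEGER.fullmatch(lexical):
        return ("xsd:integer", int(lexical))
    if policy == "error":
        # The import fails on the first invalid literal.
        raise ValueError(f"invalid xsd:integer literal: {lexical!r}")
    if policy == "as-string":
        warnings.warn(f"invalid literal {lexical!r} converted to a string")
    # Both as-string variants keep the lexical form as a plain string.
    return ("xsd:string", lexical)
```

For instance, `import_integer_literal("abc", "as-string-silent")` returns `("xsd:string", "abc")`, while the same call with the default policy raises an error.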

8.3.9. `auto-update-statistics`

The auto-update-statistics option governs how RDFox manages statistics about the data loaded into the system. RDFox uses these statistics during query planning in order to identify an efficient plan, so query performance may be suboptimal if the statistics are not up to date. If auto-update-statistics is set to true, which is the default value, then the statistics are updated automatically when the number of facts in the system changes by more than 10%. If this option is turned off, then the statistics can be updated manually using the stats update command or via one of the available APIs.
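The 10% rule can be written as a simple check. The sketch below assumes that the change is measured relative to the fact count at the time of the last statistics update; the text does not state the baseline, so this is an interpretation, and the function name is hypothetical. Integer arithmetic is used so that a change of exactly 10% is not affected by floating-point rounding.

```python
def statistics_stale(facts_at_last_update: int, facts_now: int,
                     threshold_percent: int = 10) -> bool:
    """Return True if the fact count changed by more than threshold_percent."""
    if facts_at_last_update == 0:
        # Any fact added to an empty store counts as a change (assumption).
        return facts_now > 0
    change = abs(facts_now - facts_at_last_update)
    # Equivalent to change / facts_at_last_update > threshold_percent / 100.
    return change * 100 > threshold_percent * facts_at_last_update
```

Under this reading, growing from 1,000 to 1,100 facts is exactly 10% and does not trigger an update, whereas reaching 1,101 facts, or shrinking to 890, does.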

8.3.10. `swrl-negation-as-failure`

The swrl-negation-as-failure option determines how RDFox treats ObjectComplementOf class expressions in SWRL rules.

off: SWRL rules are interpreted under the open-world assumption and SWRL rules featuring ObjectComplementOf are rejected. This is the default value.
on: SWRL rules are interpreted under the closed-world assumption, as described in Section 6.7.3.

8.3.11. `persist-ds`

The persist-ds option controls how RDFox persists data contained in a data store. The option can be set to:

file: The content of the data store will be automatically and incrementally saved to files within the server directory. The data store option persist-ds can be set to file only if the server parameter persist-ds is also set to file.
off: The content of the data store will reside in memory only and will be discarded when RDFox exits.

If the persist-ds option is not specified for a data store, then the data store uses the value of the persist-ds option specified for the server.
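The interaction between the server-level and data-store-level settings can be summarized in a few lines. This is a sketch of the rules stated above, not RDFox code; the function name and the use of `None` for "not specified" are assumptions.

```python
from typing import Optional


def effective_persist_ds(server_value: str,
                         store_value: Optional[str] = None) -> str:
    """Resolve the persist-ds setting that applies to one data store."""
    if store_value is None:
        # Not specified for the data store: inherit the server's value.
        return server_value
    if store_value not in ("file", "off"):
        raise ValueError(f"unsupported persist-ds value: {store_value}")
    if store_value == "file" and server_value != "file":
        # A data store may persist to files only if the server does too.
        raise ValueError("persist-ds=file requires persist-ds=file on the server")
    return store_value
```

For example, a store created without the option on a server using `file` resolves to `file`, a store may opt out with `off` on such a server, and requesting `file` on a server set to `off` is rejected.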
