6. How Information is Structured in RDFox

This section describes how information is conceptually structured in RDFox.

The different elements and their relationships are represented in the diagram below and described in more detail next.

[Diagram: RDFox concepts]

6.1. Servers

A server is the top-level container object in RDFox; it contains any number of data stores. It provides functionality for creating and deleting data stores, as well as some other management functions.

A default server is created when RDFox is run on the command line, and it can be exposed via the endpoint command. When RDFox is used within a Java application, the local server is created on demand via one of a number of APIs.

Clients can access RDFox through server connections and data store connections. User authentication takes place when a connection is being established.
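For example, the following shell commands expose the default server over HTTP. This is a minimal sketch: the endpoint.port variable name and the port value are assumptions and may differ between versions.

# Expose the default server via the REST endpoint.
set endpoint.port "12110"
endpoint start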

6.2. Data Stores

A server can contain multiple data stores. Each data store is given a name that is unique within the server; it manages its own data, rules, and OWL 2 axioms, and it does not interfere with the other data stores in the system.

A data store must have a type and can optionally be configured with a number of options encoded as key-value pairs. Different data stores in a server can be of different types and/or can use different options.
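For example, a data store might be created from the shell as follows. This is a sketch assuming the dstore create command takes a name, a type, and option key-value pairs; the store name is illustrative.

# Create a parallel data store with the complex indexing scheme,
# passing one option as a key-value pair.
dstore create myStore par-complex-nn equality noUNA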

6.2.1. Data Store Types

A data store type determines the indexing strategy that RDFox uses to store the data. The choice of the indexing strategy determines the maximum capacity of a data store (i.e., the maximum number of resources and/or facts), its memory footprint, the speed with which it can answer certain types of queries, and whether a data store can be used concurrently. The following data store types are currently supported.

  • seq

  • par-simple-nn

  • par-simple-nw

  • par-simple-ww

  • par-complex-nn

  • par-complex-nw

  • par-complex-ww

A data store can be either sequential (seq) or parallel (par). A sequential data store supports only single-threaded access, whereas a parallel data store is able to run tasks such as materialization in parallel on multiple threads.

The indexing scheme of a data store can be either simple or complex. The simple indexing scheme uses less memory than the complex one, but it can be less efficient at answering certain queries. In particular, to answer a triple pattern of the form a b ?X or ?X b a, the simple indexing scheme will traverse all triples where a occurs in subject or object, whereas the complex indexing scheme uses a hash index to identify all such triples with constant delay (i.e., retrieving the first and every other triple requires constant time). The simple scheme can thus be useful when queries are simple but memory consumption is a concern, but in most cases the complex scheme should be used.

In the suffixes nn, nw, and ww, the first character determines whether the system uses 32-bit (n for narrow) or 64-bit (w for wide) unsigned integers for representing resource IDs, and the second character determines whether the system uses 32-bit (n) or 64-bit (w) unsigned integers for representing triple IDs. Thus, an nw store can contain at most 4 × 10⁹ resources and at most 1.8 × 10¹⁹ triples.

6.2.2. Data Store Options

In addition to the data store type, the behavior of a data store is also determined by a number of options encoded as key-value pairs. The options specified at data store creation time cannot be subsequently changed.

6.2.2.1. equality

The equality option determines how RDFox deals with the semantics of equality, which is encoded using the owl:sameAs property. This option has the following values.

  • off: There is no special handling of equality and the owl:sameAs property is treated as just another property. This is the default if the equality option is not specified.

  • noUNA: The owl:sameAs property is treated as equality, and the Unique Name Assumption is not used – that is, deriving an equality between two IRIs does not result in a contradiction. This is the treatment of equality in OWL 2 DL.

  • UNA: The owl:sameAs property is treated as equality, but interpreted under UNA – that is, deriving an equality between two IRIs results in a contradiction, and only equalities between an IRI and a blank node, or between two blank nodes, are allowed. Thus, if a triple of the form <IRI₁, owl:sameAs, IRI₂> is derived, RDFox detects a clash and derives <IRI₁, rdf:type, owl:Nothing> and <IRI₂, rdf:type, owl:Nothing>.

  • noUNAnoRef: The owl:sameAs property is treated as equality without UNA, and furthermore no reflexivity axioms are derived. A data store initialized with this option does not support incremental reasoning.

  • UNAnoRef: The owl:sameAs property is treated as equality with UNA, and furthermore no reflexivity axioms are derived. A data store initialized with this option does not support incremental reasoning.

In all equality modes (i.e., all modes other than off), distinct RDF literals (e.g., strings, numbers, dates) are assumed to refer to distinct objects, and so deriving an equality between two distinct literals results in a contradiction.

Note: RDFox will reject rules that use negation-as-failure or aggregation in all equality modes other than off.
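To illustrate, consider importing the following Turtle snippet (the IRIs are illustrative) into a store created with equality set to UNA:

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix ex:  <http://example.com/> .

# An equality between two IRIs:
ex:alice owl:sameAs ex:bob .

# Under UNA, RDFox detects a clash and derives:
#   ex:alice rdf:type owl:Nothing .
#   ex:bob   rdf:type owl:Nothing .
# Under noUNA, ex:alice and ex:bob are instead treated as the same object.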

6.2.2.2. max-data-pool-size

The value of the max-data-pool-size option is an integer that determines the maximum number of bytes that RDFox can use to store resource values (e.g., IRIs and strings). Specifying this option can significantly reduce the amount of virtual memory that RDFox uses per data store.

6.2.2.3. max-resource-capacity

The value of the max-resource-capacity option is an integer that determines the maximum number of resources that can be stored in the data store. Specifying this option can significantly reduce the amount of virtual memory that RDFox uses per data store.

6.2.2.4. max-triple-capacity

The value of the max-triple-capacity option is an integer that determines the maximum number of triples that can be stored in one named graph of a data store. Specifying this option can significantly reduce the amount of virtual memory that RDFox uses per data store.

6.2.2.5. init-resource-capacity

The value of the init-resource-capacity option is an integer that is used as a hint to the data store specifying the number of resources that the store will contain. This hint is used to initialize certain data structures to the sizes that ensure faster importation of data. The actual number of resources that a data store can contain is not limited by this option: RDFox will resize the data structures as needed if this hint is exceeded.

6.2.2.6. init-triple-capacity

The value of the init-triple-capacity option is an integer that is used as a hint to the data store specifying the number of triples that the store will contain. This hint is used to initialize certain data structures to the sizes that ensure faster importation of data. The actual number of triples that a data store can contain is not limited by this option: RDFox will resize the data structures as needed if this hint is exceeded.

6.2.2.7. import.rename-blank-nodes

If the import.rename-blank-nodes option is set to true, then blank nodes imported from distinct files are renamed apart during importation; importing data thus corresponds to merging RDF graphs as defined in the RDF specification. There is no way to control the process of renaming blank nodes, which can be problematic in some applications. Because of that, the default value of this option is false, since this ensures that the data is imported ‘as is’.
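As an illustration, suppose the same blank node label occurs in two files; the file names and data here are hypothetical.

# file1.ttl:  _:n <http://example.com/p> "a" .
# file2.ttl:  _:n <http://example.com/p> "b" .
#
# With import.rename-blank-nodes set to true, the two occurrences of _:n
# are renamed apart, producing two distinct blank nodes.
# With the default value false, both files contribute facts about the
# same blank node _:n.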

6.2.2.8. import.invalid-literal-policy

The import.invalid-literal-policy option governs how RDFox handles invalid literals during import.

  • error: Invalid literals in the input are treated as errors, and so files containing such literals cannot be imported. This is the default.

  • as-string: Invalid literals are converted to string literals during import. Moreover, for each invalid literal, a warning is emitted alerting the user to the fact that the value was converted.

  • as-string-silent: Invalid literals are converted to string literals during import, but without emitting a warning.
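For example, the following literal is invalid because its lexical form does not match its datatype (the IRIs are illustrative). Under error the import fails; under as-string the value is imported as a string literal and a warning is emitted; under as-string-silent the conversion happens without a warning.

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<http://example.com/item1> <http://example.com/weight> "twenty"^^xsd:integer .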

6.2.2.9. auto-update-statistics

The auto-update-statistics option governs how RDFox manages statistics about the data loaded into the system. RDFox uses these statistics during query planning in order to identify an efficient plan, so query performance may be suboptimal if the statistics are not up to date. If auto-update-statistics is set to true, which is the default value, then the statistics are updated automatically when the number of facts in the system changes by more than 10%. If this option is turned off, then the statistics can be updated manually using the stats update command or via one of the available APIs.
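For example, after a large batch import with auto-update-statistics turned off, the statistics can be refreshed manually from the shell:

stats update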

6.2.2.10. owl-in-rdf-support

The owl-in-rdf-support option governs how RDFox translates data loaded into the EDB fact domain (Section 6.5.2) into OWL 2 axioms.

  • off: Facts in the EDB domain are not translated into OWL 2 axioms. This is the default setting.

  • strict: Facts in the EDB domain will be translated into OWL 2 axioms, when possible, using the algorithms in https://www.w3.org/TR/owl2-mapping-to-rdf/. Those axioms that conform to the OWL 2 RL profile will be interpreted by RDFox as rules. The facts will not be removed from the EDB domain by this translation.

    Example: The following triples will be translated into the following OWL 2 axioms in the triples axiom domain (Section 6.3):

    <X> rdf:type owl:Class .
    <Y> rdf:type owl:Class .
    <X> rdfs:subClassOf <Y> .
    
    Declaration( Class( <X> ) )
    Declaration( Class( <Y> ) )
    SubClassOf( <X> <Y> )
    
  • relaxed: Facts in the EDB domain will be translated with the same algorithm as with the strict option, but RDFox will tolerate the omission of some triples representing OWL 2 Declaration axioms while translating other axioms. RDFox will infer the missing declarations where possible.

    Example: The following triple will be translated into the following OWL 2 axiom in the triples axiom domain (Section 6.3), as RDFox can infer that <X> and <Y> are classes; note, however, that no Declaration axioms (and no additional triples) are created.

    <X> rdfs:subClassOf <Y> .
    
    SubClassOf( <X> <Y> )
    

See Sections 6.3 and 6.4 for more details.

6.2.2.11. persist-ds

The persist-ds option controls how RDFox persists data contained in a data store. The option can be set to:

  • file: The content of the data store will be automatically and incrementally saved to files within the server directory. It is only valid to set the data store persist-ds option to file if the server parameter persist-ds is also set to file.

  • off: The content of the data store will reside in memory only and will be discarded when RDFox exits.

If the persist-ds option is not specified for a data store, then it will use the value of the persist-ds option specified for the server.
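For example, assuming the dstore create syntax sketched earlier in Section 6.2, a persisted data store might be created as follows (valid only if the server itself was started with persist-ds set to file):

dstore create myStore par-complex-nn persist-ds file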

6.2.3. Data Store Contents

A data store contains the types of objects given next. Further detailed information about each type of object is given in subsequent sections.

  • A single dictionary, which is used to map resources to the IDs that are used during query processing. Most users will not use the dictionary directly. However, resources can be large, so shipping them in full in each query answer could be inefficient. By exposing the dictionary through the API, query answers can be transported as IDs that clients independently resolve into resources; moreover, clients can cache resource IDs, which can significantly improve performance.

  • A collection of OWL 2 axioms provided by the user. RDFox can parse arbitrary OWL 2 axioms and store them internally so that individual axioms can be deleted from the store later on. RDFox will translate OWL 2 axioms into rules for the purpose of reasoning, as discussed in Section 5.6. If an axiom is deleted from the store, its corresponding rules will also be deleted.

  • A collection of rules. Rules in a data store can stem from three different sources. First, rules may be provided directly by the user. Second, they may be obtained from the transformation of OWL 2 axioms. Third, they may be internal rules obtained from the axiomatization of equality or from the partial encoding of the semantics of OWL 2 constructs.

  • A collection of tuple tables, each of which acts as a container of facts and is roughly analogous to a relation in relational databases. Thus, a data store does not contain facts directly, but via tuple tables.

  • A collection of data sources, each encapsulating a source such as a relational database or a CSV file. Each data source is assigned a name that is unique for the server.

  • Zero or one statistics modules. Like most databases, RDFox needs various statistics about the data it contains; these are mainly used for query planning. Configuring the available statistics is largely of interest to system administrators.

6.2.4. Operations on Data Stores

Data stores can be managed using the shell or one of the available APIs.

The following are the main operations allowed on data stores.

  • Creation. A data store can be created on a server. When creating a data store, we must specify its type and we can optionally also specify relevant options. When a data store is created, a default tuple table is also created automatically (this corresponds to the default graph). Additional tuple tables (e.g., corresponding to named graphs) can be added later on. Upon creation a data store contains no axioms, rules or facts (those can be imported later on) and it also contains no data sources or statistics.

  • Deletion. A data store can be deleted from a server. A data store can only be deleted if it contains no rules, axioms, facts, data sources, or statistics, and if it contains no tuple tables other than the default one. Hence, before deleting a data store, one must first delete all of its contents.

  • Loading and saving. A data store can be saved to a binary file and subsequently loaded. A binary file obtained from saving a store contains the whole state of the store (e.g., all its contents). When loading a data store from a file, we will obtain a store that is identical to the one from which the file was generated. RDFox supports the following binary formats.

    • The ‘standard’ format stores the data in a way that allows fast loading and is more resilient to changes in the RDFox implementation. This format should be used in most cases.

    • The ‘raw’ format stores the data in exactly the same way as the data is stored in RAM. This format allows one to reconstruct the state of a data store exactly and is therefore useful when reporting bugs. However, it is less resilient to updates to RDFox.

6.3. OWL 2 Axioms

Each OWL axiom in RDFox is associated with one or both of the following axiom domains.

  • user: The axioms that a user imports from a file or via an API are stored in the user domain.

  • triples: Any axioms that are derived from triples in the active data store are stored in the triples domain. Only facts in the EDB fact domain are translated into axioms, and no Assertion axioms are created. For more information about the translation of RDF data into OWL axioms, please refer to https://www.w3.org/TR/owl2-mapping-to-rdf/.

An axiom can belong to both axiom domains if it has been explicitly imported and an identical axiom can be translated from the loaded triples. In this case, a single rule will be created in the axioms rule domain. Deleting an axiom from one domain will not affect its copy in the other.

Only the user axiom domain can be directly affected by users. When facts are added to or removed from the active data store, the triples domain will be updated to correspond to the axioms that may be translated from the updated triples. This recalculation of the translated axioms (and any resulting update to the axioms rule domain) occurs before incremental reasoning is performed.

An axiom domain can be specified when axioms are exported to a file, defaulting to user.

6.3.1. Operations on OWL 2 Axioms

OWL 2 axioms can be imported into the data store or deleted from the store. Importing an axiom adds the axiom to the axiom container of the data store. Deleting an axiom removes it from the axiom container. When an axiom is imported, RDFox will also add to the store a set of rules obtained from translating the axiom on a best-effort basis; when deleting an axiom, the corresponding rules will also be deleted.

Axioms can also be exported in a number of available human-friendly formats (currently, functional-style syntax).

6.4. Rules

Each rule in RDFox is associated with one or more of the following three rule domains.

  • user: The rules that a user imports from a file or via an API are stored in the user domain.

  • axioms: This domain contains a translation of the OWL axioms in the active data store (in both axiom domains) converted to rules. Only axioms that conform to the OWL 2 RL profile are translated.

  • internal: If the owl-in-rdf-support option is set to relaxed or strict, this domain contains a small number of fixed rules that describe some of the semantics of OWL ontologies represented in RDF data. For example, if the RDF data includes the following triples representing a pair of subclass axioms…

    <X> rdf:type owl:Class .
    <Y> rdf:type owl:Class .
    <Z> rdf:type owl:Class .
    
    <X> rdfs:subClassOf <Y> .
    <Y> rdfs:subClassOf <Z> .
    

    … an internal rule encapsulating the transitivity of subclasses will deduce that class <X> is a subclass of class <Z>. Rules in the internal domain may be exported but may not be modified. The rules in this domain are listed in Section 5.6.4.
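    In RDFox's Datalog syntax, such a rule might look roughly as follows; this is a sketch, the exact formulation used internally may differ, and the rdfs prefix declaration is omitted.

    [?X, rdfs:subClassOf, ?Z] :- [?X, rdfs:subClassOf, ?Y], [?Y, rdfs:subClassOf, ?Z] .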

A rule can belong to multiple domains. For example, a rule could be derived from one (or more) axiom domains and therefore be present in the axioms rule domain, and it could also be imported manually into the user rule domain. Note that deleting a rule from one domain will not affect its copy in the other.

Only the user rule domain can be directly affected by users. If axioms are updated (either directly by user modification of the user axiom domain, or as a consequence of modifying data that causes the triples axiom domain to change), corresponding updates are automatically made to the axioms rule domain.

A rule domain can be specified when rules are exported to a file, defaulting to user.

6.4.1. Operations on Rules

Rules can be imported into the data store or deleted from the store. Importing a rule adds the rule to the rule container of the data store. Deleting a rule removes it from the rule container.

The rules in a data store can also be exported in a number of available human-friendly formats (currently, Datalog).

6.5. Tuple Tables

A data store keeps its data in a number of tuple tables. Each tuple table is identified by a name that is unique for a data store. Moreover, each tuple table has a nonnegative integer arity that determines the number of arguments of facts stored in the tuple table. Arity can be zero, in which case the tuple table is a propositional variable. Finally, one can specify a name for each argument of a tuple table.

Upon creation, a data store contains a ternary tuple table called internal:triple that stores triples belonging to the default graph of the store. The arguments of this tuple table are called subject, predicate, and object.
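For example, a SPARQL query that does not mention any named graph is answered from internal:triple:

SELECT ?S ?P ?O WHERE { ?S ?P ?O }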

Additional tuple tables can be mounted from data sources. In this way, RDFox can access data stored in external data sources, such as relational databases and CSV files, and integrate this data in its reasoning process. Furthermore, additional tuple tables may also stem from named graphs.

6.5.1. Types of Tuple Tables

Tuple tables can be of one of two types.

  • A tuple table can be stored in main memory, in which case we call it an in-memory tuple table. Users can explicitly import facts into and delete facts from such tuple tables. The internal table internal:triple is an in-memory tuple table, and so are all tuple tables stemming from named graphs.

  • A tuple table can be backed by a data source, such as a CSV file or a relational database. We call such tuple tables data source tuple tables. Users cannot explicitly import or delete data for this kind of tuple table since its contents are determined by the underpinning data source.

Both types of tuple tables are managed using the same API, which is described in this section. All modification functions described in this section are not transactional: they are applied immediately, and in fact their invocation fails if the connection has an active transaction. Consequently, there is no way to roll back the effects of these functions.

6.5.2. Facts

A tuple table contains a collection of facts.

Each fact in RDFox is associated with one or more of the following four fact domains.

  • EDB: The facts that a user imports from a file or via an API are stored into the Extensional Database (EDB for short) domain.

  • IDB: The facts that are derived via rules are stored into the Intensional Database (IDB for short) domain.

  • IDBrep: If a data store is initialized with support for equality reasoning, then, whenever a triple <s, owl:sameAs, o> is derived, RDFox will choose one of the two resources (say s) as the representative resource and replace all occurrences of the other resource with the representative. The facts that consist only of representative resources belong to the IDBrep domain. When equality reasoning is not used, this domain is identical to the IDB domain.

  • IDBrepNoEDB: This domain contains all facts that are in the IDBrep domain but not in the EDB domain.

A fact can belong to more than one domain. For example, facts added to the store are stored into the EDB domain, and during reasoning they are transferred into the IDB domain.

Only the EDB fact domain can be directly affected by users. That is, all explicitly added facts are added to the EDB domain, and only those facts can be deleted. In particular, it is not possible to manually delete derived facts: the semantics of such deletions is not clear, and doing so is out of scope of RDFox.

A fact domain can be specified when queries are evaluated. For example, if a query is evaluated with respect to the EDB domain, it will “see” only the facts that were explicitly added to a data store, but not the facts that were derived by reasoning.
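The following sketch illustrates the fact domains with a hypothetical fact and rule (RDFox Datalog syntax; prefix declarations omitted):

# Imported fact: in EDB and, after materialization, also in IDB and IDBrep.
#   :peter :hasChild :john .

[?X, rdf:type, :Parent] :- [?X, :hasChild, ?Y] .

# Derived fact: in IDB, IDBrep, and IDBrepNoEDB, but not in EDB.
#   :peter rdf:type :Parent .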

6.5.3. Operations on Tuple Tables

An in-memory tuple table can be added to the store. When added, a tuple table does not contain any facts. However, adding a tuple table means that we can start importing rules that explicitly refer to it.

In-memory tuple tables can be deleted provided that they contain no facts and that they are not mentioned in any rules.

Facts can be imported into an in-memory tuple table and also deleted from it.

6.6. Data Sources

A data source in a data store encapsulates an external (non-RDF) source such as a relational database or a CSV file. Each data source is assigned a name that is unique for the server. Moreover, each data source can expose one or more data source tables, each of which allows inspecting the raw data in the source.

A data source provides functionality to mount a tuple table that allows consistent access to the data in the source. The difference between a data source table and a mounted tuple table is subtle but important. A data source table exposes the raw data in the source without any transformations. In contrast, a mounted tuple table can apply transformations to the source data. For example, a mounted tuple table can expose a subset of the columns of a data source, it can convert some of the values in certain columns into IRIs, and it can add additional columns derived from the existing ones using a limited set of transformation operations.

We next illustrate by means of a running example how to import and manage data from an external data source in RDFox using the command line interface.

dsource add delimitedFile "personDS" \
    file "$(dir.root)csv/person.csv" \
    header true

This command adds the person.csv file as a data source. Each data source is given a user-defined name (personDS in this case), which will be useful later for referring to the added data source. To further explain the format of this command:

  • add means that the data source is being added;

  • delimitedFile is the type of the data source; it indicates a source whose records are stored in a file and are delimited by a character. Other data source types exist; for example, PostgreSQL can be used to import a database table.

  • personDS is the user-defined name of the data source.

The remaining parameters are a list of key-value pairs specific to the data source type. Most parameters have sensible defaults. The delimitedFile data source type has the following parameters.

  • file specifies the path to the file.

  • header specifies whether the file contains a header row; the default is false.

  • delimiter specifies the character used to split rows into records. The default is the comma (,); other possible values include <tab> and <space>, with the obvious meaning.

  • quote specifies the character that can be used to escape a field. The default is empty. If it is set to ", then a field "abc, def" in a comma-separated file would be parsed as a single field.

After a data source has been added, we can check what tables it contains (see the following command, for example). A delimitedFile data source will always contain just one table; however, a database could contain zero or more tables.

dsource show personDS

With the following command, one can sample the data from the data source to see what kind of data is there. This can be used to aid attaching the data source (see the step after this one). Value 1 specifies the number of records to be sampled.

dsource sample personDS "$(dir.root)csv/person.csv" 1

The following command attaches a table from the data source as a relation of RDFox. The new relation will have an IRI, which can then be used in queries and rules.

dsource attach fg:person "personDS" \
    "columns" 5                     \
    "1" "http://fg.com/{1}_{2}"     \
    "1.datatype" "iri"              \
    "2" "{First Name}"              \
    "2.datatype" "string"           \
    "3" "{Last Name}"               \
    "3.datatype" "string"           \
    "4" "{Sex}"                     \
    "4.datatype" "string"           \
    "4.if-empty" "default"          \
    "4.default" "M"                 \
    "5" "{ 4 }"                     \
    "5.datatype" "integer"          \
    "5.if-empty" "absent"

The command format is the following.

  • attach means that the table is being attached.

  • fg:person is the IRI of the new relation.

  • personDS is the name of the data source that the relation is imported from.

The remaining parameters are a list of key-value pairs that describe the imported relation. The actual parameters depend on the type of the data source. The following parameters are supported for the delimitedFile type.

  • columns specifies the number of columns in the attached relation. If omitted, the default is the number of columns in the first row of the input delimited file. Note that the number of columns in the file does not need to be the same as the number of columns in the attached relation.

  • k specifies how the lexical value of the values in the k-th column is constructed. The value is a string that contains elements such as {n} or {name}. At runtime, {n} will be replaced with the string found in the n-th column of the input file, and {name} will be replaced with the string found in the column of the input file called name. The latter is possible only if the input file contains the header row, which then provides the names for the columns in the input file. The default for this parameter is {k}.

  • k.datatype specifies the type of the k-th column. It can be an XML Schema datatype (e.g., xsd:boolean), or string, iri, integer, or double.

  • k.if-empty specifies how to deal with empty input columns. Note that there are no NULL values in CSV files; however, input columns can be empty, which is the closest approximation. Now assume that the lexical value for column k is given as {1} abc {2}. Option k.if-empty applies only to the case when both {1} and {2} are empty (if, say, only {2} is empty, then {2} is replaced with the empty string and no further processing is done). Option k.if-empty can have the following values; the default value is default.

    • leave means “leave as is”. In this example, {1} and {2} are replaced with the empty string so the resulting lexical value is abc.

    • absent means “treat the field as absent”. In this example, the corresponding row in the RDFox relation will have a “hole” in the k-th position. This allows for dealing with absent values in a consistent way.

    • default means “replace the lexical form with the default for this column (as specified next)”.

  • k.default specifies the default value for the k-th column in the case explained above.

  • k.invalid-literal-policy specifies how to deal with values in the input that cannot be converted into valid RDF literals (e.g., if the value for a column of type integer contains letters). Note that RDFox does not load delimited files at the point a data table is attached: loading happens on the fly when a mounted tuple table is accessed during reasoning or query evaluation. Therefore, invalid literals are detected only during reasoning or querying. The default value for this parameter is error.

    • error means that such literals are treated as errors. For example, a reasoning process that accesses a delimited file will be interrupted with an error the first time an invalid literal is encountered.

    • as-string-silent means that invalid literals are silently converted into strings so that reasoning or query answering can continue.

In this example, the attached relation will contain an extra column 1 that contains a unique ID for each person and is constructed from the first and the last name. Note that the lexical forms may refer to the source columns both by column position and by column name. Also, note that column 4 has default M (i.e., if no sex is specified, then M is used as a default), and that column 5 is treated as absent if the corresponding input field is empty.

Once attached, relations can be used freely in rules and queries. Bear in mind that relational atoms use round brackets (), whereas RDF atoms use square brackets [] (see Section 5.4.1).
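For example, assuming a suitable declaration of the fg: prefix, the relation attached above could be used in a rule such as the following sketch, which derives RDF triples from the relation's rows:

# Derive RDF triples from the rows of the attached relation.
[?ID, rdf:type, fg:Person], [?ID, fg:lastName, ?Last] :-
    fg:person(?ID, ?First, ?Last, ?Sex, ?Age) .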