6. How Information is Structured in RDFox
This section describes how information is conceptually structured in RDFox. The different elements and their relationships are described in more detail next.
6.1. Servers
A server is a top-level container object of RDFox and it contains any number of data stores. It provides functionality for creating and deleting data stores, as well as some other management functions.
A default server is created when RDFox is run on the command line, and
it can be exposed via the endpoint
command. When RDFox is used
within a Java application, the local server is created on demand via one
of a number of APIs.
Clients can access RDFox through server connections and data store connections. User authentication takes place when a connection is being established.
6.1.1. Server Options
The behavior of a server is determined by the following set of options encoded as key-value pairs. With the exception of num-threads, the options specified at server creation time cannot subsequently be changed.
Option | Value | Description
---|---|---
max-memory | an integer | Specifies the maximum amount of memory (in MB) that the RDFox instance is allowed to use. The default is 0.9 times the installed memory.
num-threads | an integer | Specifies the number of threads that the system will use for tasks such as reasoning and importation. The default is the number of logical processors available on the machine.
license-content | a string | Specifies the license content verbatim. This parameter is not set by default. See Section 2.4.3 for the precedence of license-related options.
license-file | a string | Specifies the path to the license file to use. The default value is …
server-directory | a string | Specifies the server directory. A directory may only be used by one RDFox server at a time. The default value is …
persist-ds | file or off | If the value is file, the content of data stores is saved automatically and incrementally to files within the server directory; if the value is off, data store content resides in memory only. This server-level value is the default for the data store option of the same name (see Section 6.2.2.12).
persist-roles | file or off | If the value is file, information about roles is persisted within the server directory; if the value is off, it is kept in memory only.
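For illustration, server options can be supplied when starting RDFox from the command line, before the mode name. The following sketch (placeholder values; it assumes the usual -key value option syntax) starts a sandbox server that uses eight threads and no data store persistence.

RDFox -num-threads 8 -persist-ds off sandbox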
6.2. Data Stores
A server can contain multiple data stores. Each data store is given a name that is unique within the server; it manages its own data, rules, and OWL 2 axioms, and it does not interfere with the other data stores in the system.
A data store must have a type and can optionally have a number of relevant options encoded as key-value pairs. Different data stores in a server can be of different types and/or can use different options.
6.2.1. Data Store Types
A data store type determines the indexing strategy that RDFox uses to store the data. The choice of the indexing strategy determines the maximum capacity of a data store (i.e., the maximum number of resources and/or facts), its memory footprint, the speed with which it can answer certain types of queries, and whether a data store can be used concurrently. The following data store types are currently supported.
seq
par-simple-nn
par-simple-nw
par-simple-ww
par-complex-nn
par-complex-nw
par-complex-ww
A data store can be either sequential (seq) or parallel (par). A sequential data store supports only single-threaded access, whereas a parallel data store is able to run tasks such as materialization in parallel on multiple threads.
The indexing scheme of a data store can be either simple or complex. The simple indexing scheme uses less memory than the complex one, but it can be less efficient at answering certain queries. In particular, to answer a triple pattern of the form a b ?X or ?X b a, the simple indexing scheme will traverse all triples in which a occurs in the subject or object position, whereas the complex indexing scheme uses a hash index to identify all such triples with constant delay (i.e., retrieving the first and every subsequent triple requires constant time). The simple scheme can thus be useful when queries are simple but memory consumption is a concern; in most cases, however, the complex scheme should be used.
In the suffixes nn, nw, and ww, the first character determines whether the system uses 32-bit (n, for narrow) or 64-bit (w, for wide) unsigned integers for representing resource IDs, and the second character determines whether the system uses 32-bit (n) or 64-bit (w) unsigned integers for representing triple IDs. Thus, an nw store can contain at most 2³² ≈ 4 × 10⁹ resources and at most 2⁶⁴ ≈ 1.8 × 10¹⁹ triples.
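The type is fixed when a data store is created. For illustration, a minimal sketch in the shell (myStore is a placeholder name; this assumes the dstore create command takes the store name followed by the type):

dstore create myStore par-complex-nn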
6.2.2. Data Store Options
In addition to the data store type, the behavior of a data store is also determined by a number of options encoded as key-value pairs. The options specified at data store creation time cannot be subsequently changed.
6.2.2.1. equality
The equality option determines how RDFox deals with the semantics of equality, which is encoded using the owl:sameAs property. This option has the following values.
off: There is no special handling of equality, and the owl:sameAs property is treated as just another property. This is the default if the equality option is not specified.
noUNA: The owl:sameAs property is treated as equality, and the Unique Name Assumption (UNA) is not used – that is, deriving an equality between two IRIs does not result in a contradiction. This is the treatment of equality in OWL 2 DL.
UNA: The owl:sameAs property is treated as equality, but interpreted under UNA – that is, deriving an equality between two IRIs results in a contradiction, and only equalities between an IRI and a blank node, or between two blank nodes, are allowed. Thus, if a triple of the form <IRI₁, owl:sameAs, IRI₂> is derived, RDFox detects a clash and derives <IRI₁, rdf:type, owl:Nothing> and <IRI₂, rdf:type, owl:Nothing>.
noUNAnoRef: The owl:sameAs property is treated as equality without UNA, and furthermore no reflexivity axioms are derived. A data store initialized with this option does not support incremental reasoning.
UNAnoRef: The owl:sameAs property is treated as equality with UNA, and furthermore no reflexivity axioms are derived. A data store initialized with this option does not support incremental reasoning.
In all equality modes (i.e., all modes other than off), distinct RDF literals (e.g., strings, numbers, dates) are assumed to refer to distinct objects, and so deriving an equality between distinct literals results in a contradiction.
Note: RDFox will reject rules that use negation-as-failure or aggregation in all equality modes other than off.
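For example, under the UNA mode, importing the following Turtle snippet (the ex: IRIs are purely illustrative) derives an equality between two IRIs, so RDFox detects a clash and derives ex:a rdf:type owl:Nothing and ex:b rdf:type owl:Nothing; under noUNA, the two resources would instead simply be treated as equal.

@prefix ex: <http://example.com/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
ex:a owl:sameAs ex:b .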
6.2.2.2. max-data-pool-size
The value of the max-data-pool-size option is an integer that determines the maximum number of bytes that RDFox can use to store resource values (e.g., IRIs and strings). Specifying this option can significantly reduce the amount of virtual memory that RDFox uses per data store.
6.2.2.3. max-resource-capacity
The value of the max-resource-capacity option is an integer that determines the maximum number of resources that can be stored in the data store. Specifying this option can significantly reduce the amount of virtual memory that RDFox uses per data store.
6.2.2.4. max-triple-capacity
The value of the max-triple-capacity option is an integer that determines the maximum number of triples that can be stored in one named graph of a data store. Specifying this option can significantly reduce the amount of virtual memory that RDFox uses per data store.
6.2.2.5. init-resource-capacity
The value of the init-resource-capacity option is an integer that is used as a hint to the data store specifying the number of resources that the store will contain. This hint is used to initialize certain data structures to sizes that ensure faster importation of data. The actual number of resources that a data store can contain is not limited by this option: RDFox will resize the data structures as needed if this hint is exceeded.
6.2.2.6. init-triple-capacity
The value of the init-triple-capacity option is an integer that is used as a hint to the data store specifying the number of triples that the store will contain. This hint is used to initialize certain data structures to sizes that ensure faster importation of data. The actual number of triples that a data store can contain is not limited by this option: RDFox will resize the data structures as needed if this hint is exceeded.
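As an illustration, capacity hints are supplied at data store creation time like any other option; the following sketch (placeholder name and sizes, assuming the key-value creation syntax used above) reserves room for 10⁸ resources and 10⁹ triples up front.

dstore create myStore par-complex-ww init-resource-capacity 100000000 init-triple-capacity 1000000000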
6.2.2.7. import.rename-user-blank-nodes
If the import.rename-user-blank-nodes option is set to true, then user-defined blank nodes imported from distinct files are renamed apart during the importation process; hence, importing data merges blank nodes according to the RDF specification. There is no way to control the process of renaming blank nodes, which can be problematic in some applications. Because of that, the default value of this option is false, since this ensures that the data is imported ‘as is’.
Regardless of the state of this option, autogenerated blank nodes (i.e., blank nodes obtained by expanding [] or (...) in Turtle files) are always renamed apart.
6.2.2.8. import.invalid-literal-policy
The import.invalid-literal-policy option governs how RDFox handles invalid literals during import.
error: Invalid literals in the input are treated as errors, and so files containing such literals cannot be imported. This is the default.
as-string: Invalid literals are converted to string literals during import. Moreover, for each invalid literal, a warning is emitted alerting the user to the fact that the value was converted.
as-string-silent: Invalid literals are converted to string literals during import, but without emitting a warning.
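For instance, the object of the following triple (illustrative IRIs) is an invalid literal, because abc is not in the lexical space of xsd:integer. Under error, importing the file fails; under as-string, the value is imported as the string "abc" and a warning is emitted; under as-string-silent, it is imported as a string without a warning.

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<http://example.com/item1> <http://example.com/count> "abc"^^xsd:integer .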
6.2.2.9. auto-update-statistics
The auto-update-statistics option governs how RDFox manages statistics about the data loaded into the system. RDFox uses these statistics during query planning in order to identify an efficient plan, so query performance may be suboptimal if the statistics are not up to date. If auto-update-statistics is set to true, which is the default value, then the statistics are updated automatically when the number of facts in the system changes by more than 10%. If this option is turned off, then the statistics can be updated manually using the stats update command or via one of the available APIs.
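For example, one might disable automatic updates at creation time and refresh the statistics manually after a bulk import (a sketch; myStore and data.ttl are placeholders, and the creation line follows the key-value convention used above).

dstore create myStore par-complex-nn auto-update-statistics false
import data.ttl
stats update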
6.2.2.10. owl-in-rdf-support
The owl-in-rdf-support option governs how RDFox translates data loaded into the EDB fact domain (Section 6.5.2) into OWL 2 axioms.
off: Facts in the EDB domain are not translated into OWL 2 axioms. This is the default setting.
strict: Facts in the EDB domain will be translated into OWL 2 axioms, when possible, using the algorithms in https://www.w3.org/TR/owl2-mapping-to-rdf/. Those axioms that conform to the OWL 2 RL profile will be interpreted by RDFox as rules. The facts will not be removed from the EDB domain by this translation.
Example: The following triples
<X> rdf:type owl:Class .
<Y> rdf:type owl:Class .
<X> rdfs:subClassOf <Y> .
will be translated into the following OWL 2 axioms in the triples axiom domain (Section 6.3):
Declaration( Class( <X> ) )
Declaration( Class( <Y> ) )
SubClassOf( <X> <Y> )
relaxed: Facts in the EDB domain will be translated with the same algorithm as with the strict option, but RDFox will tolerate the omission of some triples representing OWL 2 Declaration axioms while translating other axioms. RDFox will infer the missing declarations where possible.
Example: The following triple
<X> rdfs:subClassOf <Y> .
will be translated into the following OWL 2 axiom in the triples axiom domain (Section 6.3), as RDFox can infer that <X> and <Y> are classes; note that no Declaration axioms (and no additional triples) are created.
SubClassOf( <X> <Y> )
6.2.2.11. swrl-negation-as-failure
The swrl-negation-as-failure option determines how RDFox treats ObjectComplementOf class expressions in SWRL rules.
off: SWRL rules are interpreted under the open-world assumption, and SWRL rules featuring ObjectComplementOf are rejected. This is the default value.
on: SWRL rules are interpreted under the closed-world assumption, as described in Section 5.7.3.
6.2.2.12. persist-ds
The persist-ds option controls how RDFox persists data contained in a data store. The option can be set to:
file: The content of the data store will be automatically and incrementally saved to files within the server directory. It is only valid to set the data store persist-ds option to file if the server parameter persist-ds is also set to file.
off: The content of the data store will reside in memory only and will be discarded when RDFox exits.
If the persist-ds option is not specified for a data store, then it will use the value of the persist-ds option specified for the server.
6.2.3. Data Store Contents
A data store contains the types of objects given next. Further detailed information about each type of object is given in subsequent sections.
A single dictionary, which is used to map resources to IDs that are used during query processing. Most users would not use the dictionary directly. However, resources can be large, so shipping them in full in each answer to a query could be inefficient. By exposing the dictionary through the API, query answers can be transported as IDs that clients independently resolve into resources; moreover, clients can cache resource IDs, which can significantly improve performance.
A collection of OWL 2 axioms provided by the user. RDFox can parse arbitrary OWL 2 axioms and store them internally so that individual axioms can be deleted from the store later on. RDFox will translate OWL 2 axioms into rules for the purpose of reasoning, as discussed in Section 5.6. If an axiom is deleted from the store, its corresponding rules will also be deleted.
A collection of rules. Rules in a data store can stem from three different sources. First, rules may be provided directly by the user. Second, they may be obtained from the transformation of OWL 2 axioms. Third, they may be internal rules obtained from the axiomatization of equality or from the partial encoding of the semantics of OWL 2 constructs.
A collection of tuple tables, each of which acts as a container of facts and is roughly analogous to a relation in relational databases. Thus, a data store does not contain facts directly, but via tuple tables.
A collection of data sources, each encapsulating a source such as a relational database, a CSV file, or an Apache Solr instance. Each data source is assigned a name that is unique for the server.
Zero or one statistics modules. Like most databases, RDFox needs various statistics about the data it contains for its operation. These are mainly used for query planning. Configuring the available statistics is largely of interest to system administrators.
6.2.4. Operations on Data Stores
Data stores can be managed using the shell or one of the available APIs.
The following are the main operations allowed on data stores.
Creation. A data store can be created on a server. When creating a data store, we must specify its type and we can optionally also specify relevant options. When a data store is created, a default tuple table is also created automatically (this corresponds to the default graph). Additional tuple tables (e.g., corresponding to named graphs) can be added later on. Upon creation a data store contains no axioms, rules or facts (those can be imported later on) and it also contains no data sources or statistics.
Deletion. A data store can be deleted on a server. A data store can only be deleted if it contains no rules, axioms, facts, data sources, or statistics, and if it contains no tuple tables other than the default one. Hence, before deleting a data store one must delete all its contents first.
Loading and saving. A data store can be saved to a binary file and subsequently loaded. A binary file obtained from saving a store contains the whole state of the store (e.g., all its contents). When loading a data store from a file, we will obtain a store that is identical to the one from which the file was generated. RDFox supports the following binary formats.
The ‘standard’ format stores the data in a way that allows fast loading but is more resilient to changes in RDFox implementation. This format should be used in most cases.
The ‘raw’ format stores the data in exactly the same way as the data is stored in RAM. This format allows one to reconstruct the state of a data store exactly and is therefore useful when reporting bugs. However, it is less resilient to updates to RDFox.
6.3. OWL 2 Axioms
Each OWL axiom in RDFox is associated with one or both of the following axiom domains.
user: The axioms that a user imports from a file or via an API are stored in the user domain.
triples: Any axioms that are derived from triples in the active data store are stored in the triples domain. Only facts in the EDB fact domain are translated into axioms, and no Assertion axioms are created. For more information about the translation of RDF data into OWL axioms, please refer to https://www.w3.org/TR/owl2-mapping-to-rdf/.
An axiom can belong to both axiom domains if it has been explicitly imported and an identical axiom can be translated from the loaded triples. In this case a single rule will be created in the axioms rule domain. Deleting an axiom from one domain will not affect a copy in the other.
Only the user axiom domain can be directly affected by users. When facts are added to or removed from the active data store, the triples domain will be updated to correspond to the axioms that may be translated from the updated triples. This recalculation of the translated axioms (and any resulting update to the axioms rule domain) occurs before incremental reasoning is performed.
An axiom domain can be specified when axioms are exported to a file,
defaulting to user
.
6.3.1. Operations on OWL 2 Axioms
OWL 2 axioms can be imported into the data store or deleted from the store. Importing an axiom adds the axiom to the axiom container of the data store. Deleting an axiom removes it from the axiom container. When an axiom is imported, RDFox will also add to the store a set of rules obtained from translating the axiom on a best-effort basis; when deleting an axiom, the corresponding rules will also be deleted.
Axioms can also be exported in a number of available human-friendly formats (currently, functional-style syntax).
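For illustration, a file axioms.fss (a placeholder name) in functional-style syntax might contain the following ontology.

Prefix( : = <http://example.com/> )
Ontology(
    SubClassOf( :Dog :Animal )
)

Assuming the shell's import command accepts such files, it could then be loaded as follows.

import axioms.fss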
6.4. Rules
Each rule in RDFox is associated with one or more of the following three rule domains.
user: The rules that a user imports from a file or via an API are stored in the user domain.
axioms: This domain contains a translation of OWL axioms in the active data store (in both axiom domains) converted to rules. Only axioms that conform to the OWL 2 RL profile are translated.
internal: If the owl-in-rdf-support option is set to relaxed or strict, this domain contains a small number of fixed rules that describe some of the semantics of OWL ontologies represented in RDF data. For example, if the RDF data includes the following triples representing a pair of subclass axioms
<X> rdf:type owl:Class .
<Y> rdf:type owl:Class .
<Z> rdf:type owl:Class .
<X> rdfs:subClassOf <Y> .
<Y> rdfs:subClassOf <Z> .
then an internal rule encapsulating the transitivity of subclasses will deduce that class <X> is a subclass of class <Z>. Rules in the internal domain may be exported but may not be modified. The rules in this domain are listed in Section 5.6.4.
A rule can belong to multiple domains. For example, a rule could be derived from one (or more) axiom domains, and therefore be present in the axioms rule domain, and could also be imported manually into the user rule domain. Note that deleting a rule from one domain will not affect a copy in the other.
Only the user rule domain can be directly affected by users. If axioms are updated (either directly by user modification of the user axiom domain, or as a consequence of modifying data that causes the triples axiom domain to change), corresponding updates are automatically made to the axioms rule domain.
A rule domain can be specified when rules are exported to a file,
defaulting to user
.
6.4.1. Operations on Rules
Rules can be imported into the data store or deleted from the store. Importing a rule adds the rule to the rule container of the data store. Deleting a rule removes it from the rule container.
The rules in a data store can also be exported in a number of available human-friendly formats (currently, Datalog).
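For example, a rule file rules.dlog (a placeholder name) in RDFox's Datalog syntax might contain the following rule, which derives a type triple for every resource that has a name; full IRIs are used so that the sketch does not depend on any prefix declarations.

[?x, <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, <http://example.com/Person>] :- [?x, <http://example.com/hasName>, ?n] .

Assuming the shell's import command accepts rule files, it could then be loaded as follows.

import rules.dlog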
6.5. Tuple Tables
A data store keeps its data in a number of tuple tables. Each tuple table is identified by a name that is unique for a data store. Moreover, each tuple table has a nonnegative integer arity that determines the number of arguments of facts stored in the tuple table. Arity can be zero, in which case the tuple table is a propositional variable. Finally, one can specify a name for each argument of a tuple table.
Upon creation, a data store contains a ternary tuple table called internal:triple that stores triples belonging to the default graph of the store. The arguments of this tuple table are called subject, predicate, and object.
Additional tuple tables can be mounted from data sources. In this way, RDFox can access data stored in external data sources, such as relational databases, CSV files, or Apache Solr, and integrate this data in its reasoning process. Furthermore, additional tuple tables may also stem from named graphs.
6.5.1. Types of Tuple Tables
Tuple tables can be of one of two types.
A tuple table can be stored in main memory, in which case we call it an in-memory tuple table. Users can explicitly import facts into and delete facts from such tuple tables. The internal table internal:triple is a modifiable tuple table, and so are all tuple tables stemming from named graphs.
A tuple table can be backed by a data source, such as a CSV file, a relational database, or an Apache Solr index. We call such tuple tables data source tuple tables. Users cannot explicitly import or delete data for this kind of tuple table, since their contents are determined by the underpinning data source.
Both types of tuple tables are managed using the same API, which is described in this section. All modification functions described in this section are not transactional: they are applied immediately, and in fact their invocation fails if the connection has an active transaction. Consequently, there is no way to roll back the effects of these functions.
6.5.2. Facts
A tuple table contains a collection of facts.
Each fact in RDFox is associated with one or more of the following four fact domains.
EDB: The facts that a user imports from a file or via an API are stored in the Extensional Database (EDB for short) domain.
IDB: The facts that are derived via rules are stored in the Intensional Database (IDB for short) domain.
IDBrep: If a data store is initialized with support for equality reasoning, then, whenever a triple <s, owl:sameAs, o> is derived, RDFox will identify one of the two resources (say s) as a representative resource and replace all occurrences of the other resource with the representative. The facts that consist only of representative resources belong to the IDBrep domain. When equality reasoning is not used, this domain is identical to the IDB domain.
IDBrepNoEDB: This domain contains all facts that are in the IDBrep domain but not in the EDB domain.
A fact can belong to more than one domain. For example, facts added to
the store are stored into the EDB
domain, and during reasoning they
are transferred into the IDB
domain.
Only the EDB
fact domain can be directly affected by users. That is,
all explicitly added facts are added to the EDB
domain, and only
those facts can be deleted. In particular, it is not possible to
manually delete derived facts: the semantics of such deletions is not
clear, and doing so is out of scope of RDFox.
A fact domain can be specified when queries are evaluated. For example,
if a query is evaluated with respect to the EDB
domain, it will
“see” only the facts that were explicitly added to a data store, but not
the facts that were derived by reasoning.
6.5.3. Operations on Tuple Tables
An in-memory tuple table can be added to the store. When added, a tuple table does not contain any facts. However, adding a tuple table means that we can start importing rules that explicitly refer to it.
In-memory tuple tables can be deleted provided that they contain no facts and that they are not mentioned in any rules.
Facts can be imported into an in-memory tuple table and also deleted from it.
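For illustration, facts in a Turtle file can be imported into the default tuple table with the shell's import command; this sketch uses data.ttl as a placeholder file name and assumes that prefixing the argument with - requests deletion of the listed facts instead of addition.

import data.ttl
import - data.ttl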
6.6. Data Sources
A data source in a data store encapsulates an external (non-RDF) source such as a relational database, a CSV file, or an Apache Solr index. Each data source is assigned a name that is unique for the server. Moreover, each data source can expose one or more data source tables, each of which allows inspecting the raw data in the source.
A data source provides functionality to mount a tuple table that allows consistent access to the data in the source. The difference between a data source table and a mounted tuple table is subtle but important. A data source table exposes the raw data in the source without any transformations. In contrast, a mounted tuple table can already apply some transformations to the source data. For example, a mounted tuple table can expose a subset of columns from a data source, it can convert some of the values in certain columns into IRIs, and it can add additional columns derived from the existing ones using a limited set of transformation operations.
6.6.1. CSV Data Sources
We next illustrate by means of a running example how to import and manage data from an external CSV data source in RDFox using the command-line interface.
dsource add delimitedFile "personDS" \
file "$(dir.root)csv/person.csv" \
header true
This command adds the person.csv file as a data source. Each data source is given a user-defined name (personDS in this case), which will be useful later for referring to the added data source. To further explain the format of this command,
add means that a data source is being added;
delimitedFile determines the type of the data source – one where records are stored in a file and are delimited by a character; other types, such as PostgreSQL, can be used to import a database table;
personDS is the user-defined name of the data source.
The remaining parameters are a list of key-value pairs specific to the data source type. Most parameters have sensible defaults. The delimitedFile data source type has the following parameters.
file specifies the path to the file.
header specifies whether the file contains a header row; the default is false.
delimiter specifies the character used to split rows into records. The default is the comma (,); <tab> and <space> are other possible values with the obvious meaning.
quote specifies the character that can be used to escape a field. The default is empty. If it is set to ", then a field "abc, def" in a comma-separated file would be parsed as a single field.
After a data source has been added, we can check what tables it contains (see the following command, for example). A delimitedFile data source will always contain just one table; however, a database could contain zero or more tables.
dsource show personDS
With the following command, one can sample the data from the data source to see what kind of data is there. This can be used to aid attaching the data source (see the step after this one). The value 1 specifies the number of records to be sampled.
dsource sample personDS "$(dir.root)csv/person.csv" 1
The following command attaches a table from the data source as a relation of RDFox. The new relation will have an IRI which can then be used in the queries and rules.
dsource attach fg:person "personDS" \
"columns" 5 \
"1" "http://fg.com/{1}_{2}" \
"1.datatype" "iri" \
"2" "{First Name}" \
"2.datatype" "string" \
"3" "{Last Name}" \
"3.datatype" "string" \
"4" "{Sex}" \
"4.datatype" "string" \
"4.if-empty" "default" \
"4.default" "M" \
"5" "{ 4 }" \
"5.datatype" "integer" \
"5.if-empty" "absent"
The command format is the following.
attach means that the table is being attached.
fg:person is the IRI of the new relation.
personDS is the name of the data source that the relation is imported from.
The remaining parameters are a list of key-value pairs that describe the imported relation. The actual parameters depend on the type of the data source. The following parameters are supported for the delimitedFile type.
columns specifies the number of columns in the attached relation. If omitted, the default is the number of columns in the first row of the input delimited file. Note that the number of columns in the file does not need to be the same as the number of columns in the attached relation.
k specifies how the lexical value of the values in the k-th column is constructed. The value is a string that contains elements such as {n} or {name}. At runtime, {n} will be replaced with the string found in the n-th column of the input file, and {name} will be replaced with the string found in the column of the input file called name. The latter is possible only if the input file contains the header row, which then provides the names for the columns in the input file. The default for this parameter is {n}.
k.datatype specifies the type of the k-th column. It can be an XML Schema datatype (e.g., xsd:boolean), or string, iri, integer, or double.
k.if-empty specifies how to deal with empty input columns. Note that there are no NULL values in CSV files; however, input columns can be empty, which is the best possible approximation. Now assume that the lexical value for column k is given as {1} abc {2}. Option k.if-empty applies only to the case when both {1} and {2} are empty (if, say, just {2} is empty, then {2} is replaced with the empty string and no further processing is done). Option k.if-empty can have the following values; the default value is default.
leave means “leave as is”. In this example, {1} and {2} are replaced with the empty string, so the resulting lexical value is abc.
absent means “treat the field as absent”. In this example, the corresponding row in the RDFox relation will have a “hole” in the k-th position. This allows for dealing with absent values in a consistent way.
default means “replace the lexical form with the default for this column (as specified next)”.
k.default specifies the default value for the k-th column in the case explained above.
k.invalid-literal-policy specifies how to deal with values in the input that cannot be converted into valid RDF literals (e.g., if the value for a column of type integer contains letters). Note that RDFox does not load delimited files at the point a data table is attached: loading happens on the fly when a mounted tuple table is accessed during reasoning or query evaluation. Therefore, invalid literals are detected only during reasoning or querying. The default value for this parameter is error.
error means that such literals are treated as errors. For example, a reasoning process that accesses a delimited file will be interrupted with an error the first time an invalid literal is encountered.
as-string-silent means that invalid literals are silently converted into strings so that reasoning or query answering can continue.
In this example, the attached relation will contain an extra column 1 that contains a unique ID for each person and is constructed from the first and the last name. Note that the lexical forms may refer to the source columns both by column position and by column name. Also, note that column 4 has default M (i.e., if no sex is specified, then M is used as a default), and that column 5 is treated as absent if the corresponding input field is empty.
Once attached, relations can be used freely in rules and queries. Bear in mind that relational atoms use round brackets (), whereas RDF atoms use square brackets [] (see Section 5.4.1).
6.6.2. Apache Solr Data Sources
In this section we describe how to access Apache Solr from RDFox using the command-line interface.
To connect to Apache Solr, we first add a data source of type solr. The following command adds a data source called businessDS, which is configured to provide access to the indexes companies and people served by the Solr instance listening on port 8983 of localhost.
dsource add solr "businessDS" \
host localhost \
port 8983 \
indexes "companies,people"
A full list of available parameters for data sources of type solr is described below.
host specifies where the Solr instance is hosted;
port or service-name specifies the port on which the Solr instance is listening;
protocol determines the network layer protocol to be used for connecting to the Solr instance: ‘IPv4’ (default), ‘IPv6’, or ‘IPv6-v4’;
channel specifies the channel type: ‘unsecure’ (default), ‘secure-transport’, ‘open-ssl’;
connection-keep-alive-time determines how long HTTP connections should be kept alive in the HTTP connection pool used by the data source;
indexes is a comma-separated list of names of the indexes in the Solr instance that are going to be accessible from the defined data source.
As with all data sources, we can show information about businessDS using the following command, which will list all Solr indexes and their fields.
dsource show businessDS
Furthermore, we can sample the first 10 documents in the Solr index companies using the following command.
dsource sample businessDS companies 10
To attach a tuple table against a particular index, we use the dsource attach command. For example, the following command attaches a tuple table fg:companyName to the Solr index companies.
dsource attach fg:companyName \
"index" "companies" \
"solr.q" "*:*" \
"columns" 1 \
"1" "{company_name}"
The command defines fg:companyName as a tuple table with a single column, which is bound to the document attribute company_name of the underlying index. All parameters used to describe the column definitions are the same as in the case of delimited data sources, except for the parameter k.type, which will be discussed shortly.
The target Solr index is specified using the parameter index. The Solr parameter q, which determines the Solr query to be used when accessing data from the index, is specified as solr.q. All Solr parameters are specified using the prefix solr. (so, e.g., sort becomes solr.sort). Commonly used Solr parameters are given next.
q determines the Solr query evaluated against the index;
sort determines the sort criteria for the query result (default score desc);
rows determines the maximum number of rows returned by Solr (default 10).
The Solr parameters wt, fl, and omitHeader are reserved by RDFox, and specifying them will result in an error.
In our next example, we make use of Solr’s full text search capabilities. With
the following command we define a tuple table fg:caCompanyName
for the
names of the first 1000 companies whose addresses contain the string “CA”.
dsource attach fg:caCompanyName \
"index" "companies" \
"solr.q" "address:*CA*" \
"solr.rows" 1000 \
"columns" 1 \
"1" "{company_name}"
To this end, we have updated the Solr query parameter q
, and we have
specified the Solr parameter rows
. For an overview of the Solr query
language refer to the Solr documentation.
Note that in many scenarios, the Solr query may become fully specified not
during tuple table creation, but during SPARQL query formulation, or even
during SPARQL query evaluation. In our final example, we introduce the notion
of a parameter column and we demonstrate how such columns can be used to
dynamically specify the Solr query (or any other Solr parameter). Consider
the following command, which defines a tuple table fg:companyNameByAddress
with two columns.
dsource attach fg:companyNameByAddress \
"index" "companies" \
"solr.q" "address:*{2}*" \
"solr.rows" 1000 \
"columns" 2 \
"1" "{company_name}" \
"2.type" "parameter"
As before, the first column is going to be populated from the Solr index with
the name of each matched company. The second column, however, is a
parameter column, signified by its type parameter
. The value of this
column has to be specified every time the tuple table is accessed. This
parameter column is referenced in the definition of the Solr query
solr.q
using its index 2
. This reference will be dynamically expanded
to the lexical form of the resource that is bound to the second position of the
tuple table during query or rule evaluation.
So, for example, the following SPARQL query will return the names of the first 1000 companies whose addresses contain either “CA” or “NY”.
SELECT ?name WHERE {
VALUES ?address { "CA" "NY" }
TT fg:companyNameByAddress{ ?name ?address }
}
Note that the values passed to parameter columns need not be constants in the query, but can also be dynamically bound from the data, as shown in the following query.
SELECT ?name WHERE {
?address a fg:addressPattern .
TT fg:companyNameByAddress{ ?name ?address }
}
By default, RDFox escapes Solr special characters in lexical forms during parameter expansion. In some cases, however, the escaping of Solr special characters may not be the desired behaviour, for example, when the entire Solr query is parametrised. To indicate that a reference to a parameter column with index k should not be escaped during expansion, one needs to prefix it with the + sign, as shown in the following example.
dsource attach fg:companyNameByAddress \
"index" "companies" \
"solr.q" "{+2}" \
"solr.rows" 1000 \
"columns" 2 \
"1" "{company_name}" \
"2.type" "parameter"
We can now ask for all companies that have “CA” in their addresses as follows.
SELECT ?name WHERE {
VALUES ?solrQuery { "address:*CA*" }
TT fg:companyNameByAddress{ ?name ?solrQuery }
}
Note that, if we instead used the reference {2}, the query would fail, because RDFox would escape the special characters : and *.
A number of restrictions apply to the use of parameter columns, which we outline next.
A tuple table can have any number of parameter columns.
Parameter columns can be used in any Solr parameter (e.g., rows, sort, etc.).
Parameter columns can be referenced in multiple fields and multiple times.
All parameter columns have to be specified/bound when accessing the tuple table. For example, the following query is invalid, since ?address cannot be bound.
SELECT * WHERE { TT fg:companyNameByAddress{ ?name ?address } }
Solr parameters can only refer to parameter columns (the use of "solr.q" "address:*{1}*" will result in an error).
Properties of parameter columns are ignored.