10. Accessing Data Sources¶
As explained in Section 4, a data source can be registered with a data store to provide access to an external, non-RDF source of data, such as a relational database, a CSV file, or an Apache Solr index. This data can then be used in queries and rules, this allowing for an integration of heterogenous data within RDFox. Data sources also allow users to use specialist query engines such as Solr for indexed text searches within RDFox queries.
Accessing external data is typically a two-step process.
First, a data source is registered with a data store. This involves specifying the data source name, data source type, and configuration parameters. The supported data source types and the configuration parameters available for each type are documented in Section 10.1.
Second, one or more data source tuple tables are created, which provide suitable ‘views’ into the data of the data source (see Section 9.1). Such tuple tables can be used in used in queries and rules in exactly the same way as all other tuple tables (see Section 9.3). To create a data source tuple table, one must specify parameters that govern how external data is transformed to be compatible with RDF. The available parameters depend on the type of data source, and are described in detail in Section 10.2.
Data sources and their tuple tables can be managed using any of the available
APIs (see Section 13.5.). In the RDFox shell, management is
achieved using the dsource
(see Section 15.2.2.11) and
tupletable
(see Section 15.2.2.42) commands.
10.1. Registering Data Sources¶
This section documents the types of data source that can be registered with RDFox along with their configuration options.
10.1.1. Registering Delimited File Data Sources¶
CSV and other delimited file types are registered as type
delimitedFile
. The parameters for registering this type of data
source are described below.
Parameter |
Default value |
Description |
---|---|---|
|
|
Specifies the character used to delimit the cells in the file. |
|
(none) |
The path to the delimited file to add. This parameter must be specified. |
|
|
Boolean parameter specifying whether the file has a header or not |
|
|
Specifies the character used to quote strings in the file. |
Example: Registering a CSV file as a Data Source using the Shell
Assume the following short CSV file exists as people.csv
in our
shell root directory.
First Name,Last Name,Sex,Age
Beverly,Strickland,F,33
Thomas,Peterson,M,
Urs,Andresson,M,63
Abigail,Edwards,F,52
Abigail,Edwards,,36
The following command registers the file as a delimitedFile
data
source with name peopleDS
.
dsource register delimitedFile "peopleDS" \
file "$(dir.root)people.csv" \
header true
Once added the data source can be inspected with the following command.
dsource show personDS
It is also possible to sample the native data source table. The following command samples the first three rows.
dsource sample peopleDS "$(dir.root)people.csv" 3
The second parameter to the dsource sample
command is the name
of the table to sample. In the case of delimitedFile
data
sources, there is only one choice of table but the parameter is
nonetheless required. The path of the file is used in this case.
10.1.2. Registering PostgreSQL Data Sources¶
PostgreSQL data sources are registered as type PostgreSQL
. The
parameters for registering this type of data source are described
below.
Parameter |
Default value |
Description |
---|---|---|
|
The empty string |
The connection string RDFox should use to connect to the PostgreSQL instance. This parameter must be specified. |
|
|
The name of the default schema RDFox should search for any unqualified table names within this data source. |
|
Linux: macOS: Windows: |
The file name of the |
10.1.3. Registering ODBC Data Sources¶
ODBC data sources are registered as type ODBC
. The parameters for
registering this type of data source are described below.
Parameter |
Default value |
Description |
---|---|---|
|
The empty string |
The connection string RDFox should use to connect to the ODBC instance. This parameter must be specified. |
|
The empty string |
The name of the default schema RDFox should search for any unqualified table names within this data source. |
|
Linux: macOS: Windows: |
The file name of the library RDFox should dynamically load to act as the driver the driver manager for connecting to the ODBC instance. |
|
|
Specifies the character used to open and close quotes within the data source. |
|
|
Specifies the character used to open quotes within the data source. |
|
|
Specifies the character used to close quotes within the data source. |
10.1.4. Registering Solr Data Sources¶
Solr Data Sources are registered as type solr
. The parameters for
registering this type of data source are described below.
Parameter |
Default value |
Description |
---|---|---|
|
|
Specifies the channel type used to connect to the Solr instance. Valid values for this parameter are:
|
|
|
Determines how long, in seconds, HTTP connections are kept alive in the connection pool for the Solr instance. |
|
|
The name or IP address of the host running the Solr instance. This parameter must be specified. |
|
(none) |
A comma-separated list of index names in the Solr instance that will be accessible from the defined data source. |
|
|
Port number on |
|
|
Specifies the network layer protocol to be used for connecting to the Solr instance. Valid values for this parameter are:
|
10.2. Creating Data Source Tuple Tables¶
Creating a data source tuple table requires an IRI to use as the table’s name and a set of parameters specifying how to construct the table’s values from one of the data store’s registered data sources.
The parameters specifying a data source tuple table can be broken into
two groups. The first group determines the exact source table the
tuple table will be bound to. The source table could be one of the data
source’s tables or, depending on the type of data source, the result of
a query against the data source as a whole. With the exception of
dataSourceName
, parameters in this group are specific to each type
of data source. The second group of parameters determines how the
columns of the tuple table are formed from the columns of the source
table. These parameters are the same for all data source types. The
following diagram illustrates the relationships between data source
tables, data source tuple tables, queries and source tables.
This section will first document the tuple table creation parameters common to all types of data source, and proceed to documenting the specifics of creating tuple tables for each of the supported data source types.
Note
In this section, angle brackets are used to denote a variable
portion of a parameter or placeholder name. Furthermore, <n>
is
used to denote a 1-based column index into the source table, <N>
is used to denote the total number of columns the source table and
<k>
is used to denote a 1-based column index into the data
source tuple table to be created.
10.2.1. General Data Source Tuple Table Creation Parameters¶
Creating a data source tuple table for any type of data source requires
a description of how to map the columns of the source table (specified
by the parameter dataSourceName
in combination with other
type-specific parameters) to the columns of the tuple table to be
created. The parameters for specifying this mapping, which are common
to all types of data source, are as follows.
Parameter |
Default value |
Description |
---|---|---|
|
|
The number of columns the data source tuple table will have. |
|
(none) |
The name of the data source. This parameter must be specified. |
|
|
Template for the lexical form of values in the
|
|
|
The datatype of values in the |
|
|
The default value for the |
|
|
Specifies the policy for dealing with empty values
in input columns. This policy is only consulted if
all of the source table or query values
referred to within the lexical form template Valid values for this parameter are:
|
|
|
Specifies the action to take when the lexical form
of a value calculated for the Valid values for this parameter are:
|
To determine the lexical form of each value of the <k>
-th column of a
data source tuple table, RDFox applies basic string interpolation to
the lexical form template specified via parameter <k>
. The following
table lists the available placeholder names which RDFox will replace
with the appropriate value for each row. Placeholder names should be
be enclosed in braces {}
within the lexical form template.
Placeholder |
Meaning |
---|---|
|
The lexical form of the value in the |
|
The lexical form of the value in the column with
name |
|
The index of the current row. |
10.2.2. Creating Delimited File Data Source Tuple Tables¶
Since data sources of type delimitedFile
only have one possible
source table and do not support queries, creating associated tuple
tables does not require any special parameters.
Example: Creating a CSV Data Source Tuple Table using the Shell
Continuing from the example in
Section 10.1.1, a data source
tuple table with IRI fg:people
, can be created as
follows.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | tupletable create fg:people \ dataSourceName "peopleDS" \ "columns" 5 \ "1" "http://fg.com/{1}_{2}_{row#}" \ "1.datatype" "iri" \ "2" "{First Name}" \ "2.datatype" "string" \ "3" "{Last Name}" \ "3.datatype" "string" \ "4" "{Sex}" \ "4.datatype" "string" \ "4.if-empty" "default" \ "4.default" "F" \ "5" "{ 4 }" \ "5.datatype" "integer" \ "5.if-empty" "absent" |
Although the source table contains only four columns, the tuple
table created by the above command will have five. The extra column
will be in position 1
and will contain a
unique ID for each row of the CSV file. These IDs (IRIs) are
constructed from the person’s first and last names as well as the
row number within the file. Including the row number avoids using
the same IRI for two people that have the same first and last names
(such as the two “Abigail Edwards” entries in the file).
Note that the lexical form templates in the example use a mixture of
position- and name-based placeholders to refer to source columns:
position-based for column 1
(line 4) and name-based for columns
2
, 3
and 4
(lines 6, 8 and 10 respectively). Also, note
that column 4
has default F
(i.e., if no sex is specified,
then F
is used as a default), and that column 5
is treated
as absent if the corresponding input field is empty.
Querying this tuple table returns the following tuples.
<http://fg.com/Beverly_Strickland_1> "Beverly" "Strickland" "F" 33 .
<http://fg.com/Thomas_Peterson_2> "Thomas" "Peterson" "M" UNDEF .
<http://fg.com/Urs_Andresson_3> "Urs" "Andresson" "M" 63 .
<http://fg.com/Abigail_Edwards_4> "Abigail" "Edwards" "F" 52 .
<http://fg.com/Abigail_Edwards_5> "Abigail" "Edwards" "F" 36 .
10.2.3. Creating PostgreSQL Data Source Tuple Tables¶
The parameters for creating a data source tuple table for a data
source of type PostgreSQL
are as follows.
Parameter |
Default value |
Description |
---|---|---|
|
(none) |
The name of a data source table to use as the source table for the tuple table being created. Either this or |
|
Value reported by the PostgreSQL instance. |
Specifies whether the |
|
|
Specifies the schema to which
|
|
(none) |
An SQL query used to generated the source table for the tuple table being created. Either this or |
|
Value reported by the PostgreSQL instance. |
Specifies whether the |
10.2.4. Creating ODBC Data Source Tuple Tables¶
The parameters for creating a data source tuple table for a data
source of type ODBC
are as follows.
Parameter |
Default value |
Description |
---|---|---|
|
(none) |
The name of a data source table to use as the source table for the tuple table being created. Either this or |
|
Value reported by the ODBC instance. |
Specifies whether the |
|
(none) |
The time zone RDFox should apply to
values that include times in the
|
|
(none) |
Specifies the schema to which
|
|
(none) |
An SQL query used to generated the source table for the tuple table being created. Either this or |
|
Column width reported by the ODBC instance plus one, all multiplied by the width of the instance’s character type in bytes. |
The size in bytes of the buffer reserved
for string values in the |
|
Value reported by the ODBC instance. |
Specifies whether the |
|
(none) |
The time zone RDFox should apply to
values that include times in the
|
10.2.5. Creating Solr Data Source Tuple Tables¶
The parameters for creating a data source tuple table for a data
source of type solr
are as follows.
Parameter |
Default value |
Description |
---|---|---|
|
(none) |
The name of the Solr index to query. This parameter must be specified. |
|
|
Specifies the type of the Valid values for this parameter are:
See the following discussion of the significance of this parameter. |
|
(none) |
This parameter template provides a mechanism for specifying Solr query options which will be passed through directly with the noted exceptions. See Solr parameters. All Solr paramters are optional but failing to specify |
The Solr data source introduces the notion of parameter columns,
which can be used in the definition of Solr query parameters. A
column with index <k>
may be specified as a parameter column by
setting the value of <k>.type
to parameter
. The values of
parameter columns are determined dynamically during the evaluation of
queries or rules. Once determined, RDFox will interpolate the values of
the parameter columns into the Solr query parameters so that it can
make the appropriate Solr request. Parameterized Solr tuple tables can
only be used in queries and rules in which the values of parameter
columns are determined either directly, using an RDF resource, or
indirectly, using a variable whose value can be bound by the rest of
the query or rule.
Note that use of parameter columns is not mandatory - users may specify
a static value for the solr.q
parameter in which case all columns
may be of type variable.
Example: Creating a Solr Data Source Tuple Table With a Dynamic Query using the Shell
Consider the following command, which assumes the prior registration
of a Solr data source named companiesDS
and creates a tuple
table fg:companyNameByAddress
, with two columns.
tupletable create fg:companyNameByAddress \
dataSourceName "companiesDS" \
"index" "companies" \
"solr.q" "address:*{2}*" \
"solr.rows" 1000 \
"columns" 2 \
"1.type" "parameter" \
"2" "{company_name}"
At the beginning of each iteration of this table, RDFox will
substitute the value from parameter column in position 1
into
the template "address:*{2}*"
to form the query to send to the
Solr instance registered as companiesDS
.
The following SPARQL query, which references the tuple table, will return the names of the first 1000 companies whose addresses contain either “CA” or “NY”.
SELECT ?name WHERE {
VALUES ?address { "CA", "NY"}
TT fg:companyNameByAddress{ ?name ?address }
}
Note that the values passed to parameter columns need not be constants in the query, but can also be dynamically bound from the data, as shown in the following query.
SELECT ?name WHERE {
?address a fg:addressPattern .
TT fg:companyNameByAddress{ ?name ?address }
}
By default, RDFox escapes
Solr special characters
in lexical forms during parameter expansion. In some cases, however, the escaping of Solr
special characters may not be the desired behavior, like, for example, when the
entire Solr query is parameterized. To indicate that a reference to a parameter
column with index <k>
should not be escaped during expansion, one needs to
prefix it using the +
sign, as shown in the following example.
Example: Creating a Solr Data Source Tuple Table With Unescaped Queries using the Shell
The following command demonstrates the syntax for suppressing escaping during parameter expansion.
tupletable create fg:companyNameByAddress \
dataSourceName "companiesDS \
"index" "companies" \
"solr.q" "{+1}" \
"solr.rows" 1000 \
"columns" 2 \
"1.type" "parameter" \
"2" "{company_name}"
We can now ask for all companies that have “CA” in their addresses as follows.
SELECT ?name WHERE {
VALUES ?solrQuery { "address:*CA*" }
TT fg:companyNameByAddress{ ?solrQuery ?name }
}
Note that, if we instead use the reference {1}
, the query would fail,
because RDFox will escape the special characters :
and *
.
A number of restrictions apply to the use of parameter columns, which we outline next.
A tuple table can have any number of parameter columns.
Parameter columns can be used in any Solr parameter (e.g.
rows
,sort
, etc.).Parameter columns can be referenced in multiple fields and multiple times.
All parameter columns have to be specified/bound when accessing the tuple table. For example, the following query would be invalid, since
?address
cannot be bound.
SELECT * WHERE { TT fg:companyNameByAddress{ ?name ?address } }
Solr parameters can only refer to parameter columns (the use of
"solr.q" "address:*{2}*"
in the above examples will result in an error).Properties of parameter columns are ignored.
Note that different values of the parameter columns of a Solr tuple table will typically result in a different ordered set of records. The virtual column
row#
added to each of the result sets will contain the index of each record in the given result set. So, for example, a California-based company may be assigned a differentrow#
depending on whether we query for “address:CA” or “address:US”. Therow#
placeholder should therefore be used with caution when the source table contains parameter columns.
10.3. Data Sources and Incremental Reasoning¶
RDFox never caches data retrieved from a data source: facts are
fetched on demand in order to answer queries or evaluate rules. Due to
RDFox’s materialized reasoning however, triples derived on the basis of
data source facts are added to the IDB
domain and form part of the
data set for subsequent query or rule evaluations without further
retrievals from the data source. Owing to RDFox’s incremental approach
to reasoning, such facts will remain present in the data store
indefinitely. At present, there is no mechanism to force RDFox to
re-evaluate rules based on facts previously retrieved from data
sources.