User documentation¶

RecSQL package¶

RecSQL is a simple module that provides a numpy.record array frontend to an underlying SQLite table.

The SQLarray object populates a SQL table from a numpy record array, a iterable that supplies table records, or a string that contains an especially simple reStructured text table. The SQL table is held in memory and functions are provided to run SQL queries and commands on the underlying database. Queries return record arrays if possible (although a flag can explicitly change this).

Query results are cached to improve performance. This can be disabled (which is recommened for large data sets).

The SQL table is named on initialization. Later one can refer to this table by the name or the magic name __self__ in SQL statements. Additional tables can be added to the same database (by using the connection keyword of the constructor)

The recsql.rest_table module uses the base functionality to parse a restructured text table from a string (such as a doc string) and returns a nicely structured table. This allows for use of parameters that are documented in the doc strings.

Important functions and classes¶

A SQLarray can be constructed by either reading data from a CSV file or reST table with the SQLarray_fromfile() function or constructed directly from a numpy.recarray via the SQLarray constructor.

recsql.SQLarray_fromfile(filename, **kwargs)¶

Create a SQLarray from filename.

Uses the filename suffix to detect the contents:

rst, txt: restructure text (see recsql.rest_table
csv: comma-separated (see recsql.csv_table)

Arguments :	filename name of the file that contains the data with the appropriate file extension kwargs additional arguments for `SQLarray` additional arguments `recsql.csv_table.Table2array` or `recsql.rest_table.Table2array` such as mode or autoncovert.

class recsql.SQLarray(name=None, records=None, filename=None, columns=None, cachesize=5, connection=None, is_tmp=False, **kwargs)¶

A SQL table that returns (mostly) rec arrays.

SQLarray([name[, records[, columns[, cachesize=5, connection=None]]]])¶

Arguments :

name: table name (can be referred to as ‘__self__’ in SQL queries)
records: numpy record array that describes the layout and initializes the table OR any iterable (and then columns must be set, too) OR a string that contains a single, simple reStructured text table (and the table name is set from the table name in the reST table.) If None then simply associate with existing table name.
filename: Alternatively to records, read a reStructured table from filename.
columns: sequence of column names (only used if records does not have attribute dtype.names) [None]
cachesize: number of (query, result) pairs that are cached [5]
connection: If not None, reuse this connection; this adds a new table to the same database, which allows more complicated queries with cross-joins. The table’s connection is available as the attribute T.connection. [None]
is_tmp: True: create a tmp table; False: regular table in db [False]

Bugs :

InterfaceError: Error binding parameter 0 - probably unsupported type

In this case the recarray contained types such as numpy.int64 that are not understood by sqlite. Either convert the data manually (by setting the numpy dtypes yourself on the recarray, or better: feed a simple list of tuples (“records”) to this class in records. Make sure that these tuples only contain standard python types. Together with records you will also have to supply the names of the data columns in the keyword argument columns.

If you are reading from a file then it might be simpler to use recsql.sqlarray.SQLarray_fromfile().

SELECT(fields, *args, **kwargs)¶

Execute a simple SQL SELECT statement and returns values as new numpy rec array.

The arguments fields and the additional optional arguments are simply concatenated with additional SQL statements according to the template:

SELECT <fields> FROM __self__ [args]

The simplest fields argument is "*".

Example:

Create a recarray in which students with average grade less than 3 are listed:

result = T.SELECT("surname, subject, year, avg(grade) AS avg_grade",
               "WHERE avg_grade < 3", "GROUP BY surname,subject",
               "ORDER BY avg_grade,surname")

The resulting SQL would be:

SELECT surname, subject, year, avg(grade) AS avg_grade FROM __self__
     WHERE avg_grade < 3
     GROUP BY surname,subject
     ORDER BY avg_grade,surname

Note how one can use aggregate functions such avg().

The string ‘__self__’ is automatically replaced with the table name (T.name); this can be used for cartesian products such as

LEFT JOIN __self__ WHERE ...

Note

See the documentation for sql() for more details on the available keyword arguments and the use of ? parameter interpolation.

limits(variable)¶: Return minimum and maximum of variable across all rows of data.

merge(recarray, columns=None)¶

Merge another recarray with the same columns into this table.

Arguments :	recarray numpy record array that describes the layout and initializes the table
Returns :	n number of inserted rows
Raises :	Raises an exception if duplicate and incompatible data exist in the main table and the new one.

merge_table(name)¶

Merge an existing table in the database with the __self__ table.

Executes as 'INSERT INTO __self__ SELECT * FROM <name>'. However, this method is probably used less often than the simpler merge().

Arguments :	name name of the table in the database (must be compatible with __self__)
Returns :	n number of inserted rows

recarray¶: Return underlying SQL table as a read-only record array.

selection(SQL, parameters=None, **kwargs)¶

Return a new SQLarray from a SELECT selection.

This is a very useful method because it allows one to build complicated selections and essentially new tables from existing data.

Examples:

s = selection('a > 3')
s = selection('a > ?', (3,))
s = selection('SELECT * FROM __self__ WHERE a > ? AND b < ?', (3, 10))

sql(SQL, parameters=None, asrecarray=True, cache=True)¶

Execute sql statement.

Arguments :

SQL : string: Full SQL command; can contain the ? place holder so that values supplied with the parameters keyword can be interpolated using the pysqlite interface.
parameters : tuple: Parameters for ? interpolation.
asrecarray : boolean: True: return a numpy.recarray if possible; False: return records as a list of tuples. [True]
cache : boolean: Should the results be cached? Set to False for large queries to avoid memory issues. Queries with ? place holders are never cached. [True]

Warning

There are no sanity checks applied to the SQL.

If possible, the returned list of tuples is turned into a numpy record array, otherwise the original list of tuples is returned.

Warning

Potential BUG: if there are memory issues then it can happen that we just silently fall back to a tuple even though calling code expects a recarray; because we swallowed ANY exception the caller will never know

The last cachesize queries are cached (for cache=True) and are returned directly unless the table has been modified.

Note

‘__self__’ is substituted with the table name. See the doc string of the SELECT() method for more details.

sql_index(index_name, column_names, unique=True)¶: Add a named index on given columns to improve performance.

sql_select(fields, *args, **kwargs)¶

Execute a simple SQL SELECT statement and returns values as new numpy rec array.

The arguments fields and the additional optional arguments are simply concatenated with additional SQL statements according to the template:

SELECT <fields> FROM __self__ [args]

The simplest fields argument is "*".

Example:

Create a recarray in which students with average grade less than 3 are listed:

result = T.SELECT("surname, subject, year, avg(grade) AS avg_grade",
               "WHERE avg_grade < 3", "GROUP BY surname,subject",
               "ORDER BY avg_grade,surname")

The resulting SQL would be:

SELECT surname, subject, year, avg(grade) AS avg_grade FROM __self__
     WHERE avg_grade < 3
     GROUP BY surname,subject
     ORDER BY avg_grade,surname

Note how one can use aggregate functions such avg().

The string ‘__self__’ is automatically replaced with the table name (T.name); this can be used for cartesian products such as

LEFT JOIN __self__ WHERE ...

Note

See the documentation for sql() for more details on the available keyword arguments and the use of ? parameter interpolation.

Example¶

>>> from recsql import SQLarray
>>> import numpy
>>> a = numpy.rec.fromrecords(numpy.arange(100).reshape(25,4), names='a,b,c,d')
>>> Q = SQLarray('my_name', a)
>>> print repr(Q.recarray)
rec.array([(0, 1, 2, 3), (4, 5, 6, 7), (8, 9, 10, 11), (12, 13, 14, 15),
       (16, 17, 18, 19), (20, 21, 22, 23), (24, 25, 26, 27),
       (28, 29, 30, 31), (32, 33, 34, 35), (36, 37, 38, 39),
       (40, 41, 42, 43), (44, 45, 46, 47), (48, 49, 50, 51),
       (52, 53, 54, 55), (56, 57, 58, 59), (60, 61, 62, 63),
       (64, 65, 66, 67), (68, 69, 70, 71), (72, 73, 74, 75),
       (76, 77, 78, 79), (80, 81, 82, 83), (84, 85, 86, 87),
       (88, 89, 90, 91), (92, 93, 94, 95), (96, 97, 98, 99)],
      dtype=[('a', '<i4'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4')])
>>> Q.SELECT('*', 'WHERE a < 10 AND b > 5')
rec.array([(8, 9, 10, 11)],
    dtype=[('a', '<i4'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4')])
# creating new SQLarrays:
>>> R = Q.selection('a < 20 AND b > 5')
>>> print R
<recsql.sqlarray.SQLarray object at 0x...>

Additional SQL functions¶

Note that the SQL database that is used as the backend for SQLarray has a few additional functions defined in addition to the standard SQL available in sqlite. These can be used in SELECT statements and often avoid post-processing of record arrays in python. It is relatively straightforward to add new functions (see the source code and in particular the recsql.sqlarray.SQLarray._init_sql_functions() method; the functions themselves are defined in the module recsql.sqlfunctions).

Simple SQL functions¶

Simple functions transform a single input value into a single output value:

Expression	SQL equivalent
y = f(x)	SELECT f(x) AS y

Additional simple functions have been defined:

Simple SQL f()	description
sqrt(x)	square root `math.sqrt()`
pow(x,y)	power x**y
regexp(pattern,string)	string REGEXP pattern
match(pattern,string)	string MATCH pattern (anchored REGEXP)
fformat(format,x)	string formatting of a single value format % x

Aggregate SQL functions¶

Aggregate functions combine data from a query; they are typically used with a ‘GROUP BY col’ clause. They can be thought of as numpy ufuncs:

Expression	SQL equivalent
y = f(x1,x2,...xN)	SELECT f(x) AS y ... GROUP BY x

For completeness, the tablke also lists sqlite built-in aggregate functions:

Simple aggregate f()	description
avg(x)	mean [sqlite builtin]
std(x)	standard deviation (using N-1 variance)
median(x)	median of the data (see `numpy.median()`)
min(x)	minimum [sqlite builtin]
max(x)	maximum [sqlite builtin]

PyAggregate SQL functions¶

PyAggregate functions act on a list of data points in the same way as ordinary aggregate functions but they return python objects such as numpy arrays, or tuples of numpy arrays (eg bin edges and histogram). In order to make this work, specific types have to be declared when returning the results:

For instance, the histogram() function returns a python Object, the tuple (histogram, edges):

a.sql('SELECT histogram(x) AS "x [Object]" FROM __self__', asrecarray=False)

The return type (‘Object’) needs to be declared with the 'AS "x [Object]"' syntax (note the quotes). (See more details in the sqlite documentation under adapters and converters.) The following table lists all PyAggregate functions that have been defined:

PyAggregate	type	signature; description
array	NumpyArray	array(x); a standard `numpy.array()`
histogram	Object	histogram(x,nbins,xmin,xmax); histogram x in nbins evenly spaced bins between xmin and xmax
distribution	Object	distribution(x,nbins,xmin,xmax); normalized histogram whose integral gives 1
meanhistogram	Object	meanhistogram(x,y,nbins,xmin,xmax); histogram data points y along x and average all y in each bin
stdhistogram	Object	stdhistogram(x,y,nbins,xmin,xmax); give the standard deviation (from N-1 variance) std(y) = sqrt(Var(y)) with Var(y) = <(y-<y>)^2>
medianhistogram	Object	medianhistogram((x,y,nbins,xmin,xmax); median(y)
minhistogram	Object	minhistogram((x,y,nbins,xmin,xmax); min(y)
maxhistogram	Object	maxhistogram((x,y,nbins,xmin,xmax); max(y)
zscorehistogram	Object	zscorehistogram((x,y,nbins,xmin,xmax); <abs(y-<y>)>/std(y)

Examples of using types in tables¶

The following show how to use the special types.

Declare types as ‘NumpyArray’:

a.sql("CREATE TABLE __self__(a NumpyArray)")

Then you can simply insert python objects (type(my_array) == numpy.ndarray):

a.sql("INSERT INTO __self__(a) values (?)", (my_array,))

When returning results of declared columns one does not have to do anything

(my_array,) = a.sql("SELECT a FROM __self__")

although one can also do

(my_array,) = q.sql('SELECT a AS "a [NumpyArray]" FROM __self__')

but when using a PyAggregate the type must be declared:

a.sql('SELECT histogram(x,10,0.0,1.5) as "hist [Object]" FROM __self__')

class recsql.SQLarray(name=None, records=None, filename=None, columns=None, cachesize=5, connection=None, is_tmp=False, **kwargs)

A SQL table that returns (mostly) rec arrays.

SQLarray([name[, records[, columns[, cachesize=5, connection=None]]]])

Arguments :

name: table name (can be referred to as ‘__self__’ in SQL queries)
records: numpy record array that describes the layout and initializes the table OR any iterable (and then columns must be set, too) OR a string that contains a single, simple reStructured text table (and the table name is set from the table name in the reST table.) If None then simply associate with existing table name.
filename: Alternatively to records, read a reStructured table from filename.
columns: sequence of column names (only used if records does not have attribute dtype.names) [None]
cachesize: number of (query, result) pairs that are cached [5]
connection: If not None, reuse this connection; this adds a new table to the same database, which allows more complicated queries with cross-joins. The table’s connection is available as the attribute T.connection. [None]
is_tmp: True: create a tmp table; False: regular table in db [False]

Bugs :

InterfaceError: Error binding parameter 0 - probably unsupported type

In this case the recarray contained types such as numpy.int64 that are not understood by sqlite. Either convert the data manually (by setting the numpy dtypes yourself on the recarray, or better: feed a simple list of tuples (“records”) to this class in records. Make sure that these tuples only contain standard python types. Together with records you will also have to supply the names of the data columns in the keyword argument columns.

If you are reading from a file then it might be simpler to use recsql.sqlarray.SQLarray_fromfile().

SELECT(fields, *args, **kwargs)

Execute a simple SQL SELECT statement and returns values as new numpy rec array.

The arguments fields and the additional optional arguments are simply concatenated with additional SQL statements according to the template:

SELECT <fields> FROM __self__ [args]

The simplest fields argument is "*".

Example:

Create a recarray in which students with average grade less than 3 are listed:

result = T.SELECT("surname, subject, year, avg(grade) AS avg_grade",
               "WHERE avg_grade < 3", "GROUP BY surname,subject",
               "ORDER BY avg_grade,surname")

The resulting SQL would be:

SELECT surname, subject, year, avg(grade) AS avg_grade FROM __self__
     WHERE avg_grade < 3
     GROUP BY surname,subject
     ORDER BY avg_grade,surname

Note how one can use aggregate functions such avg().

The string ‘__self__’ is automatically replaced with the table name (T.name); this can be used for cartesian products such as

LEFT JOIN __self__ WHERE ...

Note

See the documentation for sql() for more details on the available keyword arguments and the use of ? parameter interpolation.

limits(variable): Return minimum and maximum of variable across all rows of data.

merge(recarray, columns=None)

Merge another recarray with the same columns into this table.

Arguments :	recarray numpy record array that describes the layout and initializes the table
Returns :	n number of inserted rows
Raises :	Raises an exception if duplicate and incompatible data exist in the main table and the new one.

merge_table(name)

Merge an existing table in the database with the __self__ table.

Executes as 'INSERT INTO __self__ SELECT * FROM <name>'. However, this method is probably used less often than the simpler merge().

Arguments :	name name of the table in the database (must be compatible with __self__)
Returns :	n number of inserted rows

recarray: Return underlying SQL table as a read-only record array.

selection(SQL, parameters=None, **kwargs)

Return a new SQLarray from a SELECT selection.

This is a very useful method because it allows one to build complicated selections and essentially new tables from existing data.

Examples:

s = selection('a > 3')
s = selection('a > ?', (3,))
s = selection('SELECT * FROM __self__ WHERE a > ? AND b < ?', (3, 10))

sql(SQL, parameters=None, asrecarray=True, cache=True)

Execute sql statement.

Arguments :

SQL : string: Full SQL command; can contain the ? place holder so that values supplied with the parameters keyword can be interpolated using the pysqlite interface.
parameters : tuple: Parameters for ? interpolation.
asrecarray : boolean: True: return a numpy.recarray if possible; False: return records as a list of tuples. [True]
cache : boolean: Should the results be cached? Set to False for large queries to avoid memory issues. Queries with ? place holders are never cached. [True]

Warning

There are no sanity checks applied to the SQL.

If possible, the returned list of tuples is turned into a numpy record array, otherwise the original list of tuples is returned.

Warning

Potential BUG: if there are memory issues then it can happen that we just silently fall back to a tuple even though calling code expects a recarray; because we swallowed ANY exception the caller will never know

The last cachesize queries are cached (for cache=True) and are returned directly unless the table has been modified.

Note

‘__self__’ is substituted with the table name. See the doc string of the SELECT() method for more details.

sql_index(index_name, column_names, unique=True): Add a named index on given columns to improve performance.

sql_select(fields, *args, **kwargs)

Execute a simple SQL SELECT statement and returns values as new numpy rec array.

The arguments fields and the additional optional arguments are simply concatenated with additional SQL statements according to the template:

SELECT <fields> FROM __self__ [args]

The simplest fields argument is "*".

Example:

Create a recarray in which students with average grade less than 3 are listed:

result = T.SELECT("surname, subject, year, avg(grade) AS avg_grade",
               "WHERE avg_grade < 3", "GROUP BY surname,subject",
               "ORDER BY avg_grade,surname")

The resulting SQL would be:

SELECT surname, subject, year, avg(grade) AS avg_grade FROM __self__
     WHERE avg_grade < 3
     GROUP BY surname,subject
     ORDER BY avg_grade,surname

Note how one can use aggregate functions such avg().

The string ‘__self__’ is automatically replaced with the table name (T.name); this can be used for cartesian products such as

LEFT JOIN __self__ WHERE ...

Note

See the documentation for sql() for more details on the available keyword arguments and the use of ? parameter interpolation.

recsql.SQLarray_fromfile(filename, **kwargs)

Create a SQLarray from filename.

Uses the filename suffix to detect the contents:

rst, txt: restructure text (see recsql.rest_table
csv: comma-separated (see recsql.csv_table)

Arguments :	filename name of the file that contains the data with the appropriate file extension kwargs additional arguments for `SQLarray` additional arguments `recsql.csv_table.Table2array` or `recsql.rest_table.Table2array` such as mode or autoncovert.

User documentation¶

RecSQL package¶

Important functions and classes¶

Example¶

Additional SQL functions¶

Simple SQL functions¶

Aggregate SQL functions¶

PyAggregate SQL functions¶

Examples of using types in tables¶

Table Of Contents

Previous topic

Next topic

This Page

Navigation

User documentation¶

RecSQL package¶

Important functions and classes¶

Example¶

Additional SQL functions¶

Simple SQL functions¶

Aggregate SQL functions¶

PyAggregate SQL functions¶

Examples of using types in tables¶

Table Of Contents

Previous topic

Next topic

This Page

Quick search

Navigation