RecSQL is a simple module that provides a numpy.record array frontend to an underlying SQLite table.
The SQLarray object populates a SQL table from a numpy record array, a iterable that supplies table records, or a string that contains an especially simple reStructured text table. The SQL table is held in memory and functions are provided to run SQL queries and commands on the underlying database. Queries return record arrays if possible (although a flag can explicitly change this).
Query results are cached to improve performance. This can be disabled (which is recommened for large data sets).
The SQL table is named on initialization. Later one can refer to this table by the name or the magic name __self__ in SQL statements. Additional tables can be added to the same database (by using the connection keyword of the constructor)
The recsql.rest_table module uses the base functionality to parse a restructured text table from a string (such as a doc string) and returns a nicely structured table. This allows for use of parameters that are documented in the doc strings.
See also
PyTables is a high-performance interface to table data. In most cases you will probably better off in the long run using PyTables than recSQL.
A SQLarray can be constructed by either reading data from a CSV file or reST table with the SQLarray_fromfile() function or constructed directly from a numpy.recarray via the SQLarray constructor.
Create a SQLarray from filename.
Arguments : |
|
---|
A SQL table that returns (mostly) rec arrays.
Arguments : |
|
---|---|
Bugs : |
|
Execute a simple SQL SELECT statement and returns values as new numpy rec array.
The arguments fields and the additional optional arguments are simply concatenated with additional SQL statements according to the template:
SELECT <fields> FROM __self__ [args]
The simplest fields argument is "*".
Create a recarray in which students with average grade less than 3 are listed:
result = T.SELECT("surname, subject, year, avg(grade) AS avg_grade",
"WHERE avg_grade < 3", "GROUP BY surname,subject",
"ORDER BY avg_grade,surname")
The resulting SQL would be:
SELECT surname, subject, year, avg(grade) AS avg_grade FROM __self__
WHERE avg_grade < 3
GROUP BY surname,subject
ORDER BY avg_grade,surname
Note how one can use aggregate functions such avg().
The string ‘__self__’ is automatically replaced with the table name (T.name); this can be used for cartesian products such as
LEFT JOIN __self__ WHERE ...
Note
See the documentation for sql() for more details on the available keyword arguments and the use of ? parameter interpolation.
Return minimum and maximum of variable across all rows of data.
Merge another recarray with the same columns into this table.
Arguments : |
|
---|---|
Returns : | n number of inserted rows |
Raises : | Raises an exception if duplicate and incompatible data exist in the main table and the new one. |
Merge an existing table in the database with the __self__ table.
Executes as 'INSERT INTO __self__ SELECT * FROM <name>'. However, this method is probably used less often than the simpler merge().
Arguments : | name name of the table in the database (must be compatible with __self__) |
---|---|
Returns : | n number of inserted rows |
Return underlying SQL table as a read-only record array.
Return a new SQLarray from a SELECT selection.
This is a very useful method because it allows one to build complicated selections and essentially new tables from existing data.
Examples:
s = selection('a > 3')
s = selection('a > ?', (3,))
s = selection('SELECT * FROM __self__ WHERE a > ? AND b < ?', (3, 10))
Execute sql statement.
Arguments : |
|
---|
Warning
There are no sanity checks applied to the SQL.
If possible, the returned list of tuples is turned into a numpy record array, otherwise the original list of tuples is returned.
Warning
Potential BUG: if there are memory issues then it can happen that we just silently fall back to a tuple even though calling code expects a recarray; because we swallowed ANY exception the caller will never know
The last cachesize queries are cached (for cache=True) and are returned directly unless the table has been modified.
Note
‘__self__’ is substituted with the table name. See the doc string of the SELECT() method for more details.
Add a named index on given columns to improve performance.
Execute a simple SQL SELECT statement and returns values as new numpy rec array.
The arguments fields and the additional optional arguments are simply concatenated with additional SQL statements according to the template:
SELECT <fields> FROM __self__ [args]
The simplest fields argument is "*".
Create a recarray in which students with average grade less than 3 are listed:
result = T.SELECT("surname, subject, year, avg(grade) AS avg_grade",
"WHERE avg_grade < 3", "GROUP BY surname,subject",
"ORDER BY avg_grade,surname")
The resulting SQL would be:
SELECT surname, subject, year, avg(grade) AS avg_grade FROM __self__
WHERE avg_grade < 3
GROUP BY surname,subject
ORDER BY avg_grade,surname
Note how one can use aggregate functions such avg().
The string ‘__self__’ is automatically replaced with the table name (T.name); this can be used for cartesian products such as
LEFT JOIN __self__ WHERE ...
Note
See the documentation for sql() for more details on the available keyword arguments and the use of ? parameter interpolation.
>>> from recsql import SQLarray
>>> import numpy
>>> a = numpy.rec.fromrecords(numpy.arange(100).reshape(25,4), names='a,b,c,d')
>>> Q = SQLarray('my_name', a)
>>> print repr(Q.recarray)
rec.array([(0, 1, 2, 3), (4, 5, 6, 7), (8, 9, 10, 11), (12, 13, 14, 15),
(16, 17, 18, 19), (20, 21, 22, 23), (24, 25, 26, 27),
(28, 29, 30, 31), (32, 33, 34, 35), (36, 37, 38, 39),
(40, 41, 42, 43), (44, 45, 46, 47), (48, 49, 50, 51),
(52, 53, 54, 55), (56, 57, 58, 59), (60, 61, 62, 63),
(64, 65, 66, 67), (68, 69, 70, 71), (72, 73, 74, 75),
(76, 77, 78, 79), (80, 81, 82, 83), (84, 85, 86, 87),
(88, 89, 90, 91), (92, 93, 94, 95), (96, 97, 98, 99)],
dtype=[('a', '<i4'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4')])
>>> Q.SELECT('*', 'WHERE a < 10 AND b > 5')
rec.array([(8, 9, 10, 11)],
dtype=[('a', '<i4'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4')])
# creating new SQLarrays:
>>> R = Q.selection('a < 20 AND b > 5')
>>> print R
<recsql.sqlarray.SQLarray object at 0x...>
Note that the SQL database that is used as the backend for SQLarray has a few additional functions defined in addition to the standard SQL available in sqlite. These can be used in SELECT statements and often avoid post-processing of record arrays in python. It is relatively straightforward to add new functions (see the source code and in particular the recsql.sqlarray.SQLarray._init_sql_functions() method; the functions themselves are defined in the module recsql.sqlfunctions).
Simple functions transform a single input value into a single output value:
Expression | SQL equivalent |
---|---|
y = f(x) | SELECT f(x) AS y |
Additional simple functions have been defined:
Simple SQL f() | description |
---|---|
sqrt(x) | square root math.sqrt() |
pow(x,y) | power x**y |
regexp(pattern,string) | string REGEXP pattern |
match(pattern,string) | string MATCH pattern (anchored REGEXP) |
fformat(format,x) | string formatting of a single value format % x |
Aggregate functions combine data from a query; they are typically used with a ‘GROUP BY col’ clause. They can be thought of as numpy ufuncs:
Expression | SQL equivalent |
---|---|
y = f(x1,x2,...xN) | SELECT f(x) AS y ... GROUP BY x |
For completeness, the tablke also lists sqlite built-in aggregate functions:
Simple aggregate f() | description |
---|---|
avg(x) | mean [sqlite builtin] |
std(x) | standard deviation (using N-1 variance) |
median(x) | median of the data (see numpy.median()) |
min(x) | minimum [sqlite builtin] |
max(x) | maximum [sqlite builtin] |
PyAggregate functions act on a list of data points in the same way as ordinary aggregate functions but they return python objects such as numpy arrays, or tuples of numpy arrays (eg bin edges and histogram). In order to make this work, specific types have to be declared when returning the results:
For instance, the histogram() function returns a python Object, the tuple (histogram, edges):
a.sql('SELECT histogram(x) AS "x [Object]" FROM __self__', asrecarray=False)
The return type (‘Object’) needs to be declared with the 'AS "x [Object]"' syntax (note the quotes). (See more details in the sqlite documentation under adapters and converters.) The following table lists all PyAggregate functions that have been defined:
PyAggregate | type | signature; description |
---|---|---|
array | NumpyArray | array(x); a standard numpy.array() |
histogram | Object | histogram(x,nbins,xmin,xmax); histogram x in nbins evenly spaced bins between xmin and xmax |
distribution | Object | distribution(x,nbins,xmin,xmax); normalized histogram whose integral gives 1 |
meanhistogram | Object | meanhistogram(x,y,nbins,xmin,xmax); histogram data points y along x and average all y in each bin |
stdhistogram | Object | stdhistogram(x,y,nbins,xmin,xmax); give the standard deviation (from N-1 variance) std(y) = sqrt(Var(y)) with Var(y) = <(y-<y>)^2> |
medianhistogram | Object | medianhistogram((x,y,nbins,xmin,xmax); median(y) |
minhistogram | Object | minhistogram((x,y,nbins,xmin,xmax); min(y) |
maxhistogram | Object | maxhistogram((x,y,nbins,xmin,xmax); max(y) |
zscorehistogram | Object | zscorehistogram((x,y,nbins,xmin,xmax); <abs(y-<y>)>/std(y) |
The following show how to use the special types.
Declare types as ‘NumpyArray’:
a.sql("CREATE TABLE __self__(a NumpyArray)")
Then you can simply insert python objects (type(my_array) == numpy.ndarray):
a.sql("INSERT INTO __self__(a) values (?)", (my_array,))
When returning results of declared columns one does not have to do anything
(my_array,) = a.sql("SELECT a FROM __self__")
although one can also do
(my_array,) = q.sql('SELECT a AS "a [NumpyArray]" FROM __self__')
but when using a PyAggregate the type must be declared:
a.sql('SELECT histogram(x,10,0.0,1.5) as "hist [Object]" FROM __self__')
A SQL table that returns (mostly) rec arrays.
Arguments : |
|
---|---|
Bugs : |
|
Execute a simple SQL SELECT statement and returns values as new numpy rec array.
The arguments fields and the additional optional arguments are simply concatenated with additional SQL statements according to the template:
SELECT <fields> FROM __self__ [args]
The simplest fields argument is "*".
Create a recarray in which students with average grade less than 3 are listed:
result = T.SELECT("surname, subject, year, avg(grade) AS avg_grade",
"WHERE avg_grade < 3", "GROUP BY surname,subject",
"ORDER BY avg_grade,surname")
The resulting SQL would be:
SELECT surname, subject, year, avg(grade) AS avg_grade FROM __self__
WHERE avg_grade < 3
GROUP BY surname,subject
ORDER BY avg_grade,surname
Note how one can use aggregate functions such avg().
The string ‘__self__’ is automatically replaced with the table name (T.name); this can be used for cartesian products such as
LEFT JOIN __self__ WHERE ...
Note
See the documentation for sql() for more details on the available keyword arguments and the use of ? parameter interpolation.
Return minimum and maximum of variable across all rows of data.
Merge another recarray with the same columns into this table.
Arguments : |
|
---|---|
Returns : | n number of inserted rows |
Raises : | Raises an exception if duplicate and incompatible data exist in the main table and the new one. |
Merge an existing table in the database with the __self__ table.
Executes as 'INSERT INTO __self__ SELECT * FROM <name>'. However, this method is probably used less often than the simpler merge().
Arguments : | name name of the table in the database (must be compatible with __self__) |
---|---|
Returns : | n number of inserted rows |
Return underlying SQL table as a read-only record array.
Return a new SQLarray from a SELECT selection.
This is a very useful method because it allows one to build complicated selections and essentially new tables from existing data.
Examples:
s = selection('a > 3')
s = selection('a > ?', (3,))
s = selection('SELECT * FROM __self__ WHERE a > ? AND b < ?', (3, 10))
Execute sql statement.
Arguments : |
|
---|
Warning
There are no sanity checks applied to the SQL.
If possible, the returned list of tuples is turned into a numpy record array, otherwise the original list of tuples is returned.
Warning
Potential BUG: if there are memory issues then it can happen that we just silently fall back to a tuple even though calling code expects a recarray; because we swallowed ANY exception the caller will never know
The last cachesize queries are cached (for cache=True) and are returned directly unless the table has been modified.
Note
‘__self__’ is substituted with the table name. See the doc string of the SELECT() method for more details.
Add a named index on given columns to improve performance.
Execute a simple SQL SELECT statement and returns values as new numpy rec array.
The arguments fields and the additional optional arguments are simply concatenated with additional SQL statements according to the template:
SELECT <fields> FROM __self__ [args]
The simplest fields argument is "*".
Create a recarray in which students with average grade less than 3 are listed:
result = T.SELECT("surname, subject, year, avg(grade) AS avg_grade",
"WHERE avg_grade < 3", "GROUP BY surname,subject",
"ORDER BY avg_grade,surname")
The resulting SQL would be:
SELECT surname, subject, year, avg(grade) AS avg_grade FROM __self__
WHERE avg_grade < 3
GROUP BY surname,subject
ORDER BY avg_grade,surname
Note how one can use aggregate functions such avg().
The string ‘__self__’ is automatically replaced with the table name (T.name); this can be used for cartesian products such as
LEFT JOIN __self__ WHERE ...
Note
See the documentation for sql() for more details on the available keyword arguments and the use of ? parameter interpolation.
Create a SQLarray from filename.
Arguments : |
|
---|