ARK2/Database
Database
In ARK1, PDO is used to directly access only MySQL databases, and DB access statements are widely spread through the code base and manually coded. While PDO abstracts the connection, it doesn't abstract the SQL dialect so adding support for other databases such as Postgres or SQLite would require considerable work. It also makes migration to proper transaction support and performance improvements difficult, and is a security risk due to programmer error.
A Database Abstraction Layer (DAL) can abstract away the differences in SQL between database systems, and also provide Query Builders, Schema Management, and Migration tools to address the other issues. Most are built on PDO and can seamlessly integrate with legacy code to make for an easier migration path. The Doctrine DBAL DAL has been selected due to it's use in Symfony and the ORM extension being a Data Mapping ORM.
Database Abstraction
Currently, PDO is used to directly access only MySQL databases, and DB access statements are widely spread through the code base and manually coded. While PDO abstracts the connection, it doesn't abstract the SQL dialect so adding support for other databases such as Postgres or SQLite would require considerable work. It also makes migration to proper transaction support and performance improvements difficult, and is a security risk due to programmer error. A Database Abstraction Layer (DAL) can abstract away the differences in SQL between database systems, and also provide Query Builders, Schema Management, and Migration tools to address the other issues. Most are built on PDO and can seamlessly integrate with legacy code to make for an easier migration path.
Longer term, full OO code, most frameworks, and many components use an ORM to map relational data to objects. A key part of choosing a framework or component eco-system is the ORM it uses. Most ORMs however use the Active Record pattern which cannot map onto the existing ARK data model. ARK would require a Data Mapper ORM to access the legacy database structure. While using multiple ORMs would be possible, it is not recommended due to performance overhead and potential contention.
Doctrine ORM is the only PHP Data Mapper available, and is built on the Doctrine DBAL DAL. Doctrine is widely use and under active development, being the main ORM for the Symfony eco-system as well as many independent components. DBAL also provides the full set of required Drivers, Query Builder, Schema Management and Migration tools to abstract access to the required databases.
Automated database creation and upgrade will be implemented using Doctrine Migrations:
- The core database schema will be defined in doctrine-xml (possibly initially reversed engineered from existing v1.x database). This schema will be stored in the build tools folder and will not be deployed.
- A custom Doctrine\DBAL\Migrations\Provider\SchemaProvider class will use the doctrine-xml schema to generate the full schema for new databases.
- A custom Doctrine\DBAL\Migrations\Provider\SchemaProvider class will use the doctrine-xml schema to generate schema diff to automatically create the Doctrine Migration class required for each version upgrade.
- The admin console will provide the standard migration tools (diff perhaps only from build console?)
- The ARK install script will call the doctrine tools to create the database
- If outstanding migrations are found after an upgrade, then the site will go into maintenance mode
The creation of tables required for custom modules will be provided via API and not via the Migrations. This will be part of the data schema creation code and not the core database code.
Multiple database connections within an ARK will be supported, with the database to be used by any data model passed in using dependency injection. This will improve the code used for import, export and migration. It will also allow for separate databases for the admin (user, config, etc) and the data which may assist with AaaS and Multi-tenancy.
Database security will be tightly controlled to prevent security breaches in the AaaS model. There will be four levels of database account:
- Database Admin - Full admin account, credentials never stored on webserver, asked for by install script
- AaaS Admin - Limited admin account for AaaS client admin, only allowed to create new database and add required users, credentials never stored on webserver, passed in to install script by portal script
- ARK Admin - Limited admin account for ARK instance admin (create module tables only?), created by install script, credentials never stored on server, requested from user when needed (create modules, etc)
- ARK User - Limited user account for read/write data access only, created by install script, credentials stored on webserver
ID Generation
A number of possible strategies exist for the allocation of Item numbers, which must be performed in a way abstracted from the database server used, and allowing for the use of offline mobile devices.
- Monotonically increasing sequence within any parent (default)
- Monotonically increasing sequence within any parent and within a predefined range determined by some compulsory attribute of the object, e.g. Trench Number
- Manually allocated IDs in any sequence, but still with the option for default monotonic IDs
Due to the cascading of the generated ID throughout the database structure, the ID needs to be pre-allocated before the update can be performed. However, this risks leaving IDs allocated but unused should the actual update transaction be rolled back. The update to the sequence cannot be reversed, as it may have been further updated by another transaction. Instead such orphans will need to be tracked for later recycling.
Taking all these factors into account, the ID Allocation will function as follows:
- On creation of a new Item, the Entity Manager will be called to create a new ID
- The EM will decide which sequence to use for the ID
- The EM creates a new update transaction on the database and looks in the sequence lock table for any IDs to be recycled
- If there is one to be recycled, it is updated to locked and the ID returned
- If there are none to be recycled, then the EM updates the sequence in the sequence table, creates a new sequence lock entry, and returns the allocated ID
- When the EM performs the actual persistence of the object, then the sequence lock will be deleted
- If the insert transactions fails and is rolled back, then the EM will update the sequence lock to be recycled
For offline mobile use, there are two possible strategies:
- Where the ID carries no inherent meaning, or is not used in the field, then a temporary hidden ID can be used until sync when real IDs are allocated using the standard method above
- Where the ID has meaning and is used in the field, such as numbering context bags or sample buckets, then the real IDs will have to be pre-allocated to the mobile device
The API will provide a method for a mobile device to 'check-out' a block of numbers for use, and the ability to check them back in again:
- The device requests a new block of IDs
- The EM creates a new update transaction on the database, updates the sequence table, and creates an entry in the sequence reserve table for the allocated range with a token which is returned to the device
- The device uses the allocated range when creating new items
- If the device runs out of IDs the user is warned that subsequent IDs may be duplicated
- On syncing, the device tells the API what ID it allocated and the token of the reserved sequence
- On inserting, the EM uses the sequence reserve table instead of the main sequence table to track the IDs
- If the device attempts to use IDs outside the reserved sequence then and those IDs are free, then they will be allowed, otherwise the updates will be rejected/quarantined
- When the sync is finished, the user will be asked if they want to recycle any left-over IDs, or if they want more allocated
Chains
Chains are a technique in ARK for storing hierarchical tree data in relation form. This is done using an adjacency table method. This is a problem in ARK2 for a number of reasons:
- Knowledge of when data is stored in chains is held solely in the subform code that creates or reads the chain, which causes issues when the schema will need to represent the data structure without the subform
- Chains are an internal implementation detail, their existence should not 'leak' into the schema or api for external clients, they must be free to choose their own storage solution
- Access to chains can be slow and inefficient, especially walking down a tree when you don't know how 'wide' it is (i.e. what data fragments it has as descendants).
The data schema will not represent data as chains, trees or graphs. Instead the schema will merely represent the inherent hierarchical structure of the data using groups/objects/lists. The data persistence layer will know to persist those repeating groups in its model. In the case of the current ARK SQL database this means repeating groups will need to be stored as chains.
A number of other techniques exist to make reading/writing of trees faster, such as Nested Sets. The Closure Table technique has been chosen for a number of reasons:
- Is a simple extension to the existing Adjacency Table method
- Is fast to both read and write
- An entire tree can be read with a single SQL query
- Supports storing DAGs, i.e. Harris Matrix
The proposed implementation will be:
- Chains may be renamed as Graphs?
- Nodes and Edges are stored separately
- The Nodes in a Graph are either Items or Fragments, but not both
- The Nodes continue to be stored in their current tables
- Edges are stored in a new ark_data_graph table
- Item Graphs store their itemkey/itemvalue Edges in the ark_data_graph table
- Fragment Graphs store their table/id Edges in the ark_data_graph table
- Descendent Fragment Nodes will no longer be keyed by their direct parent table/id, but will instead be keyed by their root itemkey/itemvalue. This allows faster access to all nodes, and means the nodes never need to be updated whenever the graph is altered.
- Root Fragment Nodes may need to carry a flag to indicate they are a root, otherwise a lookup is required every time on the graph table.
New code will be needed to create and maintain graphs in place of the current chain code. Migration code will be required to build the graph for existing data and update the nodes.
References:
- SQL Anti-patterns book
- http://technobytz.com/closure_table_store_hierarchical_data.html
- http://people.apache.org/~dongsheng/horak/100309_dag_structures_sql.pdf
- http://timwi.blogspot.co.uk/2010/03/query-tree-structures-and-dags-in-sql.html
- http://dirtsimple.org/2010/11/simplest-way-to-do-tree-based-queries.html
- http://mikehillyer.com/articles/managing-hierarchical-data-in-mysql/
- https://github.com/Atlantic18/DoctrineExtensions/blob/master/doc/tree.md
- https://github.com/EspadaV8/ClosureTable