Tuesday 19 February 2013

XML way of schemaless PostgreSQL

Ever since I wrote my previous PostgreSQL-praising post, I've been toying with the idea of writing another one about managing data in PostgreSQL in a schemaless way. At first I just wanted to introduce readers to the internal "xml" data type and its auxiliary XML functions, but the latest PostgreSQL release (9.2), with its brand new "json" feature, practically obliged me to cover that as well.

Any developer familiar with a typical RDBMS has probably faced a problem that a strict schema makes hard and a flexible one would solve faster. Such problems appear in different settings: brand new software, like a social game, that needs the flexibility to store data in a rapidly changing environment; legacy software, where rewriting it to fit a strict relational schema would cost too much; or even a simple program that just needs to store data in a plain machine-readable format.

In any of these cases, the habitual approach is to create a table with a unique 'id' column and an additional text column named, for example, 'document'. We write data there and, on demand, simply read it back. In that setup, however, your application alone is responsible for all data validation whenever it reads or writes. Moreover, it's likely that over time your program will also have to extract certain values from the 'document' and index them in special third-party software, which returns bare identifiers that are then used to fetch the full documents from the database. To store such unstructured data, developers tend to choose between the two famous text-based standards, because both are backed by a number of powerful libraries. XML comes with a pack of powerful standardized extensions: XPath, XSD, XSLT. JSON can represent structures just as sophisticated as XML while taking less storage, although it has no approved validation or transformation standards at the moment.

So, if you recognized yourself somewhere in the explanations above, you might like the handful of PostgreSQL features that address this. PostgreSQL is more than simple information storage, and these features can dramatically cut the time you'd otherwise spend writing code and tuning third-party tools. What are those magic features that help us create schemaless tables, however odd that sounds? They are: the internal 'xml' data type with a bunch of assistant functions, the 'json' type with its support functions, functional and partial indexes, and, finally, the plv8 extension. I'm going to look at all of them in this post.

OK, let's start with XML. There are numerous approaches to working with this standard, and we can benefit from using them separately or together. First of all, PostgreSQL has an internal "xml" type, described in the official documentation. In short, you can produce an entire XML document or just a fragment of it using the xmlparse function. In contrast to a "document", which has exactly one root node, a "fragment" (content) can have more than one top-level node. Here's a simple example:
SELECT xmlparse(DOCUMENT 
'<?xml version="1.0"?><root><achievement id="1"><description/></achievement></root>');
SELECT xmlparse(CONTENT 
'<item id="1"><description>cute item</description><class>jewelry</class></item>');
We can use a "document" if we are about storing full document or a "content" if it's better to assemble it later from different parts. It's worth noting that there is a convenient syntax sugar allowing to parse text to "xml" type using explicit postgres type-casting:
SELECT '<?xml version="1.0"?><rt><achievement id="1"><description/></item></rt>'::xml
To determine which of the two is stored in a field, you can use the expression xmlvalue IS DOCUMENT in your query.
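For instance, a couple of throwaway checks (the literals are arbitrary, purely for illustration):
SELECT '<root><item/></root>'::xml IS DOCUMENT; -- true: a single root element
SELECT '<item/><item/>'::xml IS DOCUMENT;       -- false: two top-level nodes, i.e. content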
There is also a function that produces text from xml, named xmlserialize. A short example is below:
xmlserialize(DOCUMENT xmlvalue AS text);
-- or the explicit type-cast version
xmlvalue::text
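A concrete, runnable version of both forms (the literal is an arbitrary value):
SELECT xmlserialize(CONTENT '<item id="1"/>'::xml AS text);
SELECT '<item id="1"/>'::xml::text;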
Now that we know what the "xml" type is, let's continue with the more interesting part: XML functions. The first set worth mentioning is the xml_is_well_formed* family. XPath queries, which I'm going to describe a little bit later, simply won't work over XML that is not well-formed. Checking whether your XML document (or plain content) is well-formed looks like this:
SELECT xml_is_well_formed(
'<?xml version="1.0"?><root><achievement id="1"><description/></achievement></root>');
SELECT xml_is_well_formed_document(
'<?xml version="1.0"?><root><achievement id="1"><description/></achievement></root>'); -- true
SELECT xml_is_well_formed_content('<achievement id="1">'); -- false: the tag is never closed
XML whose validity has not been confirmed by these functions is not accepted as input by any of the functions that follow.
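In particular, even the explicit cast rejects such input; this throwaway example fails outright:
SELECT '<achievement id="1">'::xml;
-- ERROR:  invalid XML content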

The functions above only perform XML validation; another interesting question is how to produce a valid XML document with PostgreSQL itself. Let's start the explanation with some simple functions that generate particular parts of an XML document.
First, I want to tell you about the xmlroot function, which serves to prepend your XML document with the version header (the XML declaration), without which the document is not a complete standalone one. Using the function looks like this:
SELECT xmlroot('<root/>'::xml, version '1.0', standalone yes);
and produces the following text:
<?xml version="1.0" standalone="yes"?>
<root/>
Building an XML document by simple string concatenation is not good practice, so it's better to use the xmlconcat function. It has a simple and predictable syntax for concatenating two or more XML values together:
SELECT xmlconcat('<json/>'::xml, '<content/>'::xml, '<xml/>'::xml);
which gives us the following result:
<json/><content/><xml/> 
-- obviously that's not a valid document;
-- it must be wrapped in a single root element
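One way to do that wrapping is with the xmlelement function described a bit further below; this throwaway sketch produces a proper single-rooted document:
SELECT xmlelement(NAME root,
  xmlconcat('<json/>'::xml, '<content/>'::xml, '<xml/>'::xml));
-- <root><json/><content/><xml/></root>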
I'd also like to present the xmlforest function as a more convenient way to generate sequences of XML tags: it creates the tags from the parameters passed to it, and it can replace xmlconcat in many cases.
SELECT xmlforest('unique' as "description", 'valuable' as "cost");
Generates:
<description>unique</description><cost>valuable</cost>
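xmlforest really shines with column values, where each column name becomes a tag name by default; for example, against the standard information_schema (the choice of pg_attrdef here is arbitrary):
SELECT xmlforest(table_name, table_type)
FROM information_schema.tables
WHERE table_schema = 'pg_catalog' AND table_name = 'pg_attrdef';
-- <table_name>pg_attrdef</table_name><table_type>BASE TABLE</table_type>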
Another interesting function worth describing thoroughly is xmlelement. It has quite an unusual syntax, so I took the description straight from the PostgreSQL documentation and renamed the parameters to be more descriptive.
xmlelement(name element_name [, xmlattributes(value [AS attname] [, ... ])] [, content, ...])
Here "name", "xmlattributes", and "AS" are keywords, while element_name, value, attname, and content are user-defined values.
I don't want to repeat everything written in the official documentation, so I'd rather just give an example:
SELECT xmlelement(NAME achievement,
XMLATTRIBUTES(1 AS id, 'welcome' AS name),
xmlelement(NAME description, 'superb'));
This gives us the following text:
<achievement id="1" name="welcome"><description>superb</description></achievement>
There is no doubt that this is a pretty powerful way to create complex XML tags. Moreover, there's additional power to leverage: using the function's output as the content parameter of an outer call, which opens up a lot of interesting opportunities. Naturally, the result of this function can also be passed to any of the functions mentioned earlier, which lets us avoid the troublesome string-concatenation approach.
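As a throwaway illustration of that composability, here xmlforest output is fed into xmlelement as its content (the values are arbitrary):
SELECT xmlelement(NAME item,
  XMLATTRIBUTES(2 AS id),
  xmlforest('unique' AS description, 'jewelry' AS class));
-- <item id="2"><description>unique</description><class>jewelry</class></item>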

The last XML-producing function I want to present is xmlagg. It behaves as an aggregate function, appending the received values into one. Combined with the other functions, especially the "XML from text" producers, it offers huge potential for creating an evolved XML document assembled from different data in just one query. To make this clearer, here is an example of its usage:
SELECT xmlelement(name "table", 
xmlattributes(table_name as "name", table_schema as "schema"), 
xmlagg(xmlelement(name "column", xmlattributes(data_type as "type"), column_name)))
FROM information_schema.columns
WHERE table_schema = 'pg_catalog' and table_name = 'pg_attrdef'
GROUP BY table_schema, table_name;
We receive neat XML that looks like this:
<table name="pg_attrdef" schema="pg_catalog">
<column type="text">adsrc</column>
<column type="pg_node_tree">adbin</column>
<column type="smallint">adnum</column>
<column type="oid">adrelid</column>
</table>
The majority of PostgreSQL's built-in XML-producing functions have now been described, so let's take a look at the one function used for processing XML: xpath. There are a number of places in a database where this function can be applied. A short description first: its first argument is an XPath expression and the second is a well-formed XML document; if we supply a non-valid document, it returns nothing. Information about the XPath standard can be found on the Internet, for example, on the w3schools website. So where is this function applicable? In extracting the specific pieces we want from documents stored in the database, of course, but that's the common and less interesting use. I've found it really useful in functional indexes, with the ability to index just particular data from the documents. How about an index only over the values of the "name" attribute of a deeply nested "table" tag? Let's go and do it!
CREATE INDEX xpath_functional_index ON documents_table 
USING btree (((xpath('//table/@name', document)::text[])[1]));
This index can only be used by queries containing exactly the expression we used to create it, so a query must look like this:
SELECT * FROM documents_table 
WHERE (xpath('//table/@name', document)::text[])[1] = 'pg_attrdef';
Sure, it's a somewhat inconvenient way to ask for information, but it's perfectly appropriate in many cases.
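To make sure the planner actually picks the index up, you can inspect the plan (documents_table is, of course, the hypothetical table from above):
EXPLAIN SELECT * FROM documents_table
WHERE (xpath('//table/@name', document)::text[])[1] = 'pg_attrdef';
-- look for an Index Scan on xpath_functional_index in the output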
At this point I'll stop describing the XML functions integrated into PostgreSQL, even though I haven't covered all the available XML functionality. Now I'd like to describe the brand new feature implemented in the latest 9.2 version: the JSON type.
Currently, this feature consists of the built-in json data type, validated by the DBMS itself as data is inserted, plus two functions that transform a row or a multidimensional array into a JSON object: array_to_json and row_to_json. These functions are pretty simple, and even the official documentation describes them rather scarcely. Still, I found that combining just these two functions in a more complex query lets us create really elaborate JSON objects.

In this part dedicated to the JSON type, it's also worth mentioning the plv8 extension, created by the PostgreSQL community and already trusted despite being pretty new and an unusual direction for database developers. Its purpose is writing stored procedures in JavaScript or a few of its dialects. In my case, the extension fills the gaps in how JSON can be used in the database: it makes it possible to extract particular values from JSON objects, parse them explicitly, and generate complex JSON objects imperatively, which is often clearer than building them in declarative SQL. Let's take a slightly closer look at this undoubtedly interesting extension.
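Before that, here's a minimal sketch of the two conversion functions (the literals are throwaway values, purely for illustration):
-- row_to_json: turn a composite value into a JSON object
SELECT row_to_json(t) FROM (SELECT 1 AS id, 'welcome' AS name) t;
-- {"id":1,"name":"welcome"}
-- array_to_json: turn an array into a JSON array
SELECT array_to_json(ARRAY['welcome', 'veteran']);
-- ["welcome","veteran"]
As for plv8 itself, it's quite easy to install in PostgreSQL version 9.1 or higher, just type: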
CREATE EXTENSION plv8;
From then on we can create stored procedures in the installed 'plv8' language. The following example shows the basic approach to creating such a procedure:
CREATE FUNCTION json_string(input json, key text) RETURNS text AS $$
  // parse the incoming JSON document and return the value of the requested key
  var data = JSON.parse(input);
  return data[key];
$$ LANGUAGE plv8 IMMUTABLE STRICT;
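Called directly, the function above behaves like this (the document literal is arbitrary):
SELECT json_string('{"name": "welcome", "cost": 10}'::json, 'name');
-- welcome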
Using such a function, which simply returns the value of the requested field, you can create functional indexes over data packed in JSON to speed up selection of particular rows:
CREATE INDEX json_functional_index ON json_documents_table
USING btree(json_string(document,'name'));
However, we should take into account that this kind of index, with its XML or JSON parsing, may seriously impact the database's throughput under high load with lots of insert transactions per second. So here's a small hint that helps manage the cost of indexing at peak time: PostgreSQL's partial indexes. Basically, it boils down to adding a condition to the CREATE INDEX statement, so that the database extracts the value and puts it into the index only when the condition is true. We can explicitly mark the rows we want indexed using an auxiliary column named, for example, "indexed", with the default value 'false'. Then we create the index with the condition we like:
CREATE INDEX json_partial_functional_index ON json_documents_table
USING btree(json_string(document,'name'))
WHERE indexed = true;
Afterwards we are able to handle an increasing number of write transactions by deferring the indexing. Sure, it's a trade-off between write speed and reading up-to-date data, but when it's a heavily loaded database with a requirement to receive fresh data ASAP, I'd advise performing some computation up front, before putting the information into the database, and storing the result in an additional indexed column.
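A minimal sketch of that deferred-indexing workflow (json_documents_table and the "indexed" column are the hypothetical ones from above):
-- off-peak: mark accumulated rows so they enter the partial index
UPDATE json_documents_table SET indexed = true WHERE NOT indexed;
-- queries must repeat the index predicate for the planner to use it
SELECT * FROM json_documents_table
WHERE indexed = true AND json_string(document, 'name') = 'welcome';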
