Thursday 2 February 2012

How to optimize search through 1 Terabyte in PostgreSQL

Today I wanted to share my experience of working with a relatively big database. It is an active production database containing nearly 1 TB of data, managed by PostgreSQL 9.0. Most of the data lives in just two tables, 650 GB and 350 GB respectively. These tables receive a steady stream of writes - roughly 15-20 inserts per second - but only a few SELECTs per minute. Each row in either table holds up to 50 KB of data. Additionally, a single cleanup query runs once a day to remove the oldest data and keep the database from growing without bound. Well, let's take a look under the hood.
First of all, I would like to give a more detailed view of the database model and tables. The model was designed from scratch to be write-lockless, meaning the database must allow steady, continuous data writing regardless of whether any maintenance query would otherwise take a long table write lock or degrade write speed. PostgreSQL has many maintenance operations that don't take exclusive locks blocking data writing or reading - even index creation, if done with CREATE INDEX CONCURRENTLY - but there are still a few kinds of queries that do.

Thus, to be able to maintain the database whenever needed without stopping the insert flow, I decided to borrow some tricks from PostgreSQL DDL partitioning. This kind of partitioning relies on two special PostgreSQL features: "rules" and "table inheritance". The following points briefly explain how data moves from the beginning to the end:
  • The services connected to the database perform INSERT queries against the table logs.
  • There are two additional tables in the database, named logs_check_true and logs_check_false. These tables inherit from the main table logs, so they have the same set of columns as the parent, and every SELECT query on the main table automatically fetches matching data from all inherited tables (applying any conditions present) and appends it to the result.
  • The DBMS rewrites every INSERT query into a query that inserts the data into a descendant table. The system checks the condition of every rule on the main table and uses the first one that matches. In our case the condition is on the boolean column check, and the main table has one rule per column value; this approach keeps the main table physically empty and writes all data to the inherited tables. So, if we ever need to perform an operation on a descendant table that would take an exclusive write lock, we just disable the rules - all inserts then go to the main table, and we are free to perform that operation.
CREATE TABLE logs (
    id bigserial NOT NULL,
    "check" boolean DEFAULT false NOT NULL,
    "time" timestamp without time zone DEFAULT now() NOT NULL,
    log_data xml NOT NULL,
    ids integer[] NOT NULL,
    PRIMARY KEY (id)
);

CREATE TABLE logs_check_false (
    PRIMARY KEY (id),
    CONSTRAINT non_present_check CHECK (("check" = false))
) INHERITS (logs);

CREATE TABLE logs_check_true (
    PRIMARY KEY (id),
    CONSTRAINT non_present_check CHECK (("check" = true))
) INHERITS (logs);

CREATE RULE check_false_rule AS 
ON INSERT TO logs 
WHERE (new."check" = false) 
    DO INSTEAD
    INSERT INTO logs_check_false (id, "check", "time", log_data, ids) 
    VALUES (new.id, new."check", new."time", new.log_data, new.ids);

CREATE RULE check_true_rule AS 
ON INSERT TO logs 
WHERE (new."check" = true) 
    DO INSTEAD
    INSERT INTO logs_check_true (id, "check", "time", log_data, ids) 
    VALUES (new.id, new."check", new."time", new.log_data, new.ids);


As I mentioned earlier, the number of rules equals the number of inherited tables and is based on the number of distinct values in the chosen column check. Thus, we can switch off any rule at a chosen time, and all data matching that rule's condition will be written to the main table instead - which allows us, for instance, to reindex the descendant table. After reindexing we simply switch the rule back on and move the data from the main table to the descendant; the target table is chosen automatically by the rules' conditions.
ALTER TABLE logs DISABLE RULE check_true_rule; 
-- Here we perform the maintenance commands
ALTER TABLE logs ENABLE RULE check_true_rule;

-- Before 9.1 we had to run two statements in a transaction to move the data to the descendant tables
BEGIN;
INSERT INTO logs SELECT * FROM ONLY logs;
DELETE FROM ONLY logs;
COMMIT;

If you have PostgreSQL 9.1 or above, then you can delete the data from the main table and insert it into the descendant tables with just one query:
WITH deleted AS (DELETE FROM ONLY logs RETURNING *)
INSERT INTO logs SELECT * FROM deleted;

I should note that the inherited tables are independent objects: each has its own indexes, triggers, primary key and so forth. Therefore we must create these objects for each inherited table separately.
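Since inheritance spreads the physical rows across several tables, it is handy to check where the data actually lives; a quick sanity check using the system column tableoid looks like this:

```sql
-- Count how many rows physically live in each table of the hierarchy;
-- with the rules enabled, the parent "logs" should report no rows at all.
SELECT tableoid::regclass AS physical_table, count(*)
FROM logs
GROUP BY tableoid;
```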

The major column, used in almost all selection queries in this database, is the column named ids. As you can see from the table creation query, its type is an array of integers. In my case, an array inserted into this column never holds more than 10 elements, but the allowed range of values is huge: any number from 0 to 2,147,483,647 (the largest integer value in PostgreSQL), although in my production database the values never exceeded a few million. As I mentioned before, this major column is used in almost all selection queries, so it obviously must be indexed. But what type of index should we use?
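For illustration, a typical row might be inserted like this (a hypothetical example; the real XML payload and identifier values come from the connected services):

```sql
-- id and "time" take their defaults; since "check" defaults to false,
-- the rules route this row into logs_check_false
INSERT INTO logs (log_data, ids)
VALUES ('<event type="login"/>'::xml, ARRAY[17, 4242, 1000003]);
```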

There are 4 index types in PostgreSQL 9.0: btree, hash, GiST and GIN. We can't use btree or hash, because those types don't support indexing individual values inside an array - they index only the full array, i.e. a particular sequence of values. Such indexes are useful only when you search for an exact array with its values in an exact order; any other search allowed on this column type falls back to a Sequential Scan, which rummages through the entire table and is very slow. But PostgreSQL has two additional index types, GIN and GiST. They are frequently used for full-text search and are able to index individual values within an array. These types are very effective for queries like "find all rows where the array in column ids contains some value(s) in any position".
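To make the distinction concrete, here is how the relevant array searches look; containment (@>) and overlap (&&) are the operators GIN/GiST can serve, while plain equality is all a btree could handle:

```sql
-- Containment: ids includes every listed value, in any position (GIN-indexable)
SELECT count(*) FROM logs WHERE ids @> ARRAY[1, 7];
-- Overlap: ids shares at least one value with the list (GIN-indexable)
SELECT count(*) FROM logs WHERE ids && ARRAY[1, 7];
-- Exact match of the whole array - the only search a btree on ids could speed up
SELECT count(*) FROM logs WHERE ids = ARRAY[1, 7];
```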

So, what is the difference between these two index types, which seem to do the same thing at first sight? Briefly, GiST is about 3 times faster than GIN at data insertion, but GIN is about 3 times faster at searching. There is also a more fundamental distinction between the two, which I will come back to later.

With this information about the index types - their capabilities, performance, and the advantages of each - I decided to index the column ids with a GIN index: there aren't that many inserts into this column per query, and a 3x speed-up on selects (versus a GiST index) is a great benefit. Here are the index creation queries for the existing tables:
CREATE INDEX rdtree_logs_idx ON logs USING gin (ids) WITH (fastupdate=off);
CREATE INDEX rdtree_logs_check_false_idx ON logs_check_false USING gin (ids) WITH (fastupdate=off);
CREATE INDEX rdtree_logs_check_true_idx ON logs_check_true USING gin (ids) WITH (fastupdate=off);

As you can see, the indexes are created with the additional parameter fastupdate deliberately turned off. This parameter is enabled by default. It makes the DBMS accumulate newly inserted entries for that index in memory and write them into the main index structure only when the index is vacuumed - in other words, only when someone manually, or the autovacuum background daemon automatically, runs VACUUM on that table to physically remove rows marked for deletion. fastupdate is on by default to speed up data insertion and updates, because it reduces how often the GIN index structure must be rebuilt; but in our particular circumstances - no updates at all, only continuous inserts and rare deletes - this feature is useless and can only cause problems such as search speed degradation. You can find more information about the GIN fast update technique in the official documentation.
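Note that fastupdate can also be toggled on an existing index without rebuilding it, which makes it easy to experiment (the setting affects subsequently inserted entries):

```sql
-- Re-enable the pending-list optimization on an existing GIN index...
ALTER INDEX rdtree_logs_check_false_idx SET (fastupdate = on);
-- ...and turn it off again; a VACUUM then flushes any accumulated pending entries
ALTER INDEX rdtree_logs_check_false_idx SET (fastupdate = off);
VACUUM logs_check_false;
```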

Well, at first, while we have only about 1-2 GB of data, our queries of the form "find all rows where the array ids contains value N" perform well:
EXPLAIN ANALYZE SELECT * FROM logs WHERE ids @> ARRAY[1];
Result  (cost=0.00..74.87 rows=17 width=8) (actual time=21.045..21.273 rows=19 loops=1)
  ->  Append  (cost=0.00..74.87 rows=17 width=8) (actual time=21.043..21.259 rows=19 loops=1)
        ->  Seq Scan on public.logs  (cost=0.00..0.00 rows=1 width=8) (actual time=0.001..0.001 rows=0 loops=1)
              Filter: (public.logs.ids @> '{1}'::integer[])
        ->  Bitmap Heap Scan on public.logs_check_false logs  (cost=12.06..37.44 rows=8 width=8) (actual time=21.040..21.099 rows=7 loops=1)
              Recheck Cond: (public.logs.ids @> '{1}'::integer[])
              ->  Bitmap Index Scan on rdtree_logs_check_false_idx  (cost=0.00..12.06 rows=8 width=0) (actual time=21.014..21.014 rows=7 loops=1)
                    Index Cond: (public.logs.ids @> '{1}'::integer[])
        ->  Bitmap Heap Scan on public.logs_check_true logs  (cost=12.06..37.44 rows=8 width=8) (actual time=0.047..0.143 rows=12 loops=1)
              Recheck Cond: (public.logs.ids @> '{1}'::integer[])
              ->  Bitmap Index Scan on rdtree_logs_check_true_idx  (cost=0.00..12.06 rows=8 width=0) (actual time=0.033..0.033 rows=12 loops=1)
                    Index Cond: (public.logs.ids @> '{1}'::integer[])
Total runtime: 21.345 ms

But after a few days of continuous data insertion I noticed that search execution was becoming increasingly slow. A short investigation revealed the cause. More than 70% of search queries took a stable amount of execution time; only some queries took very long to complete. Looking closer at those queries, I found an interesting issue: the database fetched all matching rows from the inherited tables before applying the LIMIT condition. The planner chooses this approach for queries that order their results - reverse ordering forces the database to fetch every matching row from the descendant tables, and only afterwards to combine the results and apply the LIMIT.
EXPLAIN ANALYZE SELECT * FROM logs WHERE ids @> ARRAY[1] ORDER BY id DESC LIMIT 30;
Limit  (cost=120.91..120.98 rows=30 width=8) (actual time=207.729..207.755 rows=30 loops=1)
  ->  Sort  (cost=120.91..120.99 rows=31 width=8) (actual time=207.725..207.733 rows=30 loops=1)
        Sort Key: public.logs.id
        Sort Method: top-N heapsort  Memory: 18kB
        ->  Result  (cost=0.00..120.14 rows=31 width=8) (actual time=50.171..203.421 rows=6230 loops=1)
              ->  Append  (cost=0.00..120.14 rows=31 width=8) (actual time=50.169..198.798 rows=6230 loops=1)
                    ->  Seq Scan on public.logs  (cost=0.00..0.00 rows=1 width=8) (actual time=0.001..0.001 rows=0 loops=1)
                          Filter: (public.logs.ids @> '{1}'::integer[])
                    ->  Bitmap Heap Scan on public.logs_check_false logs  (cost=12.12..60.07 rows=15 width=8) (actual time=50.167..82.187 rows=3133 loops=1)
                          Recheck Cond: (public.logs.ids @> '{1}'::integer[])
                          ->  Bitmap Index Scan on rdtree_logs_check_false_idx  (cost=0.00..12.11 rows=15 width=0) (actual time=42.088..42.088 rows=3133 loops=1)
                                Index Cond: (public.logs.ids @> '{1}'::integer[])
                    ->  Bitmap Heap Scan on public.logs_check_true logs  (cost=12.12..60.07 rows=15 width=8) (actual time=64.068..112.229 rows=3097 loops=1)
                          Recheck Cond: (public.logs.ids @> '{1}'::integer[])
                          ->  Bitmap Index Scan on rdtree_logs_check_true_idx  (cost=0.00..12.11 rows=15 width=0) (actual time=41.335..41.335 rows=3097 loops=1)
                                Index Cond: (public.logs.ids @> '{1}'::integer[])
Total runtime: 227.994 ms

Why does this happen? It is related to the planner's use of a Bitmap Index/Heap Scan over the GIN index. A Bitmap Scan makes PostgreSQL gather information about matching rows in two stages. The first stage, "Bitmap Index Scan", obtains the physical locations of candidate rows from the index. The second stage, "Bitmap Heap Scan", visits each of those rows to recheck that it really contains the searched value. This is a good approach when the index returns 10-20 rows per condition and those rows are physically located in only a few files of the database storage (every database object in PostgreSQL has its own file with a unique filename, and each object is sliced into 1 GB segments). But when a query gets thousands of matching rows scattered across many different files, search speed degrades. In the example above all the rows were located in one file, which is why the query still finished in under 300 ms.
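You can inspect the physical layout yourself: a relation's storage file can be located with pg_relation_filepath, and its on-disk size (and hence its number of 1 GB segments) with pg_relation_size:

```sql
-- Path of the first segment of the table's storage file, relative to the data directory
SELECT pg_relation_filepath('logs_check_false');
-- On-disk size of the table; above 1 GB the file is split into .1, .2, ... segments
SELECT pg_size_pretty(pg_relation_size('logs_check_false'));
```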

Based on this, I decided to try a GiST index, because it supports a plain Index Scan. But once I had created the index I found an annoying behavior. I should mention again that our tables have a special kind of indexed column: an array whose values span a huge range - up to ten million unique numbers. In this case the GiST index grew larger than the whole table! I found a note on the PostgreSQL developers' forum that if you create a GiST index over arrays whose values span a wide range, you must use a special extension to be able to work with it comfortably. Clearly this index type does not suit our particular data.

So, it was necessary to make the number of rows returned from the index as small as possible. That would increase the chance of fetching the needed records from a single file and reduce the number of rows to sort. The only way to achieve this was to give the index an extra condition that distinguishes records properly - in our case the column time, which holds a timestamp. A little investigation showed that indexing time in one-hour buckets would be enough, because there are at most 50 rows for any given identifier within any hour. My first thought was a multicolumn index, but GIN and GiST indexes cannot handle "simple" data types by default. However, PostgreSQL is able to combine result sets returned by different indexes, even of disparate types, and it supports functional indexes. Therefore I created an additional btree index on the result of the function date_trunc('hour', "time"), which returns the timestamp from column time truncated to one-hour precision. With this index in place, the database can "slice" the main index into pieces: it runs Bitmap Index Scans over both the GIN index and the btree index and then combines the result sets with a BitmapAnd operation. This lets the database fetch only a small amount of data with the Bitmap Heap Scan.

CREATE INDEX btree_logs_idx ON logs USING btree (date_trunc('hour', "time"));
CREATE INDEX btree_logs_check_true_idx ON logs_check_true USING btree (date_trunc('hour', "time"));
CREATE INDEX btree_logs_check_false_idx ON logs_check_false USING btree (date_trunc('hour', "time"));
EXPLAIN ANALYZE SELECT * FROM logs WHERE ids @> ARRAY[1] and date_trunc('hour', "time") = date_trunc('hour', now() - interval '1 month 1 week')::timestamp;
 Result  (cost=0.00..43.72 rows=3 width=88) (actual time=62.320..120.507 rows=27 loops=1)
   ->  Append  (cost=0.00..43.72 rows=3 width=88) (actual time=62.318..120.479 rows=27 loops=1)
         ->  Seq Scan on public.logs  (cost=0.00..0.00 rows=1 width=81) (actual time=0.001..0.001 rows=0 loops=1)
               Filter: ((public.logs.ids @> '{1}'::integer[]) AND (date_trunc('hour'::text, public.logs."time") = (date_trunc('hour'::text, (now() - '1 mon 7 days'::interval)))::timestamp without time zone))
         ->  Bitmap Heap Scan on public.logs_check_false logs  (cost=17.83..21.86 rows=1 width=91) (actual time=62.315..66.031 rows=13 loops=1)
               Recheck Cond: ((date_trunc('hour'::text, public.logs."time") = (date_trunc('hour'::text, (now() - '1 mon 7 days'::interval)))::timestamp without time zone) AND (public.logs.ids @> '{1}'::integer[]))
               ->  BitmapAnd  (cost=17.83..17.83 rows=1 width=0) (actual time=50.456..50.456 rows=0 loops=1)
                     ->  Bitmap Index Scan on btree_logs_check_false_idx  (cost=0.00..5.30 rows=138 width=0) (actual time=9.465..9.465 rows=161 loops=1)
                           Index Cond: (date_trunc('hour'::text, public.logs."time") = (date_trunc('hour'::text, (now() - '1 mon 7 days'::interval)))::timestamp without time zone)
                     ->  Bitmap Index Scan on rdtree_logs_check_false_idx  (cost=0.00..12.28 rows=38 width=0) (actual time=40.969..40.969 rows=3380 loops=1)
                           Index Cond: (public.logs.ids @> '{1}'::integer[])
         ->  Bitmap Heap Scan on public.logs_check_true logs  (cost=17.83..21.86 rows=1 width=92) (actual time=49.210..54.416 rows=14 loops=1)
               Recheck Cond: ((date_trunc('hour'::text, public.logs."time") = (date_trunc('hour'::text, (now() - '1 mon 7 days'::interval)))::timestamp without time zone) AND (public.logs.ids @> '{1}'::integer[]))
               ->  BitmapAnd  (cost=17.83..17.83 rows=1 width=0) (actual time=38.773..38.773 rows=0 loops=1)
                     ->  Bitmap Index Scan on btree_logs_check_true_idx  (cost=0.00..5.30 rows=138 width=0) (actual time=18.074..18.074 rows=164 loops=1)
                           Index Cond: (date_trunc('hour'::text, public.logs."time") = (date_trunc('hour'::text, (now() - '1 mon 7 days'::interval)))::timestamp without time zone)
                     ->  Bitmap Index Scan on rdtree_logs_check_true_idx  (cost=0.00..12.28 rows=37 width=0) (actual time=20.677..20.677 rows=3356 loops=1)
                           Index Cond: (public.logs.ids @> '{1}'::integer[])
Total runtime: 144.265 ms

While I was trying to find a way to improve the select queries, an interesting article caught my eye, written by one of the developers who implemented and continues to work on the GIN/GiST index types in PostgreSQL. In this article Oleg Bartunov presented the advantages of multicolumn GIN indexes built with an extension named btree_gin. This extension emulates btree behavior inside GIN: essentially, it allows simple data types to be stored in a GIN index and lets the database query that index with simple conditions like '=', '<', '>'. Encouraged by his presentation, I decided to keep only one main index per table - besides the primary key index, of course. Installing an extension into a PostgreSQL database before version 9.1 was not easy: you had to find the required *.sql file (btree_gin.sql in our case) in the special contrib directory and then execute all the queries from that file in the selected database.

Recent PostgreSQL versions make installing extensions much easier. You just execute the following query in the chosen database and the selected extension is installed automatically:
CREATE EXTENSION btree_gin;
With the btree_gin extension I was able to create a multicolumn index on the column ids, of type integer[], together with the result of the function date_trunc('hour', "time"), of the simple type timestamp:
CREATE INDEX rdtree_multicolumn_logs_idx ON logs USING gin (ids, date_trunc('hour', "time"));
CREATE INDEX rdtree_multicolumn_logs_check_true_idx ON logs_check_true USING gin (ids, date_trunc('hour', "time"));
CREATE INDEX rdtree_multicolumn_logs_check_false_idx ON logs_check_false USING gin (ids, date_trunc('hour', "time"));
Once the indexes were created, PostgreSQL began using only them for all select queries carrying both conditions. Multicolumn GIN indexes are more effective because they don't force the database to BitmapAnd two different indexes. In my case the database always chose a Bitmap Index/Heap Scan to retrieve rows, but in the article by Oleg Bartunov that I mentioned earlier, the author managed to get the DBMS to avoid the Bitmap Scan and use an Index Scan instead. I suppose that is linked to his GIN index being over a tsvector column.
EXPLAIN ANALYZE SELECT * FROM logs WHERE ids @> ARRAY[1] and date_trunc('hour', "time") = date_trunc('hour', now() - interval '1 month 1 week')::timestamp;
 Result  (cost=0.00..48.08 rows=3 width=88) (actual time=58.910..100.429 rows=27 loops=1)
   ->  Append  (cost=0.00..48.08 rows=3 width=88) (actual time=58.907..100.370 rows=27 loops=1)
         ->  Seq Scan on public.logs  (cost=0.00..0.00 rows=1 width=81) (actual time=0.001..0.001 rows=0 loops=1)
               Filter: ((public.logs.ids @> '{1}'::integer[]) AND (date_trunc('hour'::text, public.logs."time") = (date_trunc('hour'::text, (now() - '1 mon 7 days'::interval)))::timestamp without time zone))
         ->  Bitmap Heap Scan on public.logs_check_false logs  (cost=20.01..24.04 rows=1 width=91) (actual time=58.903..62.750 rows=13 loops=1)
               Recheck Cond: ((public.logs.ids @> '{1}'::integer[]) AND (date_trunc('hour'::text, public.logs."time") = (date_trunc('hour'::text, (now() - '1 mon 7 days'::interval)))::timestamp without time zone))
               ->  Bitmap Index Scan on rdtree_multicolumn_logs_check_false_idx  (cost=0.00..20.01 rows=1 width=0) (actual time=30.275..30.275 rows=13 loops=1)
                     Index Cond: ((public.logs.ids @> '{1}'::integer[]) AND (date_trunc('hour'::text, public.logs."time") = (date_trunc('hour'::text, (now() - '1 mon 7 days'::interval)))::timestamp without time zone))
         ->  Bitmap Heap Scan on public.logs_check_true logs  (cost=20.01..24.04 rows=1 width=92) (actual time=32.400..37.579 rows=14 loops=1)
               Recheck Cond: ((public.logs.ids @> '{1}'::integer[]) AND (date_trunc('hour'::text, public.logs."time") = (date_trunc('hour'::text, (now() - '1 mon 7 days'::interval)))::timestamp without time zone))
               ->  Bitmap Index Scan on rdtree_multicolumn_logs_check_true_idx  (cost=0.00..20.01 rows=1 width=0) (actual time=24.078..24.078 rows=14 loops=1)
                     Index Cond: ((public.logs.ids @> '{1}'::integer[]) AND (date_trunc('hour'::text, public.logs."time") = (date_trunc('hour'::text, (now() - '1 mon 7 days'::interval)))::timestamp without time zone))
 Total runtime: 123.613 ms
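Since each multicolumn index now covers both conditions, the earlier single-purpose indexes become redundant; dropping them (a cleanup sketch, assuming nothing else relies on them) reclaims disk space and speeds up inserts:

```sql
-- Remove the superseded single-column GIN and functional btree indexes
DROP INDEX rdtree_logs_check_false_idx, btree_logs_check_false_idx;
DROP INDEX rdtree_logs_check_true_idx, btree_logs_check_true_idx;
DROP INDEX rdtree_logs_idx, btree_logs_idx;
```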
Finally, I created a stored procedure in PL/pgSQL. It fetches the required number of rows from the tail of the tables for a certain identifier expected in the ids column. The procedure starts searching from the current time (or from the time of the row whose primary identifier is passed as a parameter), then decreases that time hour by hour and searches again until it finds the required number of rows or hits the lower boundary.
Stored procedure creation query:
CREATE FUNCTION fetch_last_logs(search_id integer, bigint DEFAULT 0, integer DEFAULT 20) RETURNS SETOF logs
    LANGUAGE plpgsql ROWS 20
    AS $_$
DECLARE
        log logs%ROWTYPE;
        -- hour bucket currently being searched, newest first
        log_hour timestamp := date_trunc('hour', now());
        -- never look further back than one month
        interval_limit interval := '1 month';

        res_limit ALIAS FOR $3;  -- how many rows are still needed

        offset_id ALIAS FOR $2;  -- fetch only rows older than this id
BEGIN

IF offset_id <= 0 THEN
        -- no offset given: start from the newest possible id
        SELECT nextval('logs_id_seq') INTO offset_id;
ELSE
        -- start from the hour of the row the client already has
        SELECT date_trunc('hour', "time") FROM logs WHERE id = offset_id INTO log_hour;
END IF;

LOOP
        FOR log IN SELECT * FROM logs
                   WHERE ids @> ARRAY[search_id]
                     AND date_trunc('hour', "time") = log_hour
                     AND id < offset_id
                   ORDER BY id DESC LIMIT res_limit
        LOOP
                res_limit := res_limit - 1;
                RETURN NEXT log;
        END LOOP;

        -- step one hour back and try again
        log_hour := log_hour - interval '1 hour';

        IF now() - log_hour > interval_limit OR res_limit <= 0 THEN EXIT; END IF;
END LOOP;
RETURN;
END;
$_$;
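Calling the procedure is straightforward: the defaults return the 20 newest matching rows, and passing a row's primary key as the second argument pages further back (the identifier values here are hypothetical):

```sql
-- The 20 most recent rows whose ids array contains 42
SELECT * FROM fetch_last_logs(42);
-- The next page: up to 20 older rows, before the row with id = 100000
SELECT * FROM fetch_last_logs(42, 100000);
-- A smaller page of 5 rows
SELECT * FROM fetch_last_logs(42, 100000, 5);
```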

This procedure returns the latest rows for a given identifier within a few seconds at most, even from a database containing 1 TB of data.
