Sunday, 25 September 2011

Magic of window functions in PostgreSQL

Once I had an opportunity to provide a solution generating "everyday" statistic of game service usage by users perfectly presented by the maximum amount of game points each user had every 24 hours. The most convenient way to show that data anyone who concerns is a graphical chart showing variations of gained points within predefined ranges.

Historically the only raw data we had at the moment was the information of current number of points every time user is logged in.
The interesting part was that user does not log in every day and there was no way simple to present points flow for specified period as we wanted as there was likely to be a lot of zero values. The most suitable behaviour for this chart was to show last available value from the preceding days to fill the gaps. The easiest solution involving less coding and giving potential of reusing it in other services was PostgreSQL window functions.

First of all, a short explanation what these "window functions" are for. There is a special syntax allowing to define particular windows over data and in this widows use of aggregate or window functions, which are "aggregate" as well with a regard of usage only with window definition. For example, this query allows to get the incremental (cumulative) sum of values easily:
SELECT *, sum(value) OVER (ORDER BY "date") FROM "table" ORDER BY "date";
To get a row number, row columns and the first time of logging in for each user in one query we execute this query:
SELECT row_number() OVER () AS row_number, *, min(login_tm) AS first_user_login OVER(PARTITION BY user_id) FROM "table" ORDER BY row_number;
Performing a query with aggregate functions in PostgreSQL we have to either use keywords GROUP BY or create window definition. Here is the more elaborate example getting the last value of points for every user with some unusual syntax shortly described below:
SELECT *, last_value(points) OVER (PARTITION BY user_id ORDER BY date RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS user_points  FROM "table";
If you try to use last_value() without RANGE definition - you get value of current row, not last value in window.

Well, enough theory for today and let's do some practice.

We start with creation of the table for keeping statistics and then continue with data generation to work with. It's going to be totally random numbers for different users with primary key user_id started with 10, for example. All dates will be in future, like we only suppose that user will have this points at that days :)
CREATE TABLE statistics (
    user_id integer NOT NULL,
    date date NOT NULL DEFAULT now(),
    points bigint NOT NULL,
    CONSTRAINT statistics_pkey PRIMARY KEY (user_id, date)
);

CREATE INDEX statistics_user_id_and_date_idx
    ON statistics
    USING btree (user_id, date DESC NULLS LAST);
INSERT INTO statistics
SELECT user_id, date, points FROM 
(
WITH users AS 
(SELECT * FROM generate_series(10, (random() * 10 + 100)::int, 
(random() * 5 + 2)::int) as user_id)
SELECT user_id, generate_series(now() + interval '1 day' * (random() * 2), now() + interval '1 day' * (random() * 100 + 10)::int, interval '1 day' * (random() * 4 + 1))::date as date, (random() * 100000)::int as points FROM users) as generated;
Once these queries are there are up to 1000 rows containing different random points, look like this:

user_iddatepoints
...
542011-11-1468325
542011-11-1540593
542011-11-1643914
562011-09-2129881
562011-09-2661834
562011-10-0133112
...

As soon as data is prepared we can start to fetch particular user logging in dates and points we're interested in. As the first step we fetch all the rows for a particular user, let's say with user_id = 88.
SELECT * FROM statistics WHERE user_id = 88 and date >= date_trunc('d', now()) and date < date_trunc('d', now() + interval '1 month');
Now, we need to generate rows for all the days of chosen period. We use maximum value of date as upper bound for a query.
WITH stats AS (SELECT * FROM statistics WHERE user_id = 88 and date >= date_trunc('d', now()) and date < date_trunc('d', now() + interval '1 month'))

SELECT generated_date FROM generate_series(date_trunc('d', now()), (select max(date) from stats), interval '1 day') as g(generated_date);
Next, we join all the available values from selected rows.
WITH stats AS (SELECT * FROM statistics WHERE user_id = 88 and date >= date_trunc('d', now()) and date < date_trunc('d', now() + interval '1 month'))

SELECT generated_date::date as date, stats.points FROM generate_series(date_trunc('d', now()), (select max(date) from stats), interval '1 day') as g(generated_date)
LEFT JOIN (SELECT points, date FROM stats) as stats
ON stats.date = g.generated_date; 
The result looks like this:

datepoints
2011-11-1468325
2011-11-1540593
2011-11-16
2011-11-17
2011-11-1861834
2011-11-19

Now, we desperately want to set something in these empty cells. First we take the points available before date without points. Here we need the window functions to make it possible. We enumerate every non-empty cell with the aggregate function count(column).
WITH stats AS (SELECT * FROM statistics WHERE user_id = 88 and date >= 
date_trunc('d', now()) and date < date_trunc('d', now() + interval '1 month'))

SELECT generated_date::date as date, stats.points, (count(date) OVER(ORDER BY generated_date))::integer FROM 

generate_series(date_trunc('d', now()), (select max(date) from stats), interval '1 day') as g(generated_date)

LEFT JOIN (SELECT points, date FROM stats) as stats ON stats.date = g.generated_date; 
This query produces the following results:

datepointscount
2011-11-14683251
2011-11-15405932
2011-11-162
2011-11-172
2011-11-18618343
2011-11-193

As you can see the counter increments only on filled cells. Let's fill the gaps in "points" column with the window function nth_value(value, position).

WITH stats AS (SELECT * FROM statistics WHERE user_id = 88 and date >= date_trunc('d', now()) and date < date_trunc('d', now() + interval '1 month'))

SELECT date, nth_value(points, count) OVER (ORDER BY date) FROM

(SELECT generated_date::date as date, stats.points, (count(date) OVER(ORDER BY generated_date))::integer as count FROM
generate_series(date_trunc('d', now()), (select max(date) from stats), interval '1 day') as g(generated_date)
LEFT JOIN (SELECT points, date FROM stats) as stats ON stats.date = g.generated_date) as pcount;
After execution of this query we receive results we wanted!

datepoints
2011-11-1468325
2011-11-1540593
2011-11-1640593
2011-11-1740593
2011-11-1861834
2011-11-1961834

Now you might pack up this complex SQL into a stored procedure and use it with different values of user_id and/or period.

No comments:

Post a Comment