After agreeing to write this blog post (I say ‘agreeing’, but it’s not as if my boss actually gave me a choice), I said to myself that I would try not to write some formal, dry college-assignment-like exposition of the topic by simply suturing together paragraphs from various websites. Instead, I would try to share the theoretical concepts along with our practical experience here in t-matix. This being my first blog post or wannabe-article of any kind, don’t judge me too harshly.
Time series databases (TSDBs) have been the fastest growing segment of the database industry for the last couple of years. They are optimized for storage and retrieval of ‘time-stamped’ data as well as for performing various time-based analytic functions. Data should mostly be sent to them in-order and should mostly be append-only (shouldn’t require many updates, just inserts). If this is not the case, it can negatively affect performance.
Row- vs. column-oriented databases
Most TSDBs store values of a single column contiguously one after another on a file system (they are referred to as column-oriented). In contrast, a majority of traditional relational databases (RDBMSs) used for transactional systems (Postgres, MySQL, Oracle, etc.) store data in rows, whereby one row is stored after another (they are referred to as row-oriented). As it is often the case in life – neither approach is universally better.
Image 1. Storing of data in row- vs. column-oriented manner (taken from https://www.dbbest.com/blog/column-oriented-database-technologies/)
Values in one column are often similar or exactly the same in the contiguous column segments, so they can be efficiently compressed in column-oriented databases. Think of temperature sensor readings – they can be performed every second and not change at all for comparatively long periods of time. Besides the obvious disk space thriftiness, this enhances write and read throughput as well due to lower IO requirements. Why are all databases not column-oriented then? Reading data column by column is not so efficient if you need to fetch data from a large number of columns of the same row, as is often the case in traditional transactional RDBMSs.
TSDBs that store data in a column-oriented manner have additional layers of complexity in their data storage model, which I won’t go into here for the sake of brevity.
The database (or rather database plugin) that I am recommending here is TimescaleDB. It is a plugin for PostgreSQL that can be installed hassle-free on your existing PostgreSQL database. You just need to follow a few easy steps, which I’ll be listing after the article (courtesy of our Sys-admin Martina).
Once you install it, nothing really changes in terms of schemas, tables, indexes, etc. However, you can now easily make a TimescaleDB table (the official term is hypertable) out of your existing table or any new table that you create. If it is not empty, additional steps need to be made, which you can find in the official documentation.
Creating an empty table is as easy as (random table structure):
‘create_hypertable’ function turns the ordinary table ‘tbl_xyz’ into a hypertable. The first argument is the name of the table and the second is the name of the timestamp column that you want to use as timestamp for the table (this being a table containing time series data). The name of the table stays the same and no new table is created.
Similar instructions can easily be found on the TimescaleDB’s main page, but I am listing them here to show how easy it is to convert an ordinary table into a hypertable. Once a conversion into a hypertable is done, you can perform CRUD operations as you would on any ordinary table. You can even create additional indexes (timestamp column index is created automatically). However, querying and inserting copious amounts of time series data from/into it should now go faster. It won’t make much difference if you are not working with a lot of data.
The increased performance is a result of the TimescaleDB’s behind-the-scenes splitting of hypertables into partitions (called chunks) by the timestamp field. Indexes are then stored per-partition and a complete index for the whole table doesn’t need to be loaded into RAM and modified as new records are inserted. What is more, it’s possible to assign chunks to different tablespaces, thus enabling storage of a single table’s data onto multiple disks. It is worth repeating that all of this is done behind the scenes and from the user’s perspective, each hypertable behaves just like an ordinary table. You can only differentiate between them by using special TimescaleDB’s functions.
TimescaleDB also adds some time-series-based functions that we find very useful, such as ‘first’ or ‘last’. They enable you to get the value of one attribute for the same row in which another attribute has the maximum or minimum value after “group by” is applied. Without them you’d probably have to resort to using PostgreSQL’s window function partitioning, which is more verbose and thus harder to maintain.
For example, if you have a three-column table containing transaction_timestamp, customer_id and account_balance, and you want to get last account_balance per customer, you can do this by executing the following:
Without using timescaleDB’s ‘last’ function, you’d probably have to execute something like this:
Since TimescaleDB is a PostgreSQL plugin, it stores data in a row-oriented manner, which is not optimal for storing time series data. As a consequence, TimescaleDB data can often take up more disk space compared to other TSDBs (e.g. InfluxDB) and require more IO operations. On the flip side, since TimescaleDB is only a PostgreSQL plugin, your hypertables can peacefully coexist with other tables that are also part of your model, and all operations can be performed on or between (joins) any of them. Also, since PostgreSQL has existed for more than two decades, it is reasonable to assume that it is much more stable and that system admins are well acquainted with its deployment, maintenance and tuning.
There are many benchmarks available online comparing performance of TimescaleDB vs. other TSDBs. It is hard to draw a consistent conclusion of its performance other than the aforementioned disk space usage. I guess your best bet is to do the benchmarks yourself with the load and data model similar to what you expect to have in production.
Here in t-matix we have been using TimescaleDB successfully for some time and haven’t experienced many problems. There was one occasion when the whole PostgreSQL instance was unrecoverably corrupted after the virtual machine running it was powered off, but we can’t be certain that this was caused by TimescaleDB. Plus, we had installed TimescaleDB on the ‘postgres’ database (the default postgres database) and this wasn’t something we’d have normally done. The only reason for that was the fact that it was our development database, and thus we didn’t follow the best practices while installing it.
We currently have tables containing close to a hundred million rows and we have been seeing no drop-off in write performance as table sizes increase while using TimescaleDB. The latter also enabled us to drop chunks of them based on the timestamp column without doing resource intensive batched delete statements.
If you are using PostgreSQL database in your application and need to store a large volume or handle high velocity of time series based data, consider using a TimescaleDB plugin. It is easy to configure, offers increased performance and new time series data handling functions, and will work seamlessly with other database objects that are part of your model.
December 18, 2019
Continuosly building, testing, releasing and monitoring t-matix mobile apps