We have been working on the Stellar blockchain for quite some time. Recently, we got an opportunity to work on a POC (proof of concept) to build payment infrastructure on a blockchain. It was an experimental project for us to see whether a blockchain can be suitable for systems with such high load and throughput. We chose the Stellar blockchain for this purpose. The reason we chose Stellar deserves a separate blog post of its own and hence is out of the scope of this post.

Once we had built the basic infrastructure on top of the DLT to suit the business need, we started testing the system under high load. The initial requirement was to achieve at least 1K transactions per second (TPS). Note that when I say transaction, I mean a bank transaction, not a Stellar transaction. One bank transaction consists of 2 Stellar operations in 2 separate ledgers, which means we were expecting at least 2K operations per second. Given that Stellar's ledger close time is 5 seconds and max_tx_set_size (the maximum number of operations closed in every ledger) is 1000, the maximum throughput we can get from stock Stellar is only 200 operations per second. Hence, we did a hard fork of Stellar to improve performance. Again, the changes we made in the Stellar core software are out of the scope of this post - I will write a separate blog on that.

We started our experiment and were able to achieve around 400 TPS on a 6-node network with 1 validator node. The problem came when we wanted to go beyond that figure; it looked like there was some bottleneck. Finding a performance bottleneck is a well-known and challenging problem, and there is a lot of literature available online about it. Normally, in such cases, we diagnose every layer of the system (look at the figure below to understand the overall system) and try to find out where the problem is, and that is what we did.

Goal & expectation

Now that you have the background of what we were trying to do, let's set the goal and expectations for this blog post. I will talk about our investigation of the database layer in the DLT, specifically the Postgres database: how we improved its performance, which eventually improved the performance of the whole system. I intend to give a basic idea of what to do to understand such a problem and to improve the performance of a Postgres database in general. (It has nothing to do with blockchain :p)

We want to learn the following:

  • What was the problem we noticed in our Postgres database configuration, and how did we identify it?
  • How did we resolve that problem?
  • What was the effect of the resolution on the overall system?

The Problem

As I mentioned, we wanted to achieve high throughput, but we felt that there was some bottleneck in the system. The problem could be anywhere: a hardware limitation, a software limitation, network bandwidth, incorrect configuration, a limitation of Stellar consensus, a limitation of the Postgres database, etc. As far as the hardware was concerned, we were good, since we had a decent machine with a 4-core Intel x86_64 CPU, 16 GiB RAM, and a 150 GB SSD. There are many performance monitoring tools available, but because of our infrastructure limitations we could not bring external tools in, so we used the standard Linux tool top to measure the overall hardware performance.

We decided to diagnose every layer of the software, starting from the database layer. Let's look at the approaches we took to find out what was wrong with the Postgres database.

Standard Tool

We started our examination with a simple tool like ps to identify the individual server processes and check their health. The process with PID 24085 is the master process, and the rest are background processes launched by it. Each of the remaining processes is a server process handling one client connection and is displayed in the form:

postgres: user database host activity

Activity

The user, database, and host items remain the same for the life of the client connection, but the activity indicator changes. The activity can be idle (i.e., waiting for a client command), idle in transaction (waiting for a client inside a BEGIN block), or a command type name such as SELECT. Also, waiting is appended if the server process is presently waiting on a lock held by another session. At the time of taking this screenshot, they were all idle, but while your server is under load, if you run ps -aux | grep postgres multiple times, you can see the activity keep changing. The point to note here is whether there is any blocking process holding up another process from completing its transaction. In that case, the blocking process would show the activity idle in transaction and the blocked process would show SELECT waiting. In more complicated cases it would be necessary to look into the pg_locks system view to determine who is blocking whom.
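To make this check easier from inside the database, here is a small sketch of queries that can be used (these are not from our original run; pg_blocking_pids() requires PostgreSQL 9.6 or later):

-- the same information ps shows, but read from within Postgres
SELECT pid, usename, datname, client_addr, state, query
FROM pg_stat_activity;

-- sessions currently stuck behind a lock, and the PIDs holding them up
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       state,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;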

In our case things looked normal, so we proceeded to the next step.

The Statistics Collector

The Statistics Collector is a subsystem that supports collection and reporting of information about server activity. It can count accesses to tables and indexes in both disk-block and individual-row terms. It also tracks the total number of rows in each table, and information about vacuum and analyze actions for each table. It can also count calls to user-defined functions and the total time spent in each one. In the above screenshot, you can see that the process with PID 24092 is the stats collector process.

The downside of collecting statistics is that it adds some overhead to query execution, but the system can be configured to collect or not collect the information. Normally, these parameters are set in postgresql.conf so that they apply to all server processes, and it is also possible to turn them on or off in individual sessions using the SET command. More about the statistics subsystem can be read in the PostgreSQL documentation.
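If you want to check or toggle these settings, here is a minimal sketch (the session-level SET works for superusers; we left everything at its defaults):

-- check whether table/index access counting is enabled (it is on by default)
SHOW track_counts;

-- a superuser can additionally count user-defined function calls, for the current session only
SET track_functions = 'all';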

We did not need to change anything about the stats collector process.

Understanding Index Usage

Next, we went to check whether the indexing was proper. An index is a way to efficiently retrieve a relatively small number of rows from a large table. While accessing data from the cache is faster than going to disk, even data in memory can be slow to reach if Postgres must scan through hundreds of thousands of rows to check whether they meet a certain condition.

To generate a list of the tables in our database, with the largest ones first, along with the percentage of the time an index is used on each, we can run:

SELECT relname, 
       -- NULLIF avoids a division-by-zero error for tables that have never been scanned
       100 * idx_scan / NULLIF(seq_scan + idx_scan, 0) percent_of_times_index_used, 
       n_live_tup rows_in_table 
FROM pg_stat_user_tables 
ORDER BY n_live_tup DESC;

Ideally, this value should be around 99% for tables with over 10,000 rows.
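If a large table were to show a much lower percentage, the usual remedy is to add an index on the columns used in its frequent WHERE clauses. A hypothetical sketch - the table and column names below are made up and not from our schema:

-- hypothetical: index lookups by account id on a large, frequently filtered table
CREATE INDEX CONCURRENTLY idx_payments_account_id ON payments (account_id);
-- CONCURRENTLY builds the index without blocking writes, at the cost of a slower build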

As you can see, in our case it was 99% for most of the tables with a higher number of rows. That looked fine, so we proceeded to the next step.

Cache-hit ratio

The cache hit ratio tells us how often our data is served from memory versus having to go to disk. Serving from memory is orders of magnitude faster than going to disk, so the more data we can keep in memory the better. Watching our cache hit ratio and ensuring it is at 99% is a good indicator of proper performance. If we find ourselves with a ratio significantly lower than 99%, then we likely want to consider increasing the cache available to our database.

To find the cache hit ratio, we run the following command:

SELECT 
  sum(heap_blks_read) as heap_read,
  sum(heap_blks_hit)  as heap_hit,
  sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) as ratio
FROM 
  pg_statio_user_tables;

Notice that the cache hit ratio in our case was just 0.35%, which is bad. That was the first problem we encountered.

The solution

The reason the cache hit rate was low is that the database didn't have enough space allocated for its internal cache buffer. This buffer is used for loading table rows from disk into memory, and if there isn't enough space allocated, the database server will constantly have to fetch data from disk. The buffer size is configured with the shared_buffers setting, which tells the database how much of the machine's memory it can allocate for keeping data in memory. The default is incredibly low (128 MB).

Checking the current shared_buffers setting

To check the current setting of shared_buffers, we can run the following command:

core=# SELECT name, setting, unit, min_val, max_val, context FROM pg_settings WHERE name = 'shared_buffers';

Here we are concerned with the setting column, which is 16384. Also note that the unit is 8kB, which means we need to convert the setting value: 16384 * 8 = 131072 kB, which we can then convert into MB, i.e. 131072 / 1024 = 128 MB - the default value for shared_buffers.
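A quicker way to see the same value already converted into a human-readable unit is:

SHOW shared_buffers;   -- prints 128MB by default, no manual unit conversion needed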

Updating shared_buffers

The first step is to figure out how much memory we can allocate, and for that we need to find out how much is available in the system. To find the available memory, we used the free command.

As you can see, we had enough memory available, about 13 GB. The convention is to allocate 25% of RAM on a dedicated server, but in our case this was not a dedicated DB server, so we allocated about 15% (an admittedly arbitrary number) of available memory, which comes to 2439777.6 kB ~= 2383 MB.

shared_buffers can be set in the postgresql.conf file (the Postgres configuration file); we can get the location of this file by running the command SHOW config_file. Once we changed the setting, we had to restart the Postgres server - and eventually the whole blockchain network - to apply the change. To verify the change, we reran the SELECT query from the previous section.
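As an aside, if you prefer not to edit the configuration file by hand, the same change can be made from psql with ALTER SYSTEM (available since PostgreSQL 9.4). This is just a sketch of that alternative, using the value we derived above; either way, shared_buffers only takes effect after a full server restart:

ALTER SYSTEM SET shared_buffers = '2383MB';
-- the setting is written to postgresql.auto.conf; restart the server for it to take effect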

We also re-checked the cache hit ratio using the command we used earlier; see the figure below. This time we got a 99% cache hit ratio.

The effect

Once the hit ratio was improved, we first ran the performance test on a 3-node network using JMeter with around 8k transactions and got around 480 TPS - an improvement of about 30-40% in TPS.

Next, we tried 20k transactions with all 6 nodes up and got around 700 TPS this time. See the figure below.

Conclusion

There is more room for optimisation on the database side as well as in the overall system, and what we did here is probably the first thing an experienced database administrator would do on such a project. I wrote this blog for those who, like me, are not very experienced in database administration, and to share some of my learning from the assignment. I hope it will help people out there who are getting into improving the performance and scalability of a system.

References