Rebuilding Tables online using pg_repack in PostgreSQL

April 15, 2024

/

Oracle to PostgreSQL, PostgreSQL, PostgreSQL Extensions, SQL Server to PostgreSQL

Any database engine, including PostgreSQL, requires some routine maintenance to ensure optimal performance. Especially when there are massive delete or purge operations to delete rows from a Table, the fragmentation builds up within a table. Reclaim this fragmented space from disk requires a table to be rebuilt. At HexaCluster, we had many Customers asking us if there are similar capabilities in PostgreSQL to rebuild tables online while migrating from Oracle to PostgreSQL. In this article, we shall discuss about the extension called : pg_repack to rebuild tables online in PostgreSQL and see how it can be compared with DBMS_REDEFINITION in Oracle, for rebuilding tables online. Oracle Database 12c Release 2 offers much more efficient options for rebuilding tables and partitions online using its Online Table Move feature.

When and Why should we rebuild tables in PostgreSQL ?

You might have come across the term called VACUUM if you have deployed PostgreSQL as the database engine for your applications. VACUUM is one of the key maintenance operations in PostgreSQL, responsible for cleaning up dead tuples from the table pages, thereby freeing up space. This space can either be re-used for future insertions and updates or remain unused and lead to fragmentation. With massive data purge operations involving deletion of a significant percentage of data from a table, the amount of fragmentation may be significantly larger. Eventually, these fragmented pages increase the size of the table and utilize more disk space than required. Such fragmentation might distribute tuples across an unnecessarily large number of pages, creating performance bottlenecks. To mitigate these issues, table rebuilds are sometimes necessary. Contact HexaCluster if you would like to get expert assistance here.

Why not use VACUUM FULL to rebuild tables in PostgreSQL ?

VACUUM FULL is always the safest and the standard method for rebuilding tables in PostgreSQL. As an example, to rebuild a table called : migrate_from_oracle_to_postgres using vacuum full, we can use the following command.

psql -U postgres -d hexacluster -p 5432 -c "VACUUM FULL migrate_from_oracle_to_postgres"

Sometimes, you may notice that it is faster to rebuild a table and its indexes in Postgres using "VACUUM FULL" compared to other extensions. However, this command exclusively locks a table from READ and WRITE operations during the rebuild and thus not always considered suitable by database engineers in production environments.

Before discussing about an interesting extension in PostgreSQL to rebuild tables online, let us see how the same can be performed in Oracle.

Does Oracle have a Package to rebuild tables online ?

Yes, Enterprise editions (versions) of Oracle supports a package called : DBMS_REDEFINITION that can help rebuilding tables online. Through this package, Oracle also supports adding or dropping columns online and changes to the partition strategies of a table.

Please Note: Oracle Database 12c Release 2 offers more simpler and efficient options to move tables and their partitions online.

However, there are a variety of manual steps an Oracle database engineer has to monitor and execute sequentially, to achieve the supported features. As an example, to rebuild a table called : MIGRATE_TO_POSTGRES in Schema : HEXACLUSTER online using DBMS_REDEFINITION in Oracle, following steps are involved.

Please Note : There are 6 manual steps involved in successfully rebuilding a Table online in Oracle.

Verify if the table can be rebuilt online

BEGIN
DBMS_REDEFINITION.CAN_REDEF_TABLE('HEXACLUSTER', 'MIGRATE_TO_POSTGRESQL');
END;

Create an Interim Table with the same structure as the existing Oracle table.

CREATE TABLE HEXACLUSTER.NEW_MIGRATE_TO_POSTGRESQL (
-- Define columns and data types as needed
);

Start the Redefinition process.

BEGIN
DBMS_REDEFINITION.START_REDEF_TABLE(
  uname             => 'HEXACLUSTER',
  orig_table        => 'MIGRATE_TO_POSTGRESQL',
  int_table         => 'NEW_MIGRATE_TO_POSTGRESQL'
);
END;

Copy the dependents such as Triggers, Constraints, Indexes and Grants from the original tables to the interim table.

BEGIN
DBMS_REDEFINITION.COPY_TABLE_DEPENDENTS(
  uname             => 'HEXACLUSTER',
  orig_table        => 'MIGRATE_TO_POSTGRESQL',
  int_table         => 'NEW_MIGRATE_TO_POSTGRESQL',
  copy_indexes      => 1,
  copy_triggers     => TRUE,
  copy_constraints  => TRUE,
  copy_privileges   => TRUE
);
END;

Synchronize the delta changes from the original table to the interim table. As the original table may be receiving changes due to live traffic, it is important to apply the changes to the interim table and keep it consistent.
```
BEGIN
DBMS_REDEFINITION.SYNC_INTERIM_TABLE(
  uname        => 'HEXACLUSTER',
  orig_table   => 'MIGRATE_TO_POSTGRESQL',
  int_table    => 'NEW_MIGRATE_TO_POSTGRESQL'
);
END;
```
Once the above step is completed, we can complete the redefinition process by executing the following final step that acquires a lock, synchronizes the changes from original table to interim table and then swaps the tables.
```
BEGIN
DBMS_REDEFINITION.FINISH_REDEF_TABLE(
  uname         => 'HEXACLUSTER',
  orig_table    => 'MIGRATE_TO_POSTGRESQL',
  int_table     => 'NEW_MIGRATE_TO_POSTGRESQL'
);
END;
```
Drop the interim table which is indeed the fragmented table, renamed due to swap performed in the previous step.

What about PostgreSQL ? Is there any Postgres extension to rebuild tables online ?

Yes, there exists an extension in PostgreSQL called : pg_repack, that acts like a smarter VACUUM FULL. It can be used to rebuild a table online and efficiently remove bloat and fragmentation while allowing concurrent read/write operations on that table.

Is rebuilding a table simple in PostgreSQL compared to Oracle ?

Definitely yes. It is simple to rebuild a table online in PostgreSQL compared to Oracle. This is because, all the 6 manual steps involved in DBMS_REDEFINITION for Oracle can be replaced with a single step in PostgreSQL using pg_repack.

As an example, the following steps can be simply used to rebuidl a table online in PostgreSQL using pg_repack.

-- To perform a dry-run and see whether we can rebuild table online
postgres@hexacluster:~$ pg_repack --table users -d hexacluster --dry-run

-- To execute and rebuild table online
postgres@hexacluster:~$ pg_repack --table users -d hexacluster
INFO: repacking table "public.users"

Steps involved in setting up pg_repack in PostgreSQL

In the following sections, we shall see the steps involved in setting up this extension and using it for rebuilding tables online.

Pre-requisites

To ensure efficient performance, the target table should possess either a primary key or at least a unique index covering a non-null column.
The pg_repack binary and the extension version installed in the database should match.
It only supports PostgreSQL 9.5 and above.
Executing a complete table repack necessitates available disk space roughly twice the size of the target table(s) and their associated indexes.
pg_repack will not work on postgresql catalog tables.
Only table owners and superusers can use pg_repack to rebuild the table.

Installing pg_repack extension

Installing pg_repack can be either done by using the package manager and PostgreSQL repository or we can do a source code installation.

Installing on RedHat/Rocky/AlmaLinux/CentOS/Fedora/Oracle Linux

Once you set the PGDG repository for Centos/Redhat you can use the below command.

yum install pg_repack_<PG_MAJOR_VERSION>.x86_64
Eg: yum install pg_repack_16.x86_64

Installing on Debian/Ubuntu systems

Setup the latest PGDG repository in the ubuntu machine where you need to install pg_repack

# Create the file repository configuration: 
sudo sh -c 'echo "deb https://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list' 

# Import the repository signing key: 
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add - 

# Update the package lists: 
sudo apt-get update

#install the software. Considering the target database as PostgreSQL 16, the command appears as following.
sudo apt-get install postgresql-16-repack

Creating the Extension in the database

Once the software is installed in the database server, we need to create the extension in the database in which we are going to rebuild the tables.

psql -U postgres -d hexacluster -p 5432 -c "CREATE EXTENSION pg_repack"

Rebuilding tables online with pg_repack

To rebuild a table with pg_repack we use the below command. The pg_repack utility includes a feature that allows it to assess whether a table is suitable for repacking through a dry run option.

Dryrun

postgres@hexacluster:~$ pg_repack --table users -d hexacluster --dry-run
INFO: Dry run enabled, not executing repack
INFO: repacking table "public.users"

Actual Execution

postgres@hexacluster:~$ pg_repack --table users -d hexacluster
INFO: repacking table "public.users"
postgres@hexacluster:~$

Actual Execution with Detailed logging:

Executing pg_repack with detailed logging can be done by passing –echo or -e to the command; this will show us the detailed steps that pg_repack does when triggered on a table.

postgres@hexacluster:~$ pg_repack -e --table users -d hexacluster

LOG: (query) SET search_path TO pg_catalog, pg_temp, public
LOG: (query) SET search_path TO pg_catalog, pg_temp, public
LOG: (query) select repack.version(), repack.version_sql()
LOG: (query) SET statement_timeout = 0
LOG: (query) SET search_path = pg_catalog, pg_temp, public
LOG: (query) SET client_min_messages = warning
LOG: (query) SELECT r FROM (VALUES ($1, 'r')) AS given_t(r,kind) WHERE NOT EXISTS(  SELECT FROM repack.tables WHERE relid=to_regclass(given_t.r)) AND NOT EXISTS(  SELECT FROM pg_catalog.pg_class c WHERE c.oid=to_regclass(given_t.r) AND c.relkind = given_t.kind AND given_t.kind = 'p')
LOG:    (param:0) = users
LOG: (query) SELECT t.*, coalesce(v.tablespace, t.tablespace_orig) as tablespace_dest FROM repack.tables t,  (VALUES (quote_ident($1::text))) as v (tablespace) WHERE (relid = $2::regclass) ORDER BY t.relname, t.schemaname
LOG:    (param:0) = (null)
LOG:    (param:1) = users
INFO: repacking table "public.users"
LOG: (query) SELECT pg_try_advisory_lock($1, CAST(-2147483648 + $2::bigint AS integer))
LOG:    (param:0) = 16185446
LOG:    (param:1) = 16452
LOG: (query) BEGIN ISOLATION LEVEL READ COMMITTED
LOG: (query) SET LOCAL lock_timeout = 100
LOG: (query) LOCK TABLE public.users IN ACCESS EXCLUSIVE MODE
LOG: (query) RESET lock_timeout
LOG: (query) SELECT pg_get_indexdef(indexrelid) FROM pg_index WHERE indrelid = $1 AND NOT indisvalid
LOG:    (param:0) = 16452
LOG: (query) SELECT indexrelid, repack.repack_indexdef(indexrelid, indrelid, $2, FALSE)  FROM pg_index WHERE indrelid = $1 AND indisvalid
LOG:    (param:0) = 16452
LOG:    (param:1) = (null)
LOG: (query) SELECT repack.conflicted_triggers($1)
LOG:    (param:0) = 16452
LOG: (query) SELECT repack.create_index_type(16458,16452)
LOG: (query) SELECT repack.create_log_table(16452)
LOG: (query) CREATE TRIGGER repack_trigger AFTER INSERT OR DELETE OR UPDATE ON public.users FOR EACH ROW EXECUTE PROCEDURE repack.repack_trigger('user_id')
LOG: (query) ALTER TABLE public.users ENABLE ALWAYS TRIGGER repack_trigger
LOG: (query) SELECT repack.disable_autovacuum('repack.log_16452')
LOG: (query) BEGIN ISOLATION LEVEL READ COMMITTED
LOG: (query) SELECT pg_backend_pid()
LOG: (query) SELECT pid FROM pg_locks WHERE locktype = 'relation' AND granted = false AND relation = 16452 AND mode = 'AccessExclusiveLock' AND pid <> pg_backend_pid()
LOG: (query) COMMIT
LOG: (query) BEGIN ISOLATION LEVEL SERIALIZABLE
LOG: (query) SELECT set_config('work_mem', current_setting('maintenance_work_mem'), true)
LOG: (query) SELECT coalesce(array_agg(l.virtualtransaction), '{}')   FROM pg_locks AS l   LEFT JOIN pg_stat_activity AS a     ON l.pid = a.pid   LEFT JOIN pg_database AS d     ON a.datid = d.oid   WHERE l.locktype = 'virtualxid'   AND l.pid NOT IN (pg_backend_pid(), $1)   AND (l.virtualxid, l.virtualtransaction) <> ('1/1', '-1/0')   AND (a.application_name IS NULL OR a.application_name <> $2)  AND a.query !~* E'^\\s*vacuum\\s+'   AND a.query !~ E'^autovacuum: '   AND ((d.datname IS NULL OR d.datname = current_database()) OR l.database = 0)
LOG:    (param:0) = 112350
LOG:    (param:1) = pg_repack
LOG: (query) DELETE FROM repack.log_16452
LOG: (query) SELECT pid FROM pg_locks WHERE locktype = 'relation' AND granted = false AND relation = 16452 AND mode = 'AccessExclusiveLock' AND pid <> pg_backend_pid()
LOG: (query) SET LOCAL lock_timeout = 100
LOG: (query) LOCK TABLE public.users IN ACCESS SHARE MODE
LOG: (query) RESET lock_timeout
LOG: (query) SELECT repack.create_table($1, $2)
LOG:    (param:0) = 16452
LOG:    (param:1) = pg_default
LOG: (query) INSERT INTO repack.table_16452 SELECT user_id,firstname,lastname,address,pincode,state,city,email,joining_date FROM ONLY public.users
LOG: (query) SELECT repack.disable_autovacuum('repack.table_16452')
LOG: (query) COMMIT
LOG: (query) CREATE UNIQUE INDEX index_16458 ON repack.table_16452 USING btree (user_id)
LOG: (query) SELECT repack.repack_apply($1, $2, $3, $4, $5, $6)
LOG:    (param:0) = SELECT * FROM repack.log_16452 ORDER BY id LIMIT $1
LOG:    (param:1) = INSERT INTO repack.table_16452 VALUES ($1.*)
LOG:    (param:2) = DELETE FROM repack.table_16452 WHERE (user_id) = ($1.user_id)
LOG:    (param:3) = UPDATE repack.table_16452 SET (user_id, firstname, lastname, address, pincode, state, city, email, joining_date) = ($2.user_id, $2.firstname, $2.lastname, $2.address, $2.pincode, $2.state, $2.city, $2.email, $2.joining_date) WHERE (user_id) = ($1.user_id)
LOG:    (param:4) = DELETE FROM repack.log_16452 WHERE id IN (
LOG:    (param:5) = 1000
LOG: (query) SELECT pid FROM pg_locks WHERE locktype = 'virtualxid' AND pid <> pg_backend_pid() AND virtualtransaction = ANY($1)
LOG:    (param:0) = {}
LOG: (query) SAVEPOINT repack_sp1
LOG: (query) SET LOCAL lock_timeout = 100
LOG: (query) LOCK TABLE public.users IN ACCESS EXCLUSIVE MODE
LOG: (query) RESET lock_timeout
LOG: (query) SELECT repack.repack_apply($1, $2, $3, $4, $5, $6)
LOG:    (param:0) = SELECT * FROM repack.log_16452 ORDER BY id LIMIT $1
LOG:    (param:1) = INSERT INTO repack.table_16452 VALUES ($1.*)
LOG:    (param:2) = DELETE FROM repack.table_16452 WHERE (user_id) = ($1.user_id)
LOG:    (param:3) = UPDATE repack.table_16452 SET (user_id, firstname, lastname, address, pincode, state, city, email, joining_date) = ($2.user_id, $2.firstname, $2.lastname, $2.address, $2.pincode, $2.state, $2.city, $2.email, $2.joining_date) WHERE (user_id) = ($1.user_id)
LOG:    (param:4) = DELETE FROM repack.log_16452 WHERE id IN (
LOG:    (param:5) = 0
LOG: (query) SELECT repack.repack_swap($1)
LOG:    (param:0) = 16452
LOG: (query) COMMIT
LOG: (query) BEGIN ISOLATION LEVEL READ COMMITTED
LOG: (query) SAVEPOINT repack_sp1
LOG: (query) SET LOCAL lock_timeout = 100
LOG: (query) LOCK TABLE public.users IN ACCESS EXCLUSIVE MODE
LOG: (query) RESET lock_timeout
LOG: (query) SELECT repack.repack_drop($1, $2)
LOG:    (param:0) = 16452
LOG:    (param:1) = 4
LOG: (query) COMMIT
LOG: (query) BEGIN ISOLATION LEVEL READ COMMITTED
LOG: (query) ANALYZE public.users
LOG: (query) COMMIT
LOG: (query) SELECT pg_advisory_unlock($1, CAST(-2147483648 + $2::bigint AS integer))
LOG:    (param:0) = 16185446
LOG:    (param:1) = 16452
postgres@hexacluster:~$

When it is run on a table which does not contain either a primary key or a not null unique key it throws an warning message like below and exits.

postgres@hexacluster:~$ pg_repack --table books -d hexacluster
WARNING: relation "public.books" must have a primary key or not-null unique keys

Note: When pg_repack is triggered for the entire database it first prepares a list of tables that contains a PRIMARY KEY or a NOT NULL UNIQUE Constraint and works only on those tables.

Steps performed by pg_repack internally

Step 1: Log Table Creation:

The tool begins its process by creating a log table to track changes made to the original table.
This log table will record INSERTs, UPDATEs, and DELETEs performed on the original table.

Step 2: Trigger Addition:

Next, pg_repack adds a trigger onto the original table.
This trigger is responsible for capturing and logging any INSERTs, UPDATEs, and DELETEs made to the original table into the log table created in the previous step.

Step 3: New Table Creation:

The next step is to create a new table that contains all the rows from the old table.
This new table serves as a fresh container for the data from the original table.

Step 4: Index Building:

After creating the new table, pg_repack constructs indexes on this new table.
Indexes are crucial for optimizing query performance by speeding up data retrieval operations.

Step 5: Applying Changes from Log Table:

pg_repack then applies all the changes logged in the log table to the new table.
This ensures that the new table is synchronized with the latest data modifications from the original table.

Step 6: Table Swap:

Once the new table is up-to-date with the changes from the log table, pg_repack swaps the tables.
This swap involves replacing the old table with the new one, including all associated indexes and toast tables.
The system catalogs are updated to reflect this table swap.

Step 7: Original Table Drop:

Finally, pg_repack drops the original table.
The old table is no longer needed as its data has been successfully transferred to the new table.

Locking Mechanism in pg_repack

Throughout the process, pg_repack acquires and releases locks on the original table.
Initially, it holds an ACCESS EXCLUSIVE lock during the setup (steps 1 and 2) and the final swap-and-drop phase (steps 6 and 7).
However, for the majority of its operation, pg_repack only requires an ACCESS SHARE lock on the original table.
This allows concurrent INSERTs, UPDATEs, and DELETEs to proceed without significant interruptions.

Important Arguments that needs to be considered in a production environment

-T SECS, –wait-timeout=SECS:
- Controls the duration (in seconds) pg_repack will wait to acquire exclusive locks at the beginning and end of the repacking process.
- If the lock cannot be obtained within this timeframe and the –no-kill-backend option is not specified, pg_repack will forcibly cancel conflicting queries.
- Default value is 60 seconds.
-D, –no-kill-backend:
- Allows pg_repack to skip repacking the table if the lock cannot be obtained within the duration specified by –wait-timeout.
- By default, conflicting queries are canceled if the lock cannot be acquired.
- In a production system where the user queries should not get affected or terminated using this option is highly recommended along with the higher –wait-timeout seconds.
-k, –no-superuser-check:
- Skips the superuser checks in the client.
- Useful for running pg_repack on platforms that support non-superuser execution.
- This option is useful where users don’t have superuser privileges. Ideally used in DBAAS environments like RDS etc.
-j, –jobs:
- Specifies the number of extra connections to PostgreSQL to create.
- These extra connections are utilized to parallelize the rebuild of indexes on each table during the repacking process.
- Parallel index builds are supported only for full-table repacks, not with –index or –only-indexes options.
- If the PostgreSQL server has additional cores and available disk I/O, this option can significantly accelerate pg_repack’s performance.

Conclusion

In conclusion, pg_repack emerges as a valuable tool for PostgreSQL database administrators seeking to manage disk space efficiently while minimizing downtime and performance impacts. It provides a robust solution for PostgreSQL users to maintain database health by efficiently rebuilding tables online, addressing fragmentation, and minimizing table locking.

With its capabilities, users can optimize database performance and storage utilization, ensuring smooth operation even in high-traffic environments.

Seeking specialized PostgreSQL support or Database Migrations assistance ? Get experts advice for effective data management, query tuning, database migrations and database optimization. Click here for personalized expertise to elevate your database performance and meet your unique business needs.

In our next article we will cover about pg-osc a tool for making schema changes in PostgreSQL tables with minimal locks, thus helping achieve zero downtime schema changes against production workloads.

Subscribe to our Newsletters and Stay tuned for more interesting topics.

Contact Us Today!

Author

Manisankar K

Manisankar is the Director of Database Operations at HexaCluster. Before joining HexaCluster, he worked as a Senior Database Manager at MigOps and as a Database Consultant at HCL. As a PostgreSQL expert, Mani has supported numerous customers in deploying PostgreSQL and published a number of technical articles in his free time. In addition to PostgreSQL, he is also skilled in other database technologies such as MySQL, Oracle, Cassandra and more. Manisankar is also well-versed in development languages such as Python, Groovy, and Perl, empowering him to create solutions that automate complex operations for Customers.

Add A Comment Cancel reply

Our PostgreSQL Services

Contact Us

Subscribe to our Free Newsletters

Our Machine Learning Services