
RUM in PostgreSQL is a specialized inverted index access method for search workloads. It is based on GIN code, but unlike plain GIN it stores additional information in the posting tree, especially data needed for ranking, phrase/proximity matching, and ordered retrieval. That is the core idea behind RUM.

What RUM is

A normal GIN index is already very good at answering questions like “which rows contain these lexemes?” for tsvector/tsquery full-text search. The weakness is that GIN does not keep enough positional and attached information inside the index to make ranking, phrase checks, and certain ordered searches as efficient as they could be. RUM addresses that by storing extra per-match information directly in the index structure.

So, conceptually:

  • GIN = fast matching
  • RUM = fast matching plus better support for ranking, phrase/proximity, and ordered result retrieval
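For contrast, the plain GIN counterpart of the indexes discussed below would be created like this (assuming a hypothetical docs table with a tsvector column named fts, as in the later examples):

CREATE INDEX ix_docs_gin
ON docs
USING gin (fts);

Such an index answers @@ matches quickly, but ranking and phrase verification still have to fetch lexeme positions from the heap.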

Main benefits of RUM

According to the RUM documentation, compared to GIN, RUM has three headline advantages:

  • Faster ranking, because positional information is already in the index and the executor does not need an additional heap scan to retrieve lexeme positions.
  • Faster phrase search, for the same reason: phrase checks need positions.
  • Faster ordering by timestamp or other attached values, because extra information can be stored together with lexemes.

That is why RUM is mainly chosen for search systems, not for ordinary OLTP equality lookups.

What RUM is not

RUM is not a standard core access method in vanilla PostgreSQL installations. It is provided through the rum extension. In Postgres Pro documentation, CREATE INDEX lists RUM among available index methods in that distribution, and the dedicated rum module docs describe it as an extension module. Availability therefore depends on your server build and installed packages.

Working with RUM therefore usually starts with:

CREATE EXTENSION rum;

If that fails, the extension is not installed on the server.
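Before that, you can check whether the extension is available on the server at all by querying the catalog:

SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name = 'rum';

An empty result means the rum packages are not installed; a row with a NULL installed_version means the extension is available but CREATE EXTENSION has not yet been run in this database.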

Internal idea

RUM is an inverted index, like GIN. In an inverted index, the system maps a token or key element to the rows that contain it. RUM extends this by keeping extra information with those postings. In full-text search, that extra information commonly includes positions of lexemes, and optionally attached values used for ordering. That design is what allows RUM to avoid some post-index heap work that GIN often still needs.

This is the key reason phrase search is faster: for phrase or proximity search, you do not only need to know that both words exist; you need to know where they occur relative to each other. RUM keeps the information needed for that much closer to the index scan itself.
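That positional information is visible directly in ordinary tsvector output; each lexeme carries the word positions that RUM keeps in the index:

SELECT to_tsvector('english', 'secured loan with secured collateral');

This returns 'collater':5 'loan':2 'secur':1,4, and a phrase check such as secured <-> loan only needs to compare those position lists.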

Typical use cases

RUM is best when your workload looks like one of these:

  • full-text search with relevance ordering
  • phrase search
  • proximity search
  • “top N best matches” queries
  • search results ordered by a second attribute such as recency or timestamp
  • search-heavy applications where query quality matters more than insert speed

Examples:

  • article search engines
  • legal document search
  • banking/compliance search over narratives
  • knowledge bases
  • ticket/comment search
  • message archives with relevance + date ordering

 

Tradeoffs

RUM is not “GIN but always better.” The docs are clear that its strengths come with tradeoffs. Because it stores more information, RUM is generally heavier than GIN for maintenance. In practice that means slower index build and slower inserts/updates than plain GIN, while searches that need ranking/phrases/order can be significantly better.

So the usual tradeoff is:

  • Choose GIN when filtering speed is the main goal and writes matter a lot.
  • Choose RUM when search quality, phrase logic, ranking, or ordered retrieval matters more.

Operator classes

The RUM documentation has a dedicated operator-classes section, and in practice the most important ones are centered on full-text search, especially tsvector support. The most commonly discussed opclasses are:

  • rum_tsvector_ops
  • rum_tsvector_addon_ops

rum_tsvector_ops

This is the standard RUM operator class for full-text indexing of tsvector. It supports matching tsvector against tsquery, and it is the basic choice when you want faster phrase/ranking behavior than GIN can provide.

Example:

CREATE INDEX ix_docs_rum
ON docs
USING rum (fts rum_tsvector_ops);
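These examples assume a hypothetical docs table whose fts column is kept in sync with the text, for instance as a stored generated column (available since PostgreSQL 12):

CREATE TABLE docs (
    id         bigserial PRIMARY KEY,
    title      text,
    body       text,
    created_at timestamptz NOT NULL DEFAULT now(),
    fts        tsvector GENERATED ALWAYS AS
               (to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))) STORED
);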

rum_tsvector_addon_ops

This opclass is used when you want to attach an additional sortable value to the tsvector entries, such as a timestamp. That is what enables efficient “search and order by recency” behavior.

Pattern:

CREATE INDEX ix_docs_rum
ON docs
USING rum (fts rum_tsvector_addon_ops, created_at)
WITH (attach = 'created_at', to = 'fts');

This is one of the signature RUM features.
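With such an index in place, the rum README shows matching rows being returned ordered by distance from a reference timestamp, using the same <=> operator on the attached column (a sketch against the hypothetical docs table):

SELECT id, title, created_at
FROM docs
WHERE fts @@ to_tsquery('english', 'postgresql')
ORDER BY created_at <=> '2024-01-01 00:00:00'
LIMIT 20;

The index can supply rows in that order directly, instead of collecting all matches and sorting them afterwards.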

Query behavior

With RUM, full-text search still uses the normal PostgreSQL text-search model:

  • source text converted to tsvector
  • search expression built as tsquery
  • query uses @@ to match

Example:

SELECT id, title
FROM docs
WHERE fts @@ to_tsquery('english', 'postgresql & indexing');

Where RUM becomes especially valuable is when the query asks not only for matches, but also for results in a particular order.

For example, the Postgres Pro explanation describes RUM as being able to return results in the needed order, similarly to how GiST can support nearest-neighbor retrieval.

A ranking-style query can look like:

SELECT id,
       title,
       fts <=> to_tsquery('english', 'postgresql & indexing') AS dist
FROM docs
WHERE fts @@ to_tsquery('english', 'postgresql & indexing')
ORDER BY fts <=> to_tsquery('english', 'postgresql & indexing')
LIMIT 20;

That <=> behavior is one of the practical reasons people use RUM.
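For comparison, the usual GIN-based approach ranks with ts_rank, which has to read lexeme positions from the heap for every candidate row:

SELECT id,
       title,
       ts_rank(fts, to_tsquery('english', 'postgresql & indexing')) AS rank
FROM docs
WHERE fts @@ to_tsquery('english', 'postgresql & indexing')
ORDER BY rank DESC
LIMIT 20;

With RUM, the <=> distance version shown earlier can instead be driven by the index itself.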

Phrase and proximity search

Phrase search is where RUM is much more naturally suited than GIN, because phrase search depends on lexeme positions. The docs explicitly call out faster phrase search as a core benefit.

Examples:

SELECT id, body
FROM docs
WHERE fts @@ phraseto_tsquery('english', 'secured loan');

or

SELECT id, body
FROM docs
WHERE fts @@ to_tsquery('english', 'secured <-> loan');

These kinds of queries benefit from RUM because the index already carries the positional information needed to verify the phrase relationship efficiently.

Ordering by attached values

One of the strongest real-world RUM scenarios is this:

  • search documents by text
  • return the newest or otherwise best-ordered matching rows

RUM can attach an additional value, such as a timestamp, to the indexed lexeme information. This can make ordering by that attached value much faster than a standard GIN approach that finds matches and then performs more heap work to sort.

That makes RUM attractive for systems like:

  • news search
  • audit/event log search
  • customer communication search
  • issue trackers where latest relevant result matters

Multicolumn and advanced behavior

RUM supports more advanced indexing patterns, including multicolumn cases. Recent Postgres Pro release notes mention fixes related to scanning multi-column RUM indexes and order_by_attach, which confirms these capabilities are actively used and maintained.

That same release family also added low-level inspection functions for RUM pages, showing the access method continues to evolve in current Postgres Pro releases.