👩🏼‍🏫 Know your compound 👨🏻‍🏫

# [ $davids.sh ] · message #198

👩🏼‍🏫 Know your compound 👨🏻‍🏫

Imagine the following table in Postgres:

(continued in comments)

@ [ $davids.sh ] · # 764

CREATE TABLE some_table ( id UUID PRIMARY KEY, foreign_id UUID, created_at TIMESTAMP WITH TIMEZONE )

There is an index:

CREATE INDEX some_idx ON some_table (foreign_id, created_at);

Which query (and approximately by how much) will be faster:

SELECT foreign_id, MAX(created_at) FROM some_table WHERE created_at > now() - interval '1 minute' GROUP BY foreign_id;

OR

SELECT foreign_id, MAX(created_at) FROM some_table WHERE created_at > now() - interval '1 minute' AND foreign_id IN (SELECT id FROM foreign_table) GROUP BY foreign_id;

Vote 👍 for the first and 👎 for the second

And don't be afraid to be wrong, it's cool to see people trying

Answer in

3

2

1

You thought I'd give the answer? Hell no, I'll tell you how it works, and you should figure out the answer yourself

In short, compound indexes are stored by the database from left to right, meaning if you have an index (a, b, c) and you are searching by column (a) or (a, b), the index will work, but if you skip any of the columns, the index will not be used.

IMPORTANT: Each DB has its own way of handling compound indexes, so check your documentation.

Why you need to know this: because someone who doesn't know would go and create a new index, which would slow down all writes, while someone who knows would figure it out instantly and even speed up operations with this index (because now another query allows utilizing the cache of this index).

Bonus question: what type of index will be used in the nested SELECT if id is the primary key?
@ Ivan ITK 🚫 · # 765

Here, using a covering index is correct because we don't need to join with the main table for the rows selected from the index, which would be an unnecessary operation.
@ Ivan ITK 🚫 · # 766

And the correct answer is, an index is only needed here on created_at, and 1 query will filter normally.

This is why I love CockroachDB, that it can also hash such inefficient indexes as timestamp for efficient searching, because using an index on timestamp is generally a bad idea; it dilutes the index to almost a 1:1 ratio with the original data.
@ Ivan ITK 🚫 · # 767

P.S. I double-checked, assuming there are no other columns in the table, meaning it completely matches the provided example, then there will be no difference in indexes between ("created_at") INCLUDE("foreign_id") and ("created_at", "foreign_id").

Although, there is a difference in the cost of reading data, which might play a role under other conditions, with Covering, less data is read.
@ [ $davids.sh ] · # 769

"there is an index" - meant that it is needed in other places, so if something needs to be changed, you can swap places in the compound so that old queries use the index and this new one without a hack.
@ Ivan ITK 🚫 · # 770

Yes, you wrote correctly about the column order, but that's not the only nuance)
@ [ $davids.sh ] · # 771

About covering? What does it give in this situation?
@ Ivan ITK 🚫 · # 772

With a standard index, after retrieving data from the index, we merge it with the original table for the selected rows. With a covering index, this is not done if all columns for selection are already in the index.
@ [ $davids.sh ] · # 773

Isn't that INCLUDES?
@ Ivan ITK 🚫 · # 774

yes, that's it, I corrected it incorrectly in the syntax above with STORING) * corrected
@ [ $davids.sh ] · # 775

Listen, what is this beautiful visualization?
@ Ivan ITK 🚫 · # 776

this is a built-in cockroach board for distsql (explain analyze(distsql)), here's a nice example
@ Ivan ITK 🚫 · # 777

An indispensable tool for debugging distributed requests in production
@ [ $davids.sh ] · # 778

"time stamp split by hash" - can you tell me more?

I might be mistaken, but it seems like the timestamp would be stored as a tree where the root is the year, then a few nodes down is the month, then the week, then the hour, then the minute, then the second. In that case, yes, the results in the tree would be 1:1, but the tree itself would be much smaller.

So, if the nodes were Unix timestamps, I'd be shocked)

I don't understand how this can be optimized.
@ Ivan ITK 🚫 · # 779

For pg, there's this https://explain.depesz.com and this https://tatiyants.com/pev/#/plans
@ Ivan ITK 🚫 · # 780

No, the date value is stored in Unix epoch microseconds, and the default index, I'm not following pg closely enough to know what it is right now.
@ Ivan ITK 🚫 · # 781

According to CockroachDB, here's the docs with a description, this applies to any monotonically increasing values to have a split point for distribution across nodes.
@ Ivan ITK 🚫 · # 782

BRIN indexes are used there https://www.crunchydata.com/blog/postgresql-brin-indexes-big-data-performance-with-minimal-storage
@ Ivan ITK 🚫 · # 783

https://www.postgresql.org/docs/current/indexes-types.html#INDEXES-TYPES-BRIN also has its storage scheme for monotonically increasing data.
@ [ $davids.sh ] · # 784

I've been ignoring it for a very long time) I'll read up on how it works, thank you!