๐ฉ๐ผโ๐ซ Know your compound ๐จ๐ปโ๐ซ
Imagine the following table in Postgres:
(continued in comments)
# [ $davids.sh ] ยท message #198
๐ฉ๐ผโ๐ซ Know your compound ๐จ๐ปโ๐ซ
Imagine the following table in Postgres:
(continued in comments)
@ [ $davids.sh ] ยท # 764
CREATE TABLE some_table ( id UUID PRIMARY KEY, foreign_id UUID, created_at TIMESTAMP WITH TIMEZONE )
There is an index:
CREATE INDEX some_idx ON some_table (foreign_id, created_at);
Which query (and approximately by how much) will be faster:
SELECT foreign_id, MAX(created_at) FROM some_table WHERE created_at > now() - interval '1 minute' GROUP BY foreign_id;
OR
SELECT foreign_id, MAX(created_at) FROM some_table WHERE created_at > now() - interval '1 minute' AND foreign_id IN (SELECT id FROM foreign_table) GROUP BY foreign_id;
Vote ๐ for the first and ๐ for the second
And don't be afraid to be wrong, it's cool to see people trying
Answer in
3
2
1
You thought I'd give the answer? Hell no, I'll tell you how it works, and you should figure out the answer yourself
In short, compound indexes are stored by the database from left to right, meaning if you have an index (a, b, c) and you are searching by column (a) or (a, b), the index will work, but if you skip any of the columns, the index will not be used.
IMPORTANT: Each DB has its own way of handling compound indexes, so check your documentation.
Why you need to know this: because someone who doesn't know would go and create a new index, which would slow down all writes, while someone who knows would figure it out instantly and even speed up operations with this index (because now another query allows utilizing the cache of this index).
Bonus question: what type of index will be used in the nested SELECT if id is the primary key?
@ Ivan ITK ๐ซ ยท # 765
Here, using a covering index is correct because we don't need to join with the main table for the rows selected from the index, which would be an unnecessary operation.
@ Ivan ITK ๐ซ ยท # 766
And the correct answer is, an index is only needed here on created_at, and 1 query will filter normally.
This is why I love CockroachDB, that it can also hash such inefficient indexes as timestamp for efficient searching, because using an index on timestamp is generally a bad idea; it dilutes the index to almost a 1:1 ratio with the original data.
@ Ivan ITK ๐ซ ยท # 767
P.S. I double-checked, assuming there are no other columns in the table, meaning it completely matches the provided example, then there will be no difference in indexes between ("created_at") INCLUDE("foreign_id") and ("created_at", "foreign_id").
Although, there is a difference in the cost of reading data, which might play a role under other conditions, with Covering, less data is read.
@ [ $davids.sh ] ยท # 769
"there is an index" - meant that it is needed in other places, so if something needs to be changed, you can swap places in the compound so that old queries use the index and this new one without a hack.
@ Ivan ITK ๐ซ ยท # 770
Yes, you wrote correctly about the column order, but that's not the only nuance)
@ [ $davids.sh ] ยท # 771
About covering? What does it give in this situation?
@ Ivan ITK ๐ซ ยท # 772
With a standard index, after retrieving data from the index, we merge it with the original table for the selected rows. With a covering index, this is not done if all columns for selection are already in the index.
@ [ $davids.sh ] ยท # 773
Isn't that INCLUDES?
@ Ivan ITK ๐ซ ยท # 774
yes, that's it, I corrected it incorrectly in the syntax above with STORING) * corrected
@ [ $davids.sh ] ยท # 775
Listen, what is this beautiful visualization?
@ Ivan ITK ๐ซ ยท # 776
this is a built-in cockroach board for distsql (explain analyze(distsql)), here's a nice example
@ Ivan ITK ๐ซ ยท # 777
An indispensable tool for debugging distributed requests in production
@ [ $davids.sh ] ยท # 778
"time stamp split by hash" - can you tell me more?
I might be mistaken, but it seems like the timestamp would be stored as a tree where the root is the year, then a few nodes down is the month, then the week, then the hour, then the minute, then the second. In that case, yes, the results in the tree would be 1:1, but the tree itself would be much smaller.
So, if the nodes were Unix timestamps, I'd be shocked)
I don't understand how this can be optimized.
@ Ivan ITK ๐ซ ยท # 779
For pg, there's this https://explain.depesz.com and this https://tatiyants.com/pev/#/plans
@ Ivan ITK ๐ซ ยท # 780
No, the date value is stored in Unix epoch microseconds, and the default index, I'm not following pg closely enough to know what it is right now.
@ Ivan ITK ๐ซ ยท # 781
According to CockroachDB, here's the docs with a description, this applies to any monotonically increasing values to have a split point for distribution across nodes.
@ Ivan ITK ๐ซ ยท # 782
BRIN indexes are used there https://www.crunchydata.com/blog/postgresql-brin-indexes-big-data-performance-with-minimal-storage
@ Ivan ITK ๐ซ ยท # 783
https://www.postgresql.org/docs/current/indexes-types.html#INDEXES-TYPES-BRIN also has its storage scheme for monotonically increasing data.
@ [ $davids.sh ] ยท # 784
I've been ignoring it for a very long time) I'll read up on how it works, thank you!