The same is true to a lesser extent in MySQL / MariaDB. It does better since it doesn’t do oldest-to-newest tuple chains, but it’s still adding non-trivial work to the DB, much of which is effectively wasted if you don’t care about the visibility of the deleted (or soon-to-be deleted) tuples to other transactions.
I sincerely hope that Planetscale’s efforts succeed long-term to shift devs’ understanding and acceptance of RDBMS operations. Their blog posts and docs are generally quite good. IME, devs (and even ops-ish teams) simply do not care about all of this, and will create elaborate bespoke tooling to run DELETEs in bulk, because they either don’t understand the capabilities of the database, or don’t want to deal with the [minor] increased complexity that a partitioned schema brings, and will happily pay the extra cost / latency for deletions.
Surprisingly to remove small numbers of rows in multiple tables (e.g. cleanup between automated tests), DELETE is often faster than TRUNCATE! It's counterintuitive but just measure it for yourself and see. Note you can DELETE from multiple tables in one statement using CTEs, and that way you don't need to think about foreign key dependency order.
Only by a weird definition of "scalable". The first sentence says:
> Counterintuitively, large DELETEs add work to the database.
There is nothing counterintuitive about this. It takes just as much work to delete a row as it takes to insert a row. Why wouldn't it? Obviously you have to do almost all the same operations: write a log, write the deletion, update indices, replicate it, etc.
And yes, it's a well-known trick for all major relational databases (not just Postgres) that if you want to delete 90% of rows from a large a table, it's much faster to just copy the rows you want to keep to a new table, run DROP TABLE on the old table, and rename the new table to the old table. Since DROP TABLE is ~instantaneous, mainly involving table-level metadata.
DELETE scales just fine, in the sense that if you are constantly inserting and deleting individual rows, DELETE scales the same as INSERT.
Basic database functionality is designed around the assumption of lots of small transactions. Whenever you have to do something involving millions of rows at once, you generally need to investigate solutions that work well in "bulk". E.g. loading rows directly from a file rather than with SQL, adding indices only after the data has been loaded rather than before, disabling foreign key checks on large operations (if you know by design that the keys are valid)... and yes, taking advantage of DROP TABLE instead of DELETE. This doesn't mean small transactions aren't scalable, it just means bulk operations are qualitatively different and benefit from their own solutions. And DELETE is no different from INSERT in this regard.
> It takes just as much work to delete a row as it takes to insert a row. Why wouldn't it? Obviously you have to do almost all the same operations: write a log, write the deletion, update indices, replicate it, etc.
It takes far more work to delete/update than insert. My recent example is updating ~2TB of text data was about 40x slower than inserting 12TB (was trying to correct some large text truncation that occurred during migration into PG, ended up being faster to redo).
> It takes far more work to delete/update than insert.
Updating rows of text data is going to be more work, because variable-length text can't be updated in-place. So in terms of allocating space, it's more like a delete plus an insert. That's not surprising.
I'm not aware of instances where a delete is "far more work" than an equivalent insert though. That's not the general case, and I'm having a hard time thinking of any situations where that would be true.
> And yes, it's a well-known trick for all major relational databases (not just Postgres) that if you want to delete 90% of rows from a large a table, it's much faster to just copy the rows you want to keep to a new table, run DROP TABLE on the old table, and rename the new table to the old table.
Dumb question but why does the optimizer not just do that in secret then? Seems like something that should be detectable with some heuristics.
I assume partly because that would be extremely surprising behavior, and depending on the RDBMS and version, could introduce unexpected stalls. For example, MySQL < 8.0.23 scans the entire buffer pool to clear pages that were dropped, which can take a long time on large instances. There is / was a similar issue with its adaptive hash index, which AFAIK wasn’t ever fixed, though AHI’s default being shifted to OFF in 8.4 is a workaround, in a very hacky way.
Maintaining the expected observable behaviours would get complicated if queries (especially other updates) against the same table are happening concurrently.
Partially true but too much of a blanket statement and clickbaity.
DELETE with well-tuned autovacuum works pretty well. Have seen it work at TBs scale with no hicuups. If DELETEs are large, we used to recommend customers to follow that with a manual VACUUM for table to reclaim space right away for future rows.
DROP TABLE can be risky, it requires an ACCESS EXCLUSIVE LOCK and if its waiting, it blocks all other statements following it, because of how lock queues work in Postgres. And you cannot keep doing high concurrent DROP TABLEs to run your large scale CRUD app.
This generalizes to most (all?) databases. Selective deletion is largely an unsolved problem at scale in databases to the extent it doesn't release the deleted resources. Under the hood databases try to turn this into selective resource truncation, which scales much better, but in most cases that is not possible without careful design of your data model.
Similarly, you often have to remind devs that in many databases an UPDATE is just an INSERT + DELETE, with all of the scaling issues implied.
CRUD apps don't usually delete in bulk. It's also hard to structure partitions in a way that doesn't wipe out months of important business data -- this is why teams often ETL their DB into Snowflake/ClickHouse and only then drop partitions. That makes it hard for the app to use that data again.
The better approach is either to change your storage engine (e.g. OrioleDB is working on adding the undo log to Pg), or to shard which distributes the vacuum load across multiple servers.
They should be performing bulk deletions, due to GDPR: “Data must be stored for the shortest time possible.” Unless you have some kind of rolling cron checking every few minutes (and even then, depending on your scale, that may well be considered bulk), that generally resolves to something like daily or weekly deletions.
Years ago work was bit by the analogous thing in MySQL. Like it usually does, it took a chain of events:
- We wrote a cronjob to periodically DELETE for a retention policy on a table we'd just created. Most senior person on the team reviewed it, looked fine.
- Unusually for us, we prioritize QA'ing a different feature for release, delaying the release of this cronjob and a bunch of other code.
- During that delay, the new table accumulated many times more rows to be deleted than we'd expected during review.
- Release happens. All looks well since the initial delete wasn't a migration and cronjob hasn't run yet; engineer doing the release signs off.
- Cronjob runs, deleting hundreds of millions of rows quickly.
- Next day, replica lag's high and MySQL's transaction history is very high. MySQL keeps transaction history around until purge threads have visited all the affected pages on disk.
- The bad cluster conditions last for days and lead to other problems.
This omits detail and the 'noise' of everything else we were watching. But it gets the idea across as far as how the code and MySQL behaved.
Like most exciting events, it led to multiple changes to avoid a repeat. For retention policies, our new approach was one at the end of PlanetScale's post, to partition and drop old partitions. Transitioning to this from a huge unpartitioned table can be fun!
If a table is append-only and already huge, with lots of rows already past the retention threshold, you might only copy the rows to be kept to the new partitioned table: copy what you can, lock tables, do a last catch-up copy and swap tables. (Roughly the blog's 'performant one-off delete'.)
If the table's merely kind of big, gh-ost or such could allow you to ALTER without causing lag, locking, etc.
At a scale below that, you could run a slow incremental 'nibble' delete while watching server stats, and a step below that, plain ALTERs or DELETEs are fine.
Running partitioned has fun bits. Like: in MySQL, the partition key has to be part of any unique index. You have to to remember that if you're using INSERT..ON DUPLICATE KEY UPDATE and relying on dupe key errors to ttrigger the update. Things stay interesting!
I hear Vitess shops like PlanetScale usually don't run multi-terabyte myqsld instances in the first place: even when physical nodes are big, they run many smaller mysqlds on them. That wouldn't eliminate all this--I think big deletes would still scale worse than copy-swap-drop--but it does seem real handy for taming issues that tend to worsen with mysqld size, like replication lag. All to say, little bit jelly of their setup over there!
been exploring clickhouse and while it is definitely not a general purpose DB, for time-series shaped data that can survive some insert latency, the automatic partition-based TTL is very nice and, at least so far, requires zero attention to maintain
which I guess is solved by `pg_partman` at the bottom of the post
The same is true to a lesser extent in MySQL / MariaDB. It does better since it doesn’t do oldest-to-newest tuple chains, but it’s still adding non-trivial work to the DB, much of which is effectively wasted if you don’t care about the visibility of the deleted (or soon-to-be deleted) tuples to other transactions.
I sincerely hope that Planetscale’s efforts succeed long-term to shift devs’ understanding and acceptance of RDBMS operations. Their blog posts and docs are generally quite good. IME, devs (and even ops-ish teams) simply do not care about all of this, and will create elaborate bespoke tooling to run DELETEs in bulk, because they either don’t understand the capabilities of the database, or don’t want to deal with the [minor] increased complexity that a partitioned schema brings, and will happily pay the extra cost / latency for deletions.
Surprisingly to remove small numbers of rows in multiple tables (e.g. cleanup between automated tests), DELETE is often faster than TRUNCATE! It's counterintuitive but just measure it for yourself and see. Note you can DELETE from multiple tables in one statement using CTEs, and that way you don't need to think about foreign key dependency order.
Only by a weird definition of "scalable". The first sentence says:
> Counterintuitively, large DELETEs add work to the database.
There is nothing counterintuitive about this. It takes just as much work to delete a row as it takes to insert a row. Why wouldn't it? Obviously you have to do almost all the same operations: write a log, write the deletion, update indices, replicate it, etc.
And yes, it's a well-known trick for all major relational databases (not just Postgres) that if you want to delete 90% of rows from a large a table, it's much faster to just copy the rows you want to keep to a new table, run DROP TABLE on the old table, and rename the new table to the old table. Since DROP TABLE is ~instantaneous, mainly involving table-level metadata.
DELETE scales just fine, in the sense that if you are constantly inserting and deleting individual rows, DELETE scales the same as INSERT.
Basic database functionality is designed around the assumption of lots of small transactions. Whenever you have to do something involving millions of rows at once, you generally need to investigate solutions that work well in "bulk". E.g. loading rows directly from a file rather than with SQL, adding indices only after the data has been loaded rather than before, disabling foreign key checks on large operations (if you know by design that the keys are valid)... and yes, taking advantage of DROP TABLE instead of DELETE. This doesn't mean small transactions aren't scalable, it just means bulk operations are qualitatively different and benefit from their own solutions. And DELETE is no different from INSERT in this regard.
how does that solution work if the table that is dropped has foreign key constraints?
> It takes just as much work to delete a row as it takes to insert a row. Why wouldn't it? Obviously you have to do almost all the same operations: write a log, write the deletion, update indices, replicate it, etc.
It takes far more work to delete/update than insert. My recent example is updating ~2TB of text data was about 40x slower than inserting 12TB (was trying to correct some large text truncation that occurred during migration into PG, ended up being faster to redo).
> It takes far more work to delete/update than insert.
Updating rows of text data is going to be more work, because variable-length text can't be updated in-place. So in terms of allocating space, it's more like a delete plus an insert. That's not surprising.
I'm not aware of instances where a delete is "far more work" than an equivalent insert though. That's not the general case, and I'm having a hard time thinking of any situations where that would be true.
> And yes, it's a well-known trick for all major relational databases (not just Postgres) that if you want to delete 90% of rows from a large a table, it's much faster to just copy the rows you want to keep to a new table, run DROP TABLE on the old table, and rename the new table to the old table.
Dumb question but why does the optimizer not just do that in secret then? Seems like something that should be detectable with some heuristics.
I assume partly because that would be extremely surprising behavior, and depending on the RDBMS and version, could introduce unexpected stalls. For example, MySQL < 8.0.23 scans the entire buffer pool to clear pages that were dropped, which can take a long time on large instances. There is / was a similar issue with its adaptive hash index, which AFAIK wasn’t ever fixed, though AHI’s default being shifted to OFF in 8.4 is a workaround, in a very hacky way.
Maintaining the expected observable behaviours would get complicated if queries (especially other updates) against the same table are happening concurrently.
It drops dependents.
Partially true but too much of a blanket statement and clickbaity.
DELETE with well-tuned autovacuum works pretty well. Have seen it work at TBs scale with no hicuups. If DELETEs are large, we used to recommend customers to follow that with a manual VACUUM for table to reclaim space right away for future rows.
DROP TABLE can be risky, it requires an ACCESS EXCLUSIVE LOCK and if its waiting, it blocks all other statements following it, because of how lock queues work in Postgres. And you cannot keep doing high concurrent DROP TABLEs to run your large scale CRUD app.
This generalizes to most (all?) databases. Selective deletion is largely an unsolved problem at scale in databases to the extent it doesn't release the deleted resources. Under the hood databases try to turn this into selective resource truncation, which scales much better, but in most cases that is not possible without careful design of your data model.
Similarly, you often have to remind devs that in many databases an UPDATE is just an INSERT + DELETE, with all of the scaling issues implied.
CRUD apps don't usually delete in bulk. It's also hard to structure partitions in a way that doesn't wipe out months of important business data -- this is why teams often ETL their DB into Snowflake/ClickHouse and only then drop partitions. That makes it hard for the app to use that data again.
The better approach is either to change your storage engine (e.g. OrioleDB is working on adding the undo log to Pg), or to shard which distributes the vacuum load across multiple servers.
They should be performing bulk deletions, due to GDPR: “Data must be stored for the shortest time possible.” Unless you have some kind of rolling cron checking every few minutes (and even then, depending on your scale, that may well be considered bulk), that generally resolves to something like daily or weekly deletions.
Deleting whole file is faster then deleting rows in file.
TIL about pg_partman
https://github.com/pgpartman/pg_partman/blob/development/doc...
Years ago work was bit by the analogous thing in MySQL. Like it usually does, it took a chain of events:
- We wrote a cronjob to periodically DELETE for a retention policy on a table we'd just created. Most senior person on the team reviewed it, looked fine.
- Unusually for us, we prioritize QA'ing a different feature for release, delaying the release of this cronjob and a bunch of other code.
- During that delay, the new table accumulated many times more rows to be deleted than we'd expected during review.
- Release happens. All looks well since the initial delete wasn't a migration and cronjob hasn't run yet; engineer doing the release signs off.
- Cronjob runs, deleting hundreds of millions of rows quickly.
- Next day, replica lag's high and MySQL's transaction history is very high. MySQL keeps transaction history around until purge threads have visited all the affected pages on disk.
- The bad cluster conditions last for days and lead to other problems.
This omits detail and the 'noise' of everything else we were watching. But it gets the idea across as far as how the code and MySQL behaved.
Like most exciting events, it led to multiple changes to avoid a repeat. For retention policies, our new approach was one at the end of PlanetScale's post, to partition and drop old partitions. Transitioning to this from a huge unpartitioned table can be fun!
If a table is append-only and already huge, with lots of rows already past the retention threshold, you might only copy the rows to be kept to the new partitioned table: copy what you can, lock tables, do a last catch-up copy and swap tables. (Roughly the blog's 'performant one-off delete'.)
If the table's merely kind of big, gh-ost or such could allow you to ALTER without causing lag, locking, etc.
At a scale below that, you could run a slow incremental 'nibble' delete while watching server stats, and a step below that, plain ALTERs or DELETEs are fine.
Running partitioned has fun bits. Like: in MySQL, the partition key has to be part of any unique index. You have to to remember that if you're using INSERT..ON DUPLICATE KEY UPDATE and relying on dupe key errors to ttrigger the update. Things stay interesting!
I hear Vitess shops like PlanetScale usually don't run multi-terabyte myqsld instances in the first place: even when physical nodes are big, they run many smaller mysqlds on them. That wouldn't eliminate all this--I think big deletes would still scale worse than copy-swap-drop--but it does seem real handy for taming issues that tend to worsen with mysqld size, like replication lag. All to say, little bit jelly of their setup over there!
I have been using TRUNCATE allm the way, and is fast as very good.
Yep, partitions are the way to go there.
^ this
been exploring clickhouse and while it is definitely not a general purpose DB, for time-series shaped data that can survive some insert latency, the automatic partition-based TTL is very nice and, at least so far, requires zero attention to maintain
which I guess is solved by `pg_partman` at the bottom of the post