From my read of the article, this isn't a problem "in DNS." OP runs an uptime monitoring service that purports to check whether DNS can resolve your domain — but today OP learned that because he's hitting his ISP's recursive resolver, he doesn't notice downtime until the TTL of the previous response expires.
Solution (which everyone else does, and OP has now implemented): don't use your ISP's recursive resolver! Run your own instance of bind9 or whatever, with the cache disabled. Or it seems like `dig +trace` would probably work, too.
"Cached resources remain visible for their whole TTL, even if the original becomes unreachable or changes" seems like one of the very first principles someone should learn when deciding to go into business selling an uptime monitoring service.
It's not a "ghost domain," it's a Time-To-Live field!
I get where you're coming from, but I think the part you're missing is that caching is not optional (per the "with the cache disabled" comment). If you're running a service like theirs, you need to do caching to some degree.
My understanding of their implementation is that it does caching on a per-worker basis with lower TTLs, so it balances caching with accuracy, but doesn't "fix" visibility to the problem either. However, it narrows the window such that their method of monitoring will very likely expose a pulled domain sooner.
One approach to solving this for a very limited set of intervals is to actually block namespace that has been removed at the registry level. There is a paper on this from Raffaele Sommese:
Quad9 (9.9.9.9) consumes this feed from U Twente of "just deleted" names, as most of them are malicious, and blocking them even if they are NOT malicious causes zero harm. Currently, this is only names that are very short-lived, so may not catch the longer intervals where names are deleted and become ghosts.
Another model using something similar would be to specifically clear those "just-deleted" name cached entries out of the recursive resolver, but that is expensive. Also, with blocking instead of removal it is possible to get high-level metrics on how often those are being abused where NXDOMAIN tracking is not measured in the same dimensions.
From my read of the article, this isn't a problem "in DNS." OP runs an uptime monitoring service that purports to check whether DNS can resolve your domain — but today OP learned that because he's hitting his ISP's recursive resolver, he doesn't notice downtime until the TTL of the previous response expires.
Solution (which everyone else does, and OP has now implemented): don't use your ISP's recursive resolver! Run your own instance of bind9 or whatever, with the cache disabled. Or it seems like `dig +trace` would probably work, too.
"Cached resources remain visible for their whole TTL, even if the original becomes unreachable or changes" seems like one of the very first principles someone should learn when deciding to go into business selling an uptime monitoring service.
It's not a "ghost domain," it's a Time-To-Live field!
I get where you're coming from, but I think the part you're missing is that caching is not optional (per the "with the cache disabled" comment). If you're running a service like theirs, you need to do caching to some degree.
My understanding of their implementation is that it does caching on a per-worker basis with lower TTLs, so it balances caching with accuracy, but doesn't "fix" visibility to the problem either. However, it narrows the window such that their method of monitoring will very likely expose a pulled domain sooner.
One approach to solving this for a very limited set of intervals is to actually block namespace that has been removed at the registry level. There is a paper on this from Raffaele Sommese:
https://static.sched.com/hosted_files/icann83/5b/Rafaelle%20...
Quad9 (9.9.9.9) consumes this feed from U Twente of "just deleted" names, as most of them are malicious, and blocking them even if they are NOT malicious causes zero harm. Currently, this is only names that are very short-lived, so may not catch the longer intervals where names are deleted and become ghosts.
Another model using something similar would be to specifically clear those "just-deleted" name cached entries out of the recursive resolver, but that is expensive. Also, with blocking instead of removal it is possible to get high-level metrics on how often those are being abused where NXDOMAIN tracking is not measured in the same dimensions.
(disclaimer: I work for Quad9)
Technically every domain is a ‘ghost’ domain until TTL expires and NS RRs are usually cached for very long.