"Cloudflare CEO is lying" is a bit of an aggressive take when he linked to the exact data so you can see it for yourself - and that's how this article was able to analyze it: https://radar.cloudflare.com/traffic#bot-vs-human
"Thought it would be end of 2027, then early 2027, but agentic traffic growing so fast that bots have now passed human traffic online for the first time in the Internet's history."
But the quoted segment in the article was just "…bots passed human traffic online for the first time in the Internet’s history."
It looks to me like the data supports "bots passed human traffic" but does NOT support "agentic traffic", since more of that traffic is from AI crawlers building indexes than from agents that are browsing the web on behalf of their owners.
If that's the point the article is trying to make then the headline is a little more supported, though I'd still say it's too hype-y a headline.
I guess a lot of this rests on what you assume "agentic traffic" to mean.
It's aggressive but true. Spam traffic exceeded real email back in 2003. He's not even in the right DECADE to declare a first for non-human traffic. It's pure marketing BS that is only remotely true if you accept the conditions he goes to lengths to hide.
Misleading headline statement, no clarifying statement later that it is restricted to HTTP traffic, and graphs chosen to support the misleading message. Putting out the truth behind an asterisk isn't being honest.
> It looks to me like the data supports "bots passed human traffic"
I think you are missing the fact that the dashboard has HTML pre-selected as a filter. Once you change that to all content types, you’ll see humans account for twice as much traffic as bots.
Note this part of the article:
> The CEO ignored the all-traffic number, on his own dashboard, and instead published the HTML-only number as a fact about the whole internet.
Personally I think that limiting the stats to just HTML pages is a credible stat on which you could claim that "bots passed human traffic". I don't care about bots downloading other types of content.
The most important thing is to link to your methodology, which Cloudflare's CEO did in this case.
Contrast a human user with a naïve bot implementation when they come across an SPA. The human user will load the HTML once then fetch plenty of JSON as they use it. The bot will fetch the HTML and move on. If you filter by HTML only, then an extended human session lasting hours would be counted as equal to a bot immediately bouncing.
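To make the SPA point concrete, here is a rough sketch of how the same request log gives very different bot shares depending on the content-type filter. The log itself is made up for illustration (one human making 40 JSON calls, one bot fetching a single HTML page); none of these numbers come from Radar.

```python
from collections import Counter

# Hypothetical request log of (client_kind, content_type) pairs.
# One human uses an SPA: 1 HTML load followed by 40 JSON calls (assumed number).
# One naive bot fetches the HTML once and leaves.
requests = [("human", "text/html")] + [("human", "application/json")] * 40
requests += [("bot", "text/html")]


def bot_share(log, html_only):
    """Return the fraction of counted requests made by bots."""
    counted = [kind for kind, ctype in log if not html_only or ctype == "text/html"]
    counts = Counter(counted)
    return counts["bot"] / len(counted)


print(f"HTML only:   {bot_share(requests, html_only=True):.0%}")
print(f"All content: {bot_share(requests, html_only=False):.1%}")
```

Under these assumptions the HTML-only view counts the bot and the human as equals, while the all-content view makes the bot look negligible. Neither number is "the" answer; they measure different things.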
The tweet implied there was a shift but for some reason radar is cutting off everything except recent data for just this one time series so we can't verify his storyline.
I deal with scrapers that sometimes border on DDoSes for LessWrong. The amount of bot traffic varies greatly between sites; if you have more URLs you get more bot traffic (regardless of whether those URLs represent a deep content catalog, or useless URL parameter permutations). It's bad for LW because of the content-catalog depth.
It's easy to drastically underestimate the amount of bot traffic, because bots make efforts (of varying sophistication) to look human enough to evade blocking. That includes using fake user-agent strings corresponding to real browsers (often but not always with implausibly old version numbers), proxying through residential IPs, and sometimes using full headless browsers. In my own data, traffic from badly behaved browser-impersonation bots exceeds traffic from named scrapers like GPTBot by something like 10x.
The measured percentage of bot traffic is higher for HTML than for other content types because many bots will load an HTML page, and then not load the JS/CSS/image/etc resources it references. But these are the least-sophisticated and most-detectable bots.
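A minimal sketch of that heuristic, assuming access-log rows have already been parsed into (client, path) pairs. The client IDs, paths, suffix list, and the `min_pages` threshold are all placeholders, and a real browser with a warm cache can also skip assets, so this only produces candidates for review.

```python
from collections import defaultdict

# Hypothetical parsed access-log rows: (client_id, request_path).
rows = [
    ("a", "/post/1"), ("a", "/static/site.css"), ("a", "/static/app.js"),
    ("b", "/post/1"), ("b", "/post/2"), ("b", "/post/3"),
]

# Assumed list of suffixes that mark a request as a page subresource.
ASSET_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif", ".svg", ".woff2")


def html_only_clients(log, min_pages=3):
    """Return clients that requested at least min_pages pages and zero assets."""
    pages = defaultdict(int)
    assets = defaultdict(int)
    for client, path in log:
        if path.lower().endswith(ASSET_SUFFIXES):
            assets[client] += 1
        else:
            pages[client] += 1
    return sorted(c for c, n in pages.items() if n >= min_pages and assets[c] == 0)


print(html_only_clients(rows))
```

As the comment says, this catches only the least sophisticated bots; anything running a headless browser will fetch the assets and pass.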
When it comes to residential IPs, that you mentioned, these can only be afforded by scrapers that were specifically made for your website and have a financial incentive. I don't believe that someone would spend money on residential IPs just to crawl the entire internet.
Browser/IP impersonation bots come from DC network, and there are a dozen or so ASNs where they typically live.
General crawlers (SEO tools, search engines, Meta, Alibaba, etc.) usually follow robots.txt.
The result: the real pain is only the first category, where data from your website has some financial value. But this isn't an infinite number of bots; depending on the business, it's a countable number.
On our preproduction site, for example, Meta is the only big-tech crawler that accesses it, at least with an honest user agent. (Meta also accesses disallowed paths on the production site.)
There is an unfortunate incentive created when a "business" (MiTM) depends on "bot traffic", i.e., the continued nuisance of bot traffic, to make money
If the "bot traffic" declines, then the "bot protection business" goes down with it
Cloudflare communications are sometimes careful to refer to traffic _labeled as_ bot traffic versus actual bot traffic
Because the "business" relies on the existence of "bot traffic", there's an incentive to broaden the scope of what is labeled as "bot traffic"
The false positive rate can be high. The public should see those statistics, though in truth it may be infeasible to compile them when there's no verification and the entire system relies on heuristics
"Bot protection" can be used to gather fingerprints for marketing
It can be used to force users to use certain software, e.g., certain browsers, and to enable Javascript subjecting users to data collection, surveillance and ads
Originally the motivation for avoiding "bot traffic" was based on behaviour, e.g., exceeding acceptable rates of usage, making too many requests in a given time period, exceeding rate limits
Now it's available to exclude traffic based on criteria such as what browser someone is using. NB. This is more than "user-agent string". The company forces people to sign NDAs before telling them what it is doing to fingerprint www users
If residential proxies are the problem then why not go after the companies that provide them
The truth is that those companies are not the problem. Their customers are so-called "tech" companies
Perhaps it's these so-called "tech" companies that are the problem
Certainly the problem is not the individual www user who doesn't use an "approved" graphical, Javascript-enabled browser who gets blocked or fingerprinted trying to make a single request
But that's who suffers from "bot protection" so that so-called "tech" companies can profit from data collection, surveillance and ads
> Originally "bot traffic" was based on behaviour, e.g., exceeding acceptable rates of usage, making too many requests in a given time period, exceeding rate limits
> Now it's available to exclude traffic based on criteria such as what browser someone is using
I'm pretty sure user-agent-based bot detection predates every request-rate-based method by quite a few years.
>It can be used to force users to use certain software, e.g., certain browsers, and to enable Javascript subjecting users to data collection, surveillance and ads
>Certainly the problem is not the individual www user who doesnt use an "approved" graphical, Javascript-enabled browser who gets blocked or fingerprinted trying to make a single request
The alternatives to javascript fingerprinting are either ineffective (TLS fingerprinting and/or IP rate limits), or even worse for privacy (eg. attestation).
>If residential proxies are the problem then why not go after the companies that provide them
> The alternatives to javascript fingerprinting are either ineffective (TLS fingerprinting and/or IP rate limits), or even worse for privacy (eg. attestation).
Javascript fingerprinting is itself ineffective: these kinds of checks only stop the most basic bots, and I'd argue the same for attestation.
I run some moderate profile gov and ngo opendata sites, and I’d say that bot like traffic is 99% of the requests we’re seeing on some sites.
Mostly current valid user agents, lots of ip addresses, but the traffic patterns are not organic. I’m not clear if it’s bad ai scraping or dos, but at some level it’s indistinguishable.
I host my website on Cloudflare and looking at my stats, I can confirm that bot traffic is not only up, it's totally insane.
There is nothing on the site, just my name and an email account and it should have no traffic at all.
Instead in the last 30 days I've had 26K requests. 93% resulting in a 4xx error as most of these bots seem to be looking for vulnerabilities in various platforms like WordPress.
I tested this theory not long ago and did not see anything that aligned with the hype around bots. [1] There are indeed more bots than humans, or at least there appear to be, because of course there are. Bots crawl everything linked from popular sites, whereas humans only click on things that interest them, and even then they do not typically siphon the entire site. There are new bot operators every day due to curiosity and FOMO.
The only thing I saw that could possibly be construed as abusive were some poorly configured RSS bots. Even when my server told the bot that the page would not change for 4 hours the RSS bots would check every 10 minutes meaning they are ignoring the cache-control header. This was entirely harmless, just slightly annoying. The RSS bots are not new. Most of the bots are not even trying to disguise themselves as humans. Most bots are not programmed to parse cache-controls, rel tags or fetch robots.txt meaning they only follow the pirate code. A bot will do what a bot can do.
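For anyone who wants to check their own logs for this, a rough sketch of the comparison: pull `max-age` out of the Cache-Control value the server sent and see whether a client came back sooner than that. The header value and poll times below are invented to match the 4-hour / 10-minute example; timestamps are assumed to be plain seconds.

```python
import re


def max_age_seconds(cache_control):
    """Extract max-age (in seconds) from a Cache-Control value, or None if absent."""
    match = re.search(r"max-age\s*=\s*(\d+)", cache_control, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None


def ignores_cache(fetch_times, cache_control):
    """True if any two consecutive fetches are closer together than max-age."""
    max_age = max_age_seconds(cache_control)
    if max_age is None:
        return False
    ordered = sorted(fetch_times)
    return any(later - earlier < max_age for earlier, later in zip(ordered, ordered[1:]))


header = "public, max-age=14400"  # 4 hours, as in the example above
polls = [0, 600, 1200, 1800]      # a client polling every 10 minutes

print(ignores_cache(polls, header))
```

This only looks at fetch spacing. It does not distinguish full downloads from cheap conditional requests, which a log with status codes could separate out.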
I was expecting the bots to mirror a couple git repositories I exposed but they did not go deeper than the README.md. None of them. I think this is the same pattern of catastrophization that exists around AI dooming the world and I don't know why it is spreading. I guess it must work or people would not do it.
My employer's site was recording 1,500 requests per second from a single AI bot earlier this week. The requests came from 2.4 million different IPs at the time I looked, between 1-2 requests from each IP, most likely all were unique URLs. That single bot was 55% of traffic. This kind of crawling pushes us to (sometimes beyond) the limit of our capacity.
I have also seen thousands of requests per hour from the same IP to a small set of pages, e.g. the homepage. I don't know why; it doesn't matter so I ignore it.
I've recently found there are websites offering curated "AI ready" datasets, and several of these sites claim to have indexed our site. On the 3-4 I looked at, ours was one of a few hundred datasets. It's interesting enough to be something an AI company would want, so my conclusion is the site is being specifically targeted by the AI bot developers.
> the site is being specifically targeted by the AI bot developers
As I was reading your comment it sounded like a targeted attack. I think you are right that it was targeted. I assume you have done research on what content could be rate limited by URI target vs. source IP and give people a message saying content temporarily unavailable due to AI bot attack?
Is the concern that your site is being DDoS'd or that they are reselling your copyrighted material? If reselling, I would get corporate lawyers involved and seek damages (I am not a lawyer). Feds could subpoena some of the providers for the identity of the attackers.
If the concern is DDoS, has your team done any analysis of the clients to see what they have in common? Based on the number of IP addresses you are talking about, I assume it must be from wireless carriers. Have you looked at TCP SYN TTL and other characteristics? If there is anything in common, those connections could be routed internally to another listener that has tighter rate limits. That would mean cellular users might find some content unavailable until the bots go away, or they might randomly get one of a dozen different captchas or JavaScript puzzles to access each document until the storm subsides. The puzzles could probably be regenerated hourly by AI to keep the attackers on their toes. Another option would be to require an account to access the documents and limit the number of documents each account can download per hour, day, and/or week, then add more friction to account creation or limit account creation to the address space of countries you do business with, after blocking most proxies and VPN providers.
Another option to limit the blast zone of an attack is to block countries that one does not do business in but that depends on your business model.
CDN's like Cloudflare are not doing anything magic. If they can block the bots so can just about anyone else. Without seeing samples of the attacks I could not make many more suggestions.
It's not copyrighted data (academia / library stuff), so the concern is the DDoS. If researchers write to us, we send them a database dump or export, but only one AI company has ever written.
So far there's always been some pattern to allow a block/challenge, e.g. user agent, JA3 / JA4, ASN. (I haven't looked at TCP SYN TTL before.) Usually the IPs are 80% or so in one country (e.g. Brazil, US, Vietnam or India) with the rest all over the world, mostly consumer ISPs although I haven't distinguished between fixed line and mobile.
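As a sketch of what such a block/challenge rule set can look like once the attributes are available per request: every value below is a placeholder (the user-agent substring and fingerprint string are invented, and 64496 is an ASN reserved for documentation), and computing a JA3/JA4 fingerprint is assumed to happen upstream in the proxy or TLS terminator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    user_agent: str
    tls_fingerprint: str  # e.g. a JA3 or JA4 string computed upstream (assumed)
    asn: int


# Placeholder rule values, to be replaced with patterns seen in your own logs.
BLOCKED_UA_SUBSTRINGS = ("examplebot",)
CHALLENGED_FINGERPRINTS = {"placeholder-fingerprint-1"}
CHALLENGED_ASNS = {64496}  # documentation-reserved ASN, not a real network


def decide(req):
    """Return 'block', 'challenge', or 'allow' for a request."""
    ua = req.user_agent.lower()
    if any(fragment in ua for fragment in BLOCKED_UA_SUBSTRINGS):
        return "block"
    if req.tls_fingerprint in CHALLENGED_FINGERPRINTS or req.asn in CHALLENGED_ASNS:
        return "challenge"
    return "allow"


print(decide(Request("Mozilla/5.0 ExampleBot/1.0", "other", 64500)))
print(decide(Request("Mozilla/5.0", "placeholder-fingerprint-1", 64500)))
print(decide(Request("Mozilla/5.0", "other", 64500)))
```

Challenging rather than blocking on fingerprint or ASN is a design choice here, since those signals are shared with legitimate users far more often than an explicit bot user agent is.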
We tried Cloudflare for a couple of months, on a paid plan, which I think blocked many of the non-distributed crawlers, but didn't help much with these distributed ones.
Meanwhile we have been reducing the cost of rendering the pages.
> Usually the IPs are 80% or so in one country (e.g. Brazil, US, Vietnam or India)
It sounds like there is the option to at least reduce the load by up to 80%. That's at least a start.
This repo [1] is not perfect but it's a start. I would disable IPv6 access to the site after removing the IPv6 DNS records and waiting a day so that attackers are forced onto IPv4 and clone this repo [1] or use one of the GeoIP databases to limit access from/by specific countries.
That repo also contains known proxies. That may account for another percentage of that remaining 20%.
As the content is academic in nature I don't know how your team feels about blocking Tor, but there is also a list of many (not all) of the last 30 days of Tor exit nodes in that repo.
The blocks would not have to be permanent, just enabled during the storms if your team so desired.
Example of country IP addresses for Brazil [2]
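If you would rather do the check in application code than in a firewall, a minimal sketch with the standard-library `ipaddress` module looks like this. The CIDR ranges are documentation-reserved placeholders, not real country allocations; in practice the list would be loaded from whichever country or proxy list you trust. The `storm_mode` flag is a hypothetical switch so the block is only active during an attack.

```python
import ipaddress

# Placeholder ranges from documentation-reserved blocks; load a real list instead.
BLOCKED_CIDRS = ["192.0.2.0/24", "198.51.100.0/24", "2001:db8::/32"]
BLOCKED_NETWORKS = [ipaddress.ip_network(cidr) for cidr in BLOCKED_CIDRS]


def is_blocked(ip_text, storm_mode=True):
    """Return True if the address falls in a blocked range while storm_mode is on."""
    if not storm_mode:
        return False
    address = ipaddress.ip_address(ip_text)
    return any(
        address.version == network.version and address in network
        for network in BLOCKED_NETWORKS
    )


print(is_blocked("192.0.2.55"))
print(is_blocked("203.0.113.9"))
print(is_blocked("192.0.2.55", storm_mode=False))
```

A linear scan is fine for a few hundred ranges. For full country lists a firewall-level set or a prefix tree would be the more usual choice.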
As this is academia I don't know if there is a concept of service level agreement or a promise of availability, but during attacks the requests to specific URL's could be redirected to a static pre-compressed landing page served out of memory that says "Access to these documents limited during AI bot attacks, here is where to request a full download instead: "
I forgot to mention, many of the AI bots are limited to HTTP/1.1. During an attack your server could redirect those requests to a static page as well. Curious if you can tell from your access logs whether the majority of the attack is HTTP/1.1 or HTTP/2.
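A quick way to answer that from logs, assuming the common/combined log format where the quoted request line ends with the protocol version. The sample lines are invented, and how the version is written (for example `HTTP/2.0`) depends on the server's log configuration.

```python
import re
from collections import Counter

# Matches the quoted request line, e.g. "GET /path HTTP/1.1", and captures the protocol.
REQUEST_LINE = re.compile(r'"[A-Z]+ \S+ (HTTP/\d(?:\.\d)?)"')

# Made-up sample lines in common log format.
sample_log = '''\
192.0.2.1 - - [01/Jan/2026:00:00:01 +0000] "GET /doc/1 HTTP/1.1" 200 512
192.0.2.2 - - [01/Jan/2026:00:00:02 +0000] "GET /doc/2 HTTP/1.1" 200 512
198.51.100.7 - - [01/Jan/2026:00:00:03 +0000] "GET /doc/3 HTTP/2.0" 200 512
'''


def protocol_counts(log_text):
    """Count requests per HTTP protocol version found in the log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = REQUEST_LINE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


for protocol, count in protocol_counts(sample_log).most_common():
    print(protocol, count)
```

Comparing the split during an attack window against a quiet window would show whether HTTP/1.1 is a usable signal for your traffic.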
Some bots also do not hide that they are bots, in that they say they are a bot in their user-agent header. That would be too easy, so I doubt that is the case.
Anecdotally, on the site we manage we have easily seen 100x more traffic from bots than from humans in the past year. So much so that it is impacting our hosting costs.
All traffic includes stylesheets, JS, images and so on that your typical browser will faithfully retrieve to render a page but most bots or scrapers will ignore some or all of them.
"Lying" is not supported by the evidence. In the context of bot traffic on the web, looking at only GETs for HTML is a reasonable approach. If you're counting all requests for all assets then a single page view of nytimes.com would count 100x as much as one for HN.
I would assume a lot of people running websites tend to think in pageviews, especially when dealing with bots because images and CSS files tend to be "cheap" static content but HTML requests are often dynamically generated.
It's also a single tweet that links to the data used to "disprove" it. Would be a weird way to lie.
The article strikes me as quite uncharitable to characterize it as "a lie". I doubt this CEO just sat down and calculated he was going to write lies. While it's fine to call out he's wrong per his own fuller data set, it's quite a different thing than calling the person out as a "liar" in a rage-bait fashion.
The author is also not very good at interpreting data himself. He claims "the AI number is padded by counting Googlebot twice" and links to [1], but there is nothing on that page that could support that assertion. It looks like he misinterpreted this part: "Googlebot crawls for both search indexing and AI training and is included as a separate entry due to its crawl volume" (Googlebot was NOT included in the "AI bot" count, so it was not counted twice.)
Not sure if the Cloudflare CEO is lying, but I have a pixel deployed on tens of thousands of sites offering B2B solutions, and bot traffic overtook human traffic this year.
It's a tracking tool. You have a bunch of sites embed an image, and requests to those sites also make requests to said image, which you can use to start tracking a client. A single pixel is merely the cheapest image.
I recall Facebook doing it years ago, I imagine they still do.
A 'pixel' is an unobtrusive (as in, not seen by the user like a banner ad is seen) asset* served on a web page that can cause the user's user agent to make an affirmative web request from you, a third party, so you know someone was at the site serving your pixel.
Typically used for:
- tracking in general, as well as more specifically:
- retargeting
- conversion
* Note: Doesn't have to be a literal pixel, but a literal transparent pixel is least likely to get blocked. Serve your pixels from the end of a parameterized path (/some/param/or/other/pixel.gif) and it's not seen as query string tracking either.
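Roughly, the server side of the parameterized-path variant can be sketched as below. The `/<site_id>/<campaign>/pixel.gif` layout and both parameter names are hypothetical, and the function only builds a response tuple rather than running a real web server. The image bytes are a widely circulated 1x1 GIF.

```python
import base64
import struct

# A widely circulated 1x1 GIF, base64-encoded.
PIXEL_GIF = base64.b64decode("R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7")


def parse_pixel_path(path):
    """Split a hypothetical /<site_id>/<campaign>/pixel.gif path into parameters."""
    parts = [part for part in path.split("/") if part]
    if len(parts) != 3 or parts[-1] != "pixel.gif":
        return None
    return {"site_id": parts[0], "campaign": parts[1]}


def pixel_response(path):
    """Return (status, headers, body) for a pixel request."""
    if parse_pixel_path(path) is None:
        return 404, {}, b""
    # no-store so every page view triggers a fresh request to the third party.
    headers = {"Content-Type": "image/gif", "Cache-Control": "no-store"}
    return 200, headers, PIXEL_GIF


width, height = struct.unpack("<HH", PIXEL_GIF[6:10])
print(width, height, parse_pixel_path("/site42/spring/pixel.gif"))
```

The tracking itself comes from logging each request (path parameters, IP, user agent, referrer) on the third party's server, which is also why these pixels see bot traffic alongside human traffic.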
I concur and have been talking about this for a while.
The fact is, Cloudflare is a man-in-the-middle. That's their focus, that's their purpose.
They will limit your local crawler from accessing pages. They will demand you use their crawler.
They will decrypt your traffic if they get a warrant. They always decrypt your traffic anyway, but they will give it to state actors if they demand it.
That's not to say anyone should break the laws, but the issue right now is that intellectual property is incompatible with what is coming with AI.
I don't hate on Cloudflare because it's a bad service. It's actually pretty good, but the fundamental problem is they make their purpose to be a single choke point of all data on the Web.
The article is a bit too strong, aggressive I’d even say. Content is loaded only if the bot executes JavaScript and loads all content willingly. These do exist, but they are more expensive to run than a basic curl bot.
It’d make sense as you might not want your bot to load everything a real human would do (ie: analytics, ads, unrelated files, etc..) and only focus on the content.
Also, am I the only one surprised that bot traffic is not the majority already? For my site, it’s x100 bots for every human.
Cloudflare is junk. Their entire billion dollar service can't distinguish my (DAILY) GET request to mainstream news sites from bot traffic, nothing they say or do is of any value. I've had the same IP for decades.
Have you checked your IP address's reputation with a service such as ipqualityscore.com? If cloudflare thinks your traffic is bot traffic, it's likely that there is bot traffic you don't know about coming from your IP, either from a compromised device on your network or a sketchy VPN product.
Also, it doesn't apply to this person if they've had the same IP for years, but if your ISP rotates IPs frequently (mine does every time I reset the modem) or you use 5G and CGNAT is being used, it's almost a guarantee that your IP has been labeled as having been used by a residential proxy network.
So many people have sketchy TV boxes or other sketchy IoT devices that are a front for using your network to sell bandwidth to proxy networks.
However, CF is unnecessarily making people on 5G connections from desktops do turnstiles as it looks like a scraper using a mobile proxy. This will become more and more of an issue as more laptops have 5G modems in them. Not sure how this WAF IP fingerprinting model survives widespread CGNAT. I guess it will be an excuse to fingerprint us more intensely.
>However, CF is unnecessarily making people on 5G connections from desktops do turnstiles as it looks like a scraper using a mobile proxy. This will become more and more of an issue as more laptops have 5G modems in them.
What makes you think this is the cause, rather than something more straightforward like: CGNAT means more users are sharing IPs, and there's a higher chance that the IP pool gets contaminated by bad behavior? Apparently cloudflare tries to detect CGNAT pools and give them more leeway, but at the same time they can't give them unlimited leeway.
Some people think they're providing value. They do block some bot traffic (at the cost of many many false positives) and I'll bet that they're providing endless amounts of data to one or more three letter agencies. They also provide lots of value to scammers who use cloudflare to protect their phishing and malware pushing webpages from automated systems that would detect/report them while also making sure visitors have JS enabled so their browser exploits work.
It can be worse; they randomly block my uptime monitoring with 4xx and 5xx status codes once in two months or something like that, despite nothing changing.
Elaborate. Are you getting turnstile challenges? Or blocked entirely? Are you using a weird browser configuration or non-mainstream browser? Do you have extensions that lie about your software/hardware configuration?
That could be plausible deniability. I mean, CF is in fact keeping a tab on who is visiting which websites. Between them and Google, these two companies know everything about everyone.
Why did we start treating Cloudflare (a public, for-profit company) as the undisputed authority on anything related to the network layer of the internet in the first place?
Because they inserted themselves into almost everything we do online and basically managed to take control over it. Cloudflare should never have been allowed to man in the middle the entire internet, but now that they have they're the only man on earth with a dataset that size.
(I'm ex CF) This is backwards. Nobody "allowed" anything. CF serves a customer need. You can argue with the solution but you can't argue with the core problem. It's more healthy to start the conversation of _why_ CF services are valuable.
CF serves something it convinced customers they need.
Static blogs hiding behind bot protection (in some cases blocking legit users from GrapheneOS because it's difficult to fingerprint them) because someone convinced them they'll be DDoSed by bots otherwise is a loss to the Internet.
A lot of self-hosters running CF tunnels because they don't know better also contributes.
> It's more healthy to start the conversation of _why_ CF services are valuable.
Begging the question. It's what TFA is about - telling people they need CF.
> A lot of self-hosters running CF tunnels because they don't know better also contributes.
Are you saying CF documentation is better than Computer Science / Networking education resources? Why don't people know better? I thought the tunnels are mostly used to bypass NAT's.
> Static blogs hiding behind bot protection
I'm not sure what is the proportion of the static vs dynamic sites, but I would argue that for wordpress CF is adding real value.
> I thought the tunnels are mostly used to bypass NAT's.
While not free, you can do this with TCP HAProxy streams on a cheap VPS. A lot of people using them to bypass NAT don't realise that Cloudflare decrypts the traffic on the way - that's what I meant about them not knowing better.
It's not about being incompetent. Quite often in selfhosted subreddits and forums you will see people surprised that Cloudflare can see their traffic in plaintext.
Of course, they probably don't, but the fact that they can and that their policies now influence XX% of internet traffic is bad for the open internet.
I think it counts as "allowed" regardless of utility. CF is so massively over-weight on the internet that it's impossible to trust them with anything because if they can be forced to do something by a hostile government (hint: they can be!) then they can get away with it invisibly and affect billions of people.
That is something that should not be allowed to exist. It's one of the reasons monopolies (or even majority-opolies) are bad. It's a weapon hanging on the wall, waiting to be used.
Then I think the real question is why haven't any serious competitors emerged that can handle the essential services that Cloudflare provides?
Are there network effects like what happens with Microsoft in the business computing space? With Microsoft, I'm also aware of a great amount of anti-competitive behavior, and though I haven't seen that from Cloudflare personally and haven't heard accusations of it, I also haven't paid attention.
When I learned econ 101 in high school there was a concept of a "natural monopoly" like an electricity utility, a concept that was probably mostly post-hoc rationalization of the regulatory structures that were chosen a century ago, but it at least was a coherent narrative. I can't see any coherent narrative about Cloudflare's services being a natural monopoly. So I'm left wondering if they are just way better at what they do than anybody else, and perhaps the space isn't big enough to drive a competitor to enter it?
I hope somebody on HN has a much better explanation of this than I do.
I suspect a big part of it is that CF is running other businesses on the side, and offering basic features at a loss - they've artificially depressed the price of the service so it's hard to compete with them on only that service. It gets them a lot of customers who then buy other profitable things later.
Everyone using the free service likes that, of course, but honestly I wish we'd make it illegal to do. It's heavily used as a way to steal small markets simply by being successful in a different large one.
Seems more of a moral hazard of government intervention than of ventures whose economies of scale demand large market cap to most economically serve customer needs.
Regulation could conceivably disallow a single company to control such a significant portion of internet traffic. The parent can be interpreted as lamenting the absence of such regulation.
It's certainly a huge risk when one company's outage can take down so much of internet. Decentralization and the ability to route around damage was a core feature of the internet. The more we depend on the internet, the less acceptable that loss becomes.
cloudflare (and CDNs more generally) certainly started out as an own goal. Warnings were ignored and most people did the easy thing instead of the right thing.
I believe it is the site administrators who have inserted Cloudflare in between their sites and their users.
Usually it is done for rational reasons of establishing a protection against bots. But what is less rational, in my opinion, is when everyone uses the same provider for that.
Because it indirectly turns Cloudflare into a monopoly. And monopolies often converge to a state when they start to abuse their position.
The amount of people in these comments demonizing Cloudflare services or conflating their very existence with rubbish/trash/nonsense because you personally know better is wild.
Y'know what would be better? Making a site showcasing what tools are better than Cloudflare's services! And how to use them! And sharing said site so people know about them!
Yeah, I've seriously considered finding or building a CF-protected-detector browser extension to flag domains. Having one company MITMing so much traffic is straightforwardly dangerous, and not just an annoyance. We need competition.
I kind of do the same, not every time, but sometimes if I keep getting it on the same site, I seriously question how accurate their stats are without deep diving them more.
Switching to T-Mobile home internet has been eye-opening for me on how different the internet can be from person to person. You don't even get your own IPv4 address. Makes me realize the challenge behind blocking something like yt-dlp.
I confess a sad assumption that bot traffic has been far higher than we have admitted for a long time. Though maybe we would see different stats specific to social media sites, where bots astroturf like counts? It certainly feels like we have known for a long time that bots were a larger share of ad viewing than ad companies wanted to admit.
I don't understand what difference bots make. For me, a website (the public part) is a storefront. People walk down the street and see what's inside — that's the purpose. If something should not be available immediately, that's the private part of the store.
I've been monitoring bot traffic on digital platforms for over 10 years. Sure, the crawler share is growing, some even with malicious intentions, and those I detect and block.
I disagree that this pain is worth the cost of making real people spend their life on verification.
For ad views, the concern is specifically that people pay for clicks and views. That that can be so heavily influenced by bot traffic greatly undermines their value.
Same general idea goes for any of the algorithmic driven platforms. The algorithms are ostensibly intended to surface organically discovered things by watching how people interact with things. That they are so susceptible to distortion through bot farms should be a lot more acknowledged than it is. People trust them far more than they should.
There is also a general cost of running things concern. It isn't like it is completely free to execute on bot traffic.
For ads, I believe this must be a problem for ad platform owners.
If the digital platform's storefront is their business, they could afford to spend some budget on bot detection. Bots still come from data center networks, sometimes render pages incompletely, request resources in bulk, and show enough patterns to be flagged internally.
If we look at a medium website, most random crawlers will come from Amazon, Microsoft, DigitalOcean, Hetzner, OVH, and a few other DC networks — these can be blocked easily without harming real users. The rest can be detected and cleaned up, even manually.
The math is simple: 20,000 visits a day at 15 seconds each = ~83 hours a day lost watching a Cloudflare logo, just because someone doesn't want to dig into the logs. I don't buy it.
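Spelled out, with the 20,000 visits and 15 seconds taken from the comment above as given:

```python
visits_per_day = 20_000
seconds_per_challenge = 15

total_seconds = visits_per_day * seconds_per_challenge  # 300,000 seconds
total_hours = total_seconds / 3600

print(f"{total_hours:.1f} hours per day")
```

That works out to a little over 83 hours of cumulative visitor time per day under those assumptions.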
Largely agreed, though I think you are likely underestimating how hard this is to detect. In particular, it is true that many bots can be hosted in data centers, but it is somewhat trivial to launder that traffic through other sources. Malware, in particular, is what I have in mind. Maybe I'm wrong and that has largely gone away?
There is also a bit of mixed incentives. Yes, it is the ad platform that is getting abused. But it is also the ad platform that is charging people based on abused practices.
And it isn't like this is completely made up. Just look at how Facebook cost a ton of people their jobs during the "pivot to video" programs. I don't know all of the details, as I was thankfully not in any of the involved industries, but my understanding is it is fairly well documented.
Edit: I changed an "isn't" to "is." I think I was trying to reword at one point, but left it in a way that is opposite what I meant.
When most of your server capacity is going to answering the scrapers it matters. It's not that the stuff is hidden, it's that storefront being flooded with 10x as many customers as the fire code allows. And some of them go around asking your employees mindless questions. (Small forum I help moderate: we were getting hammered with what was probably some sort of AI that was taking search queries and feeding them into the forum search. Search is now registered users only.)
Well, the fun thing is that no one knows how much traffic of what kind they are getting when they use Cloudflare.
You get the numbers that Cloudflare tells you, but who knows if you can trust their stats after their CEO is apparently cherry-picking data to shape their product narrative?
That's the same CEO who just went on a wild tone-deaf layoff justification, classifying human employees into roles of either builder, seller, or measurer and saying he wants to get rid of everyone who "measures" the business...
I wouldn't trust a single thing coming out of his mouth.
Do people really expect CEOs to be knowledgeable about any technical details in 2026? My experience is that CEOs are getting increasingly out of touch with what their employees actually do and what their customers want.
"Cloudflare CEO is lying" is a bit of an aggressive take when he linked to the exact data so you can see it for yourself - and that's how this article was able to analyze it: https://radar.cloudflare.com/traffic#bot-vs-human
Update: I see the problem. Here's the full tweet: https://x.com/eastdakota/status/2062212701414187452
"Thought it would be end of 2027, then early 2027, but agentic traffic growing so fast that bots have now passed human traffic online for the first time in the Internet's history."
But the quoted segment in the article was just "…bots passed human traffic online for the first time in the Internet’s history."
It looks to me like the data supports "bots passed human traffic" but does NOT support "agentic traffic", since more of that traffic is from AI crawlers building indexes than from agents that are browsing the web on behalf of their owners.
If that's the point the article is trying to make then the headline is a little more supported, though I'd still say it's too hype-y a headline.
I guess a lot of this rests on what you assume "agentic traffic" to mean.
It's aggressive but true. Spam traffic exceeded real email back in 2003. He's not even in the right DECADE to declare a first for non-human traffic. It's pure marketing BS that is only remotely true if you accept the conditions he goes to lengths to hide.
What lengths did he go to in order to hide them?
Misleading headline statement, no later clarifying statement that it is restricted to HTTP traffic, graphs chosen to support the misleading message. Putting out the truth behind an asterisk isn't being honest.
> It looks to me like the data supports "bots passed human traffic"
I think you are missing the fact that the dashboard has HTML pre-selected as a filter. Once you change that to all content types, you’ll see humans account for twice as much traffic as bots.
Note this part of the article:
> The CEO ignored the all-traffic number, on his own dashboard, and instead published the HTML-only number as a fact about the whole internet.
Personally I think that limiting the stats to just HTML pages is a credible stat on which you could claim that "bots passed human traffic". I don't care about bots downloading other types of content.
The most important thing is to link to your methodology, which Cloudflare's CEO did in this case.
Contrast a human user with a naïve bot implementation when they come across an SPA. The human user will load the HTML once then fetch plenty of JSON as they use it. The bot will fetch the HTML and move on. If you filter by HTML only, then an extended human session lasting hours would be counted as equal to a bot immediately bouncing.
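To make the SPA point concrete, here is a small sketch with made-up numbers (one human session of 1 HTML load plus 50 JSON fetches, and three bots that each fetch one HTML page and leave). The `bot_share` helper and the traffic mix are hypothetical, not anything taken from Cloudflare's methodology:

```python
from collections import Counter

def bot_share(requests, content_types=None):
    """Fraction of requests made by bots, optionally restricted to some content types."""
    counts = Counter(
        who for who, ctype in requests
        if content_types is None or ctype in content_types
    )
    total = counts["bot"] + counts["human"]
    return counts["bot"] / total if total else 0.0

# Hypothetical traffic: (requester, content type) pairs.
requests = (
    [("human", "html")]            # one human loads the SPA shell once
    + [("human", "json")] * 50     # ...then uses it for a long session
    + [("bot", "html")] * 3        # three naive bots fetch the HTML and bounce
)

html_only = bot_share(requests, {"html"})  # 3 of 4 HTML requests -> 0.75
all_types = bot_share(requests)            # 3 of 54 requests -> about 0.056
```

With the same underlying visitors, the HTML-only filter reports bots as the majority while the all-content view reports them as a small minority, which is why the choice of filter matters so much here.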
The tweet implied there was a shift but for some reason radar is cutting off everything except recent data for just this one time series so we can't verify his storyline.
The graph only goes back to May 3rd 2026. I guess that was the start of humanity
I deal with scrapers that sometimes border on DDoSes for LessWrong. The amount of bot traffic varies greatly between sites; if you have more URLs you get more bot traffic (regardless of whether those URLs represent a deep content catalog, or useless URL parameter permutations). It's bad for LW because of the content-catalog depth.
It's easy to drastically underestimate the amount of bot traffic, because bots make efforts (of varying sophistication) to look human enough to evade blocking. That includes using fake user-agent strings corresponding to real browsers (often but not always with implausibly old version numbers), proxying through residential IPs, and sometimes using full headless browsers. In my own data, traffic from badly behaved browser-impersonation bots exceeds traffic from named scrapers like GPTBot by something like 10x.
The measured percentage of bot traffic is higher for HTML than for other content types because many bots will load an HTML page, and then not load the JS/CSS/image/etc resources it references. But these are the least-sophisticated and most-detectable bots.
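One of the signals mentioned above, a browser user-agent with an implausibly old version number, can be sketched roughly like this. The function name, the `max_lag` threshold and the "current" major version passed in are all assumptions for illustration; real detection would combine many signals and this check alone is easy to evade:

```python
import re

# Matches the major version in a Chrome-style user-agent token, e.g. "Chrome/79.0.3945.88".
CHROME_RE = re.compile(r"Chrome/(\d+)\.")

def looks_stale(user_agent, current_major, max_lag=20):
    """True if the UA claims a Chrome major version more than max_lag behind current_major."""
    match = CHROME_RE.search(user_agent)
    if match is None:
        return False  # not a Chrome-style UA; this heuristic has no opinion
    return current_major - int(match.group(1)) > max_lag

old_ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
          "(KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36")
recent_ua = old_ua.replace("Chrome/79.0.3945.88", "Chrome/139.0.0.0")

# current_major=140 is an assumed value, supplied by the caller.
flags = [looks_stale(ua, current_major=140) for ua in (old_ua, recent_ua, "curl/8.5.0")]
```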
thank you for maintaining LessWrong
When it comes to residential IPs, that you mentioned, these can only be afforded by scrapers that were specifically made for your website and have a financial incentive. I don't believe that someone would spend money on residential IPs just to crawl the entire internet.
Browser/IP impersonation bots come from DC network, and there are a dozen or so ASNs where they typically live.
General crawlers, from SEO, search engines, meta, alibaba, etc, usually follow robots.txt
The result: the real pain is only the first category, where data from your website has some financial value. But this isn't an infinite number of bots — depending on the business, they're a countable amount.
Meta comes through with a /24 worth of scrapers and ignores robots.txt. I'm inclined to poison my data with fake information about Zuckerberg.
Did you check IP addresses, are they all from AS32934?
Yes
57.141.0.42 - - [05/Jun/2026:19:50:19 +0000] "GET /mid/a017bc62-0982-42db-8403-241d69da8d0f@alexander-goetzenstein.my-fqdn.de HTTP/2.0" 303 0 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/craw...)"
57.141.0.48 - - [05/Jun/2026:19:50:22 +0000] "GET /group/comp.os.linux.advocacy/a/a236f5a5-63a4-4982-8bb6-07ffc684201b@googlegroups.com HTTP/2.0" 200 34838 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/craw...)"
57.141.0.55 - - [05/Jun/2026:19:50:23 +0000] "GET /group/alt.recovery.aa/a/ne6onq%24hpp%241@dont-email.me HTTP/2.0" 200 5606 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/craw...)"
57.141.0.56 - - [05/Jun/2026:19:50:24 +0000] "GET /group/aioe.news.assistenza/a/qpukie%241i1g%241@neodome.net?view=headers HTTP/2.0" 200 17027 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/craw...)"
57.141.0.36 - - [05/Jun/2026:19:50:29 +0000] "GET /group/alt.obituaries/a/uf8pej%241hqi1%241@news.xmission.com HTTP/2.0" 200 6123 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/craw...)"
57.141.0.66 - - [05/Jun/2026:19:50:29 +0000] "GET /group/comp.theory/a/v3640k%24vg63%243@dont-email.me HTTP/2.0" 200 148720 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/craw...)"
And I assume you have
User-agent: meta-externalagent
Disallow: /
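For reference, this is how a crawler that does honor robots.txt would be expected to read those two lines, sketched with Python's standard `urllib.robotparser`. It only shows what the rules say; it proves nothing about what any particular crawler actually does. The path and the second bot name are placeholders:

```python
from urllib.robotparser import RobotFileParser

# The two-line rule set quoted above.
rules = """\
User-agent: meta-externalagent
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The product token before the "/" is what gets matched against User-agent lines.
meta_allowed = parser.can_fetch("meta-externalagent/1.1", "/group/comp.theory/")
other_allowed = parser.can_fetch("SomeOtherBot", "/group/comp.theory/")  # hypothetical bot name

print(meta_allowed, other_allowed)  # prints "False True"
```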
I have observed the same from Meta's crawler.
On e.g. our preproduction site, Meta is the only big-tech crawler that accesses it, at least with an honest user agent. (Meta also accesses disallowed paths on the production site.) They don't obey *, they don't get their own entry. I'd rather just poison their data, it's a well known behavior from them.
https://www.reddit.com/r/webdev/comments/1sdzd1q/metas_ai_cr...
Does LW have a downloadable archive? I can only find references to GreaterWrong but no public answer. Would be useful.
There is an unfortunate incentive created when a "business" (MiTM) depends on "bot traffic", i.e., the continued nuisance of bot traffic, to make money
If the "bot traffic" declines, then the "bot protection business" goes down with it
Cloudflare communications are sometimes careful to refer to traffic _labeled as_ bot traffic versus actual bot traffic
Because the "business" relies on the existence of "bot traffic", there's an incentive to broaden the scope of what is labeled as "bot traffic"
The false positive rate can be high. The public should see those statistics, and in truth it may be infeasible to compile them when there's no verification and the entire system relies on heuristics
"Bot protection" can be used to gather fingerprints for marketing
It can be used to force users to use certain software, e.g., certain browsers, and to enable Javascript subjecting users to data collection, surveillance and ads
Originally the motivation for avoiding "bot traffic" was based on behaviour, e.g., exceeding acceptable rates of usage, making too many requests in a given time period, exceeding rate limits
Now it's available to exclude traffic based on criteria such as what browser someone is using. NB. This is more than "user-agent string". The company forces people to sign NDAs before telling them what it is doing to fingerprint www users
If residential proxies are the problem then why not go after the companies that provide them
The truth is that those companies are not the problem. Their customers are so-called "tech" companies
Perhaps it's these so-called "tech" companies that are the problem
Certainly the problem is not the individual www user who doesn't use an "approved" graphical, Javascript-enabled browser who gets blocked or fingerprinted trying to make a single request
But that's who suffers from "bot protection" so that so-called "tech" companies can profit from data collection, surveillance and ads
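The behaviour-based approach described above (judging clients by request rate rather than by what browser they run) can be sketched as a sliding-window limiter. This is a minimal illustration with hypothetical names and limits, keyed on whatever client identifier the operator chooses; it is not a description of how any particular vendor does it:

```python
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most max_requests per client within any window_seconds interval."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client id -> timestamps of accepted requests

    def allow(self, client, now):
        hits = self._hits[client]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True

limiter = SlidingWindowLimiter(max_requests=3, window_seconds=10)
decisions = [limiter.allow("198.51.100.7", t) for t in (0, 1, 2, 3, 10)]
# The fourth request is refused; the fifth is accepted once the first has aged out.
```

As other comments in this thread point out, a limiter keyed on source IP does little against a crawl spread over millions of addresses, so this only covers the simple case.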
> Originally "bot traffic" was based on behaviour, e.g., exceeding acceptable rates of usage, making too many requests in a given time period, exceeding rate limits
> Now it's available to exclude traffic based on criteria such as what browser someone is using
I'm pretty sure user-agent-based bot detection predates every request-rate-based method by quite a few years.
>It can be used to force users to use certain software, e.g., certain browsers, and to enable Javascript subjecting users to data collection, surveillance and ads
>Certainly the problem is not the individual www user who doesnt use an "approved" graphical, Javascript-enabled browser who gets blocked or fingerprinted trying to make a single request
The alternatives to javascript fingerprinting are either ineffective (TLS fingerprinting and/or IP rate limits), or even worse for privacy (eg. attestation).
>If residential proxies are the problem then why not go after the companies that provide them
> The alternatives to javascript fingerprinting are either ineffective (TLS fingerprinting and/or IP rate limits), or even worse for privacy (eg. attestation).
Javascript fingerprinting itself is ineffective; these kinds of checks only stop the most basic bots, and I'd argue the same for attestation.
In most instances I've seen of people paying for Cloudflare, the main motivator was load balancing or DDoS protection. Obviously anecdotal ...
I run some moderate profile gov and ngo opendata sites, and I’d say that bot like traffic is 99% of the requests we’re seeing on some sites.
Mostly current valid user agents, lots of ip addresses, but the traffic patterns are not organic. I’m not clear if it’s bad ai scraping or dos, but at some level it’s indistinguishable.
If you can email me, I'd be happy to volunteer some help looking into this for your org, as we've made a tool to investigate bots (open-source).
I host my website on Cloudflare and looking at my stats, I can confirm that bot traffic is not only up, it's totally insane.
There is nothing on the site, just my name and an email account and it should have no traffic at all.
Instead in the last 30 days I've had 26K requests. 93% resulting in a 4xx error as most of these bots seem to be looking for vulnerabilities in various platforms like WordPress.
I tested this theory not long ago and did not see anything that aligned with the hype around bots. [1] There are indeed more bots than humans because of course there are or at least the appearance of. Bots crawl everything linked from popular sites whereas humans only click on things that interest them and even then they do not typically siphon the entire site. There are new bot operators every day due to curiosity and FOMO.
The only thing I saw that could possibly be construed as abusive were some poorly configured RSS bots. Even when my server told the bot that the page would not change for 4 hours the RSS bots would check every 10 minutes meaning they are ignoring the cache-control header. This was entirely harmless, just slightly annoying. The RSS bots are not new. Most of the bots are not even trying to disguise themselves as humans. Most bots are not programmed to parse cache-controls, rel tags or fetch robots.txt meaning they only follow the pirate code. A bot will do what a bot can do.
I was expecting the bots to mirror a couple git repositories I exposed but they did not go deeper than the README.md. None of them. I think this is the same pattern of catastrophization that exists around AI dooming the world and I don't know why it is spreading. I guess it must work or people would not do it.
[1] - https://blawg.nochan.net/b/Internet-Crap/20260522-Maybe-AI-B...
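For the RSS case described above, a poller that respected the server's caching hint would need little more than this. It is a minimal sketch: the function name and the 600-second fallback are assumptions, and a real client would also want to handle `Expires`, `ETag` and conditional requests:

```python
import re

# max-age directive inside a Cache-Control header value, e.g. "public, max-age=14400".
MAX_AGE_RE = re.compile(r"\bmax-age\s*=\s*(\d+)", re.IGNORECASE)

def seconds_until_next_poll(cache_control, default=600):
    """Return the max-age from a Cache-Control value, or default if none is present."""
    match = MAX_AGE_RE.search(cache_control)
    return int(match.group(1)) if match else default

four_hours = seconds_until_next_poll("public, max-age=14400")  # 14400 s = 4 hours
fallback = seconds_until_next_poll("no-store")                 # no max-age -> 600
```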
My employer's site was recording 1,500 requests per second from a single AI bot earlier this week. The requests came from 2.4 million different IPs at the time I looked, between 1-2 requests from each IP, most likely all were unique URLs. That single bot was 55% of traffic. This kind of crawling pushes us to (sometimes beyond) the limit of our capacity.
I have also seen thousands of requests per hour from the IP to a small set of pages, e.g. the homepage. I don't know why; it doesn't matter so I ignore it.
I've recently found there are websites offering curated "AI ready" datasets, and several of these sites claim to have indexed our site, on the 3-4 I looked at it was one of a few hundred datasets. It's interesting enough to be something an AI company would want, so my conclusion is the site is being specifically targeted by the AI bot developers.
> the site is being specifically targeted by the AI bot developers
As I was reading your comment it sounded like a targeted attack. I think you are right that it was targeted. I assume you have done research on what content could be rate limited by URI target vs. source IP and give people a message saying content temporarily unavailable due to AI bot attack?
Is the concern that your site is being DDoS'd or that they are reselling your copyrighted material? If reselling, I would get corporate lawyers involved and seek damages. I am not a lawyer. Feds could subpoena some of the providers for the identity of the attackers.
If the concern is DDoS, has your team done any analysis of the clients to see what they have in common? Based on the number of IP addresses you are talking about, I assume it must be from wireless carriers. Have you looked at TCP SYN TTL and other characteristics? If there is anything in common, those connections could be routed internally to another listener that has tighter rate limits. That would mean cellular users might find some content unavailable until the bots go away, or they randomly get one of a dozen different captchas or random javascript puzzles to access each document until the storm subsides. The puzzles could probably be regenerated hourly by AI to keep the attackers on their toes. Another option would be to require an account to access the documents and limit the number of documents each account can download per hour and / or day and / or week, then add more friction to account creation or limit account creation to the address space of countries you do business with, after blocking most proxies and VPN providers.
Another option to limit the blast zone of an attack is to block countries that one does not do business in but that depends on your business model.
CDNs like Cloudflare are not doing anything magic. If they can block the bots so can just about anyone else. Without seeing samples of the attacks I could not make many more suggestions.
It's not copyright data (academia / library stuff), so the concern is the DDoS. If researchers write to us, we send them a database dump or export, but only one AI company has ever written.
So far there's always been some pattern to allow a block/challenge, e.g. user agent, JA3 / JA4, ASN. (I haven't looked at TCP SYN TTL before.) Usually the IPs are 80% or so in one country (e.g. Brazil, US, Vietnam or India) with the rest all over the world, mostly consumer ISPs although I haven't distinguished between fixed line and mobile.
We tried Cloudflare for a couple of months, on a paid plan, which I think blocked many of the non-distributed crawlers, but didn't help much with these distributed ones.
Meanwhile we have been reducing the cost of rendering the pages.
> Usually the IPs are 80% or so in one country (e.g. Brazil, US, Vietnam or India)
It sounds like there is the option to at least reduce the load by up to 80%. That's at least a start.
This repo [1] is not perfect but it's a start. I would disable IPv6 access to the site after removing the IPv6 DNS records and waiting a day so that attackers are forced onto IPv4 and clone this repo [1] or use one of the GeoIP databases to limit access from/by specific countries.
That repo also contains known proxies. That may account for another percentage of that remaining 20%.
As the content is academic in nature I don't know how your team feels about blocking Tor, but there is also a list of many (not all) of the last 30 days of Tor exit nodes in that repo.
The blocks would not have to be permanent, just enabled during the storms if your team so desired.
Example of country IP addresses for Brazil [2]
As this is academia I don't know if there is a concept of service level agreement or a promise of availability, but during attacks the requests to specific URL's could be redirected to a static pre-compressed landing page served out of memory that says "Access to these documents limited during AI bot attacks, here is where to request a full download instead: "
I forgot to mention, many of the AI bots are limited to HTTP/1.1. During an attack your server could redirect those requests to a static page as well. Curious if you can tell by your access logs if the majority of the attack is HTTP/1.1 or HTTP/2.0.
Some bots also do not hide that they are bots, in that they say they are a bot in their user-agent client header. That would be too easy so I doubt that is the case.
[1] - https://github.com/firehol/blocklist-ipsets/
[2] - https://github.com/firehol/blocklist-ipsets/blob/master/ipip...
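The HTTP/1.1 versus HTTP/2.0 question above can be answered from access logs with a short tally. This sketch assumes a combined-style log format like the lines posted earlier in the thread; the sample lines, IPs and user agents below are invented placeholders:

```python
import re
from collections import Counter

# Pulls the protocol out of the quoted request field, e.g. "GET /path HTTP/2.0".
REQUEST_RE = re.compile(r'"[A-Z]+ \S+ (HTTP/[\d.]+)"')

def protocol_counts(log_lines):
    """Count requests per HTTP protocol version, skipping lines that do not parse."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Invented sample lines using documentation IP ranges.
sample = [
    '203.0.113.7 - - [05/Jun/2026:19:50:19 +0000] "GET /a HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '203.0.113.8 - - [05/Jun/2026:19:50:20 +0000] "GET /b HTTP/1.1" 200 734 "-" "ExampleBot/1.0"',
    '198.51.100.4 - - [05/Jun/2026:19:50:21 +0000] "GET /c HTTP/2.0" 200 9001 "-" "Mozilla/5.0"',
    'malformed line',
]

counts = protocol_counts(sample)
http11_share = counts["HTTP/1.1"] / sum(counts.values())  # 2 of 3 parsed requests
```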
Anecdotally, on the site we manage we are easily seeing 100x as much traffic from bots as from humans in the past year. So much so that it is impacting our hosting costs.
All traffic includes stylesheets, JS, images and so on that your typical browser will faithfully retrieve to render a page but most bots or scrapers will ignore some or all of them.
A lot of folks in these comments complaining about Cloudflare, but not many suggesting alternative solutions.
"Lying" is not supported by the evidence. In the context of bot traffic on the web, looking at only GETs for HTML is a reasonable approach. If you're counting all requests for all assets then a single page view of nytimes.com would count 100x as much as one for HN.
I would assume a lot of people running websites tend to think in pageviews, especially when dealing with bots because images and CSS files tend to be "cheap" static content but HTML requests are often dynamically generated.
It's also a single tweet that links to the data used to "disprove" it. Would be a weird way to lie.
The article strikes me as quite uncharitable to characterize it as "a lie". I doubt this CEO just sat down and calculated he was going to write lies. While it's fine to call out he's wrong per his own fuller data set, it's quite a different thing than calling the person out as a "liar" in a rage-bait fashion.
The author is also not very good at interpreting data himself. He claims "the AI number is padded by counting Googlebot twice" and links to [1], but there is nothing on that page that could support that assertion. It looks like he misinterpreted this part: "Googlebot crawls for both search indexing and AI training and is included as a separate entry due to its crawl volume" (Googlebot was NOT included in the "AI bot" count, so it was not counted twice.)
[1] https://radar.cloudflare.com/year-in-review/2025
Not sure if the Cloudflare CEO is lying, but I have a pixel deployed on tens of thousands of sites offering B2B solutions, and bot traffic overtook human traffic this year.
What does pixel mean in this context?
It's a tracking tool. You have a bunch of sites embed an image, and requests to those sites also make requests to said image, which you can use to start tracking a client. A single pixel is merely the cheapest image.
I recall Facebook doing it years ago, I imagine they still do.
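As a rough illustration of the mechanism described above, the server side of a pixel is just an endpoint that notes something about the request and returns a tiny image. The handler below is a framework-free sketch with hypothetical names; it records only the `Referer` header, whereas real deployments typically log much more:

```python
import base64

# A commonly used 1x1 transparent GIF, embedded as base64.
PIXEL = base64.b64decode("R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7")

def handle_pixel_request(headers, seen):
    """Record which embedding page triggered the request, then return the image."""
    seen.append(headers.get("Referer", "(no referrer)"))
    response_headers = {"Content-Type": "image/gif", "Cache-Control": "no-store"}
    return 200, response_headers, PIXEL

seen_pages = []
status, response_headers, body = handle_pixel_request(
    {"Referer": "https://shop.example/cart"},  # hypothetical embedding page
    seen_pages,
)
```

The `no-store` header is there so the browser requests the image again on every page view instead of serving it from cache.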
https://advertising.amazon.com/resources/ad-policy/pixeling-...
A 'pixel' is an unobtrusive (as in, not seen by the user like a banner ad is seen) asset* served on a web page that can cause the user's user agent to make an affirmative web request from you, a third party, so you know someone was at the site serving your pixel.
Typically used for:
- tracking in general, as well as more specifically:
- retargeting
- conversion
* Note: Doesn't have to be a literal pixel, but a literal transparent pixel is least likely to get blocked. Serve your pixels from the end of a parameterized path (/some/param/or/other/pixel.gif) and it's not seen as query string tracking either.
https://en.wikipedia.org/wiki/Web_beacon, aka "tracking pixel", though these days it probably means a JS-based analytics reporting script.
Discussion: https://news.ycombinator.com/item?id=48387144
I concur and have been talking about this for a while.
The fact is, Cloudflare is a man-in-the-middle. That's their focus, that's their purpose.
They will limit your local crawler from accessing pages. They will demand you use their crawler.
They will decrypt your traffic if they get a warrant. They always decrypt your traffic anyway, but they will give it to state actors if they demand it.
That's not to say anyone should break the laws, but the issue right now is that intellectual property is incompatible with what is coming with AI.
I don't hate on Cloudflare because it's a bad service. It's actually pretty good, but the fundamental problem is they make their purpose to be a single choke point of all data on the Web.
That's not right. It never was.
Careful, I posted something similar in another Cloudflare thread and people came at me like lions.
They don't see anything wrong with one entity controlling most of the internet traffic
>They will limit your local crawler from accessing pages. They will demand you use their crawler.
Source? According to Cloudflare, their crawling service doesn't get any special treatment from their WAF/CDN.
The article is a bit too strong, aggressive I’d even say. Content is loaded only if the bot executes JavaScript and loads all content willingly. These do exist, but they are more expensive to run than a basic curl bot.
It’d make sense as you might not want your bot to load everything a real human would do (ie: analytics, ads, unrelated files, etc..) and only focus on the content.
Also, am I the only one surprised that bot traffic is not the majority already? For my site, it’s 100 bots for every human.
Cloudflare is junk. Their entire billion dollar service can't distinguish my (DAILY) GET request to mainstream news sites from bot traffic, nothing they say or do is of any value. I've had the same IP for decades.
Have you checked your IP address's reputation with a service such as ipqualityscore.com? If cloudflare thinks your traffic is bot traffic, it's likely that there is bot traffic you don't know about coming from your IP, either from a compromised device on your network or a sketchy VPN product.
>it's likely that there is bot traffic you don't know about coming from your IP
Maybe from your IP block even. More common since you don't control that.
Google and Cloudflare mark whole ISPs as "sketchy" and there's fuck all you can do about it.
Also, it doesn't apply to this person if they've had the same IP for years, but if your ISP rotates IPs frequently (mine does every time I reset the modem) or you use 5G and CGNAT is being used, it's almost a guarantee that your IP has been labeled as having been used by a residential proxy network.
So many people have sketchy TV boxes or whatever other sketchy IoT device that is a larp for using your network to sell bandwidth to proxy networks.
However, CF is unnecessarily making people on 5G connections from desktops do turnstiles as it looks like a scraper using a mobile proxy. This will become more and more of an issue as more laptops have 5G modems in them. Not sure how this WAF IP fingerprinting model survives widespread CGNAT. I guess it will be an excuse to more intensely fingerprint us.
>However, CF is unnecessarily making people on 5G connections from desktops do turnstiles as it looks like a scraper using a mobile proxy. This will become more and more of an issue as more laptops have 5G modems in them.
What makes you think this is the cause, rather than something more straightforward like: CGNAT means more users are sharing IPs, and there's a higher chance that the IP pool gets contaminated by bad behavior? Apparently cloudflare tries to detect CGNAT pools and give them more leeway, but at the same time they can't give them unlimited leeway.
[1] https://blog.cloudflare.com/detecting-cgn-to-reduce-collater...
Some people think they're providing value. They do block some bot traffic (at the cost of many many false positives) and I'll bet that they're providing endless amounts of data to one or more three letter agencies. They also provide lots of value to scammers who use cloudflare to protect their phishing and malware pushing webpages from automated systems that would detect/report them while also making sure visitors have JS enabled so their browser exploits work.
It can be worse; they randomly block my uptime monitoring with 4xx and 5xx status codes once in two months or something like that, despite nothing changing.
Elaborate. Are you getting turnstile challenges? Or blocked entirely? Are you using a weird browser configuration or non-mainstream browser? Do you have extensions that lie about your software/hardware configuration?
That could be plausible deniability. I mean, CF is in fact keeping a tab on who is visiting which websites. Between them and Google, these two companies know everything about everyone.
Why did we start treating Cloudflare (a public, for-profit company) as the undisputed authority on anything related to the network layer of the internet in the first place?
Because they inserted themselves into almost everything we do online and basically managed to take control over it. Cloudflare should never have been allowed to man in the middle the entire internet, but now that they have they're the only man on earth with a dataset that size.
(I'm ex CF) This is backwards. Nobody "allowed" anything. CF serves a customer's need. You can argue with the solution but you can't argue with the core problem. It's more healthy to start the conversation of _why_ CF services are valuable.
> CF serves a customer's need
CF serves something it convinced customers they need.
Static blogs hiding behind bot protection (in some cases blocking legit users from GrapheneOS because it's difficult to fingerprint them) because someone convinced them they'll be DDoSed by bots otherwise is a loss to the Internet.
A lot of self-hosters running CF tunnels because they don't know better also contributes.
> It's more healthy to start the conversation of _why_ CF services are valuable.
Begging the question. It's what TFA is about - telling people they need CF.
> A lot of self-hosters running CF tunnels because they don't know better also contributes.
Are you saying CF documentation is better than Computer Science / Networking education resources? Why don't people know better? I thought the tunnels are mostly used to bypass NAT's.
> Static blogs hiding behind bot protection
I'm not sure what is the proportion of the static vs dynamic sites, but I would argue that for wordpress CF is adding real value.
> I thought the tunnels are mostly used to bypass NAT's.
While not free, you can do this with TCP HAProxy streams on a cheap VPS. A lot of people using them to bypass NAT don't realise that Cloudflare decrypts the traffic on the way - that's what I meant about them not knowing better.
>A lot of self-hosters running CF tunnels because they don't know better also contributes.
Of course, everyone else is incompetent except you...
It's not about being incompetent. Quite often in selfhosted subreddits and forums you will see people surprised that Cloudflare can see their traffic in plaintext.
Of course, they probably don't, but the fact that they can and that their policies now influence XX% of internet traffic is bad for the open internet.
I think it counts as "allowed" regardless of utility. CF is so massively over-weight on the internet that it's impossible to trust them with anything because if they can be forced to do something by a hostile government (hint: they can be!) then they can get away with it invisibly and affect billions of people.
That is something that should not be allowed to exist. It's one of the reasons monopolies (or even majority-opolies) are bad. It's a weapon hanging on the wall, waiting to be used.
Then I think the real question is why haven't any serious competitors emerged that can handle the essential services that Cloudflare provides?
Are there network effects like what happens with Microsoft in the business computing space? With Microsoft, I'm also aware of a great amount of anti-competitive behavior, and though I haven't seen that from Cloudflare personally and haven't heard accusations of it, I also haven't paid attention.
When I learned econ 101 in high school there was a concept of a "natural monopoly" like an electricity utility, a concept that was probably mostly post-hoc rationalization of the regulatory structures that were chosen a century ago, but it at least was a coherent narrative. I can't see any coherent narrative about Cloudflare's services being a natural monopoly. So I'm left wondering if they are just way better at what they do than anybody else, and perhaps the space isn't big enough to drive a competitor to enter it?
I hope somebody on HN has a much better explanation of this than I do.
I suspect a big part of it is that CF is running other businesses on the side, and offering basic features at a loss - they've artificially depressed the price of the service so it's hard to compete with them on only that service. It gets them a lot of customers who then buy other profitable things later.
Everyone using the free service likes that, of course, but honestly I wish we'd make it illegal to do. It's heavily used as a way to steal small markets simply by being successful in a different large one.
Seems more of a moral hazard of government intervention than of ventures whose economies of scale demand large market cap to most economically serve customer needs.
Regulation could conceivably disallow a single company to control such a significant portion of internet traffic. The parent can be interpreted as lamenting the absence of such regulation.
The only thing more annoying than people chanting "regulation is bad" is people chanting "regulation is good".
What regulation? Be specific.
CloudFlare provides significant utility to me. I chose to use them. Explain why you think someone else needs to butt into this relationship.
It's certainly a huge risk when one company's outage can take down so much of the internet. Decentralization and the ability to route around damage was a core feature of the internet. The more we depend on the internet, the less acceptable that loss becomes.
Cloudflare (and CDNs more generally) certainly started out as an own goal. Warnings were ignored and most people did the easy thing instead of the right thing.
The fent dealer just serves the customer's needs too
CF's current position in the internet is way larger than "serves a customer's need"
CF's customer is the website that elected to put CF in-between you and them
No, it's healthier to start the conversation on why we allow corporations to do bad things with excuses like "just serving customers' needs"
I don't get it, what is bad about what they are doing? People need a CDN, they choose the one they find the best.
The discussion revolves around the equivalent of taking nutrition advice from Coca-Cola's blog.
escalated quickly
They give/sell me the thing that I want, and it works for me. Multiply it by hundreds of thousands.
I’d blame the bad actors, rather than service providers that alleviate the problem.
> they inserted themselves
I am unaware of such a capability of Cloudflare.
I believe it is the site administrators who have inserted Cloudflare in between their sites and their users.
Usually it is done for rational reasons of establishing a protection against bots. But what is less rational, in my opinion, is when everyone uses the same provider for that.
Because it indirectly turns Cloudflare into a monopoly. And monopolies often converge to a state where they start to abuse their position.
The amount of people in these comments demonizing Cloudflare services or conflating their very existence with rubbish/trash/nonsense because you personally know better is wild.
Y'know what would be better? Making a site showcasing what tools are better than Cloudflare's services! And how to use them! And sharing said site so people know about them!
Cloudflare bot detection has taught me a reflex to close the tab every time I see its logo.
Yeah, I've seriously considered finding or building a CF-protected-detector browser extension to flag domains. Having one company MITMing so much traffic is straightforwardly dangerous, and not just an annoyance. We need competition.
Why so? They're all in NS already: *.ns.cloudflare.com
"why so" what? not sure what you're questioning/implying here.
I kind of do the same, not every time, but sometimes if I keep getting it on the same site, I seriously question how accurate their stats are without deep diving them more.
Switching to T-Mobile home internet has been eye-opening to me on how different the internet can be from person to person. You don't even get your own IPv4 address. Makes me realize the challenge behind blocking something like yt-dlp.
I confess a sad assumption that bot traffic has been far higher than we have admitted for a long time. Though maybe we would see different stats specifically for social media sites, where bots are used to astroturf like counts? It certainly feels like we have known for a long time that bots were a larger share of ad viewing than ad companies wanted to admit.
I don't understand what difference bots make. For me, a website (the public part) is a storefront. People walk down the street and see what's inside — that's the purpose. If something should not be available immediately, that's the private part of the store.
I've been monitoring bot traffic on digital platforms for over 10 years. Sure, the crawler share is growing, some even with malicious intentions, and those I detect and block.
I disagree that this pain is worth the cost of making real people spend their life on verification.
For ad views, the concern is specifically that people pay for clicks and views. The fact that these can be so heavily influenced by bot traffic greatly undermines their value.
Same general idea goes for any of the algorithmic driven platforms. The algorithms are ostensibly intended to surface organically discovered things by watching how people interact with things. That they are so susceptible to distortion through bot farms should be a lot more acknowledged than it is. People trust them far more than they should.
There is also a general cost of running things concern. It isn't like it is completely free to execute on bot traffic.
For ads, I believe this must be a problem for ad platform owners.
If the digital platform's storefront is their business, they could afford to spend some budget on bot detection. Bots still come from data center networks, sometimes render pages incompletely, request resources in bulk, and show enough patterns to be flagged internally.
If we look at a medium website, most random crawlers will come from Amazon, Microsoft, DigitalOcean, Hetzner, OVH, and a few other DC networks — these can be blocked easily without harming real users. The rest can be detected and cleaned up, even manually.
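A minimal sketch of that kind of network-level check, assuming you keep your own list of data center CIDR ranges. The ranges below are reserved documentation placeholders, not real provider ranges, and `is_datacenter_ip` is a hypothetical helper name; real lists would have to come from each provider's published ranges.

```python
import ipaddress

# Placeholder CIDR blocks (reserved documentation ranges), NOT real provider ranges.
# In practice this list would be filled from the providers' published IP ranges.
DATACENTER_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]


def is_datacenter_ip(addr: str) -> bool:
    """Return True if addr falls inside any of the listed data center ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in DATACENTER_RANGES)


if __name__ == "__main__":
    for candidate in ("192.0.2.10", "203.0.113.5"):
        verdict = "flag" if is_datacenter_ip(candidate) else "allow"
        print(candidate, verdict)
```

This only covers the "comes from a known DC network" signal; the other patterns mentioned (incomplete rendering, bulk requests) would need log analysis on top.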
The math is simple: 20,000 visits a day at 15 seconds each = ~83 hours a day lost watching a Cloudflare logo, just because someone doesn't want to dig into the logs. I don't buy it.
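Spelled out as a quick check, using only the two numbers from the comment (20,000 visits a day, 15 seconds each); the function name is just for illustration:

```python
def daily_wait_hours(visits_per_day: int, seconds_per_visit: float) -> float:
    """Total time spent waiting per day, converted from seconds to hours."""
    return visits_per_day * seconds_per_visit / 3600


if __name__ == "__main__":
    # 20,000 visits * 15 s = 300,000 s, which is about 83.3 hours
    print(f"{daily_wait_hours(20_000, 15):.1f} hours per day")
```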
Largely agreed, though I think you are likely underestimating how hard this is to detect. In particular, it is true that many bots can be hosted in data centers, but it is somewhat trivial to launder that traffic through other sources. Malware, in particular, is what I have in mind. Maybe I'm wrong and that has largely gone away?
There is also a bit of mixed incentives. Yes, it is the ad platform that is getting abused. But it is also the ad platform that is charging people based on abused practices.
And it isn't like this is completely made up. Just look at how Facebook killed a ton of people during the "pivot to video" programs. I don't know all of the details, as I was thankfully not in any of the involved industries, but my understanding is it is fairly well documented.
Edit: I changed an "isn't" to "is." I think I was trying to reword at one point, but left it in a way that is opposite what I meant.
When most of your server capacity is going to answering the scrapers, it matters. It's not that the stuff is hidden, it's that the storefront is being flooded with 10x as many customers as the fire code allows. And some of them go around asking your employees mindless questions. (Small forum I help moderate: we were getting hammered with what was probably some sort of AI that was taking search queries and feeding them into the forum search. Search is now registered users only.)
Well, the fun thing is that no one knows how much traffic of what kind they are getting when they use Cloudflare.
You get the numbers that Cloudflare tells you, but who knows if you can trust their stats after their CEO is apparently cherry-picking data to shape their product narrative?
That's the same CEO who just went on a wild, tone-deaf layoff justification, classifying human employees into roles of either a builder, seller, or measurer and saying he wants to get rid of everyone that "measures" the business...
I wouldn't trust a single thing coming out of his mouth.
Do people really expect CEOs to be knowledgeable about any technical details in 2026? My experience is that CEOs are getting increasingly out of touch with what their employees actually do and what their customers want.
I love when smart people catch liars.