
I love Google Search Console (GSC) because it is the most accurate SEO tool.

Unlike third-party solutions, the data in GSC comes directly from Google. Nothing is extrapolated: you see what Google wants you to see. But is it always accurate?

Whenever you read “Discovered — currently not indexed”, what Google should have written is “Detected but not currently crawled”. It is crucial to remember this as you go through this article, and you’ll understand why by the end.

While Google has always denied the existence of a “crawl budget” on its end, I’ll show you that there’s a similar internal metric Google uses to decide how to crawl each website!

Before any page can be ranked in SERPs, your content has to go through 3 phases:

1. Discovery

2. Crawling 

3. Indexing

Discovery means Google is aware that those pages exist. The URLs have been detected, which is the first step. Google acknowledges their existence; however, calling this the “pre-crawling” phase would be inaccurate. Google will arbitrarily decide whether what was discovered deserves to be crawled. More on this later, and you’ll understand why Google might suddenly and violently change its mind.

Crawling means you’ve gone beyond discovery/detection: Googlebot has fetched and processed your content, but it does not yet exist in Google’s index. Google will arbitrarily decide later on whether it deserves to be indexed.

Indexing is easy to understand: your content has successfully gone through the discovery and crawling phases. You’ve been detected and crawled and now exist inside Google’s index. 

Now, what could a sudden drop in pages reported as “Discovered — currently not indexed” mean?

REMEMBER: Google isn’t accurate here because no content can jump from “discovery” to “indexing” without going through the crawling phase!

When you have a lot of pages that have been discovered but are still waiting to be crawled, there are three possibilities:

1. Server Capacity (bandwidth, CPU load, limited RAM, etc.)

2. Web Server Issues (Apache/Nginx configuration problems, unexpected HTTP response codes, etc.; log file analysis is always recommended, and you’ll find a sketch after this list)

3. Website Quality (after a few months of observation and analysis, Google assigns an invisible, confidential internal score to each indexed website based on overall site quality, including content, design, layout, etc.)
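
If you suspect causes 1 or 2, your access logs will tell you quickly. Below is a minimal log-analysis sketch in Python, assuming a standard combined Apache/Nginx log format and a hypothetical access.log path; it simply counts the HTTP status codes returned to Googlebot (a production check should also verify Googlebot hits via reverse DNS).

```python
# Minimal sketch: summarize how your server answers Googlebot, assuming a
# standard combined Apache/Nginx access log and a simple user-agent match.
# LOG_PATH is a hypothetical path; adjust it (and the regex) to your setup.
import re
from collections import Counter

LOG_PATH = "access.log"  # hypothetical path

# Combined log format:
# IP - - [date] "METHOD /path HTTP/x.x" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'"\w+ (?P<path>\S+) HTTP/[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

status_counts = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = LINE_RE.search(line)
        if match and "Googlebot" in match.group("ua"):
            status_counts[match.group("status")] += 1

for status, count in status_counts.most_common():
    print(f"{status}: {count} Googlebot requests")
```

A healthy site returns mostly 200s; spikes of 5xx or 429 responses are exactly the kind of overload signal that keeps pages stuck in the discovered-but-not-crawled state.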

Google has always maintained that “discovered but not indexed” pages meant that Googlebot wanted to crawl the URLs but postponed the crawl because your website might be overloaded. In other words, the problem was on your side, not theirs! Google almost always points to your back end.

Technical SEOs listened and focused on fixing the back end by tuning the server and web server, which is absolutely necessary.

New SEO chapter: AI-generated content is rewriting Google’s crawling rules

November 30, 2022 was the day OpenAI released ChatGPT, a chatbot built on a Large Language Model. Since then, generating content has been easier than ever before. Content production reached a new scale.

Suddenly, I saw programmatic SEO becoming almost synonymous with AI-generated content. Everybody started “injecting” an unbelievable amount of new textual content into websites, and initially, nearly every web page was discovered, crawled, and indexed.

In 2020, 2 years before the AI content explosion, Google had about 400 billion documents in its index.

Google antitrust trial: about 400 billion documents (web pages) in Google’s index.

In 2024, there are about 200 million active websites worldwide with hundreds of billions of web pages, yet I expect fewer and fewer web pages to be indexed in the future. Here’s why: when so much low-quality content is produced artificially and copy-pasted onto web pages (if not injected into websites automatically through APIs), it has major implications for Google.

Tasks such as crawling and indexing suddenly became technical and financial burdens. I bet Google didn’t expect this at the start of 2022!

Google has a secret Crawl Budget for each indexed site

Google has historically denied the existence of a “Crawl Budget” on its end, claiming it was a made-up expression created by the SEO community.

I’ll show you that Google not only has a varying crawl budget dedicated to each indexed website but that, because of the commoditization of content created by generative AI models, the rules of the crawling game have changed!

Before generative AI became a thing, the internet was already full of trash. More than ever before, Google’s challenge is to avoid indexing low-quality content.

Today, while pages are still being “discovered”, many are purged before reaching the crawling phase. Google is taking out the programmatic trash before it hits the fans (which means “us”, the users). I’ve seen quality content suffer from this “unfair” process because the detection algorithms aren’t perfect.

Within Alphabet Inc, there’s what’s known as the Googlebot budget. Google tells its Googlebot crawler how much time it is allowed to spend on each website, and this time budget is determined by a confidential website quality score.

You may think the Googlebot budget is a fixed number of URLs per day, but that is not how it works. For Google, there’s no URL limit per website, but there is a time limit.

When Googlebot crawls websites, each millisecond has a cost in bandwidth, server resources, electricity, and so on.

Within Google, crawling is measured using the number of queries per second.

Based on your overall site quality score, Google decides to spend a given number of seconds on your site. The good news is that this quality score isn’t set in stone.

The bad news? Even after you raise the bar, it takes months for Google to generate a new quality score. Yes, there’s a long lag!

Crawling, fast and slow

Whatever the pre-allocated number of crawling seconds dedicated to your website, the slower your server, the fewer Googlebot queries you’ll be able to respond to. The faster your server, the more queries you’ll answer per second.

The quicker your server replies to Googlebot’s queries, the more detailed and accurate the crawling picture on Google’s end. A server that responds quickly increases the total amount of content crawled (and the probability of having more quality content indexed and later ranked).

The faster your server responds to Googlebot, the higher the number of queries per second, and the more content gets crawled.

This has a significant implication for crawling: a slow server today means less content will be crawled on your website. But improve your server’s performance, and if Googlebot notices during its next visit that you are responding much faster (answering many more queries per second), it will conclude that your server is getting healthier. Even when Googlebot isn’t spending more seconds crawling you, responding faster to its queries increases the amount of content crawled.

You should now understand that two crucial parameters will impact how Googlebot crawls your site (and yes, you can influence both):

  1. The confidential website quality score dictates how many seconds Googlebot will spend crawling your site.
  2. The speed at which your server responds to Googlebot’s queries will influence the amount of content being crawled (see the sketch right after this list).
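
To make the interaction between those two parameters concrete, here is a toy model in Python. It is not Google’s actual formula; the time budget and response times are invented purely for illustration.

```python
# Toy model, NOT Google's actual formula: how many URLs fit into a fixed
# crawl-time budget at different average response times. All numbers are
# invented purely for illustration.

def urls_crawled(time_budget_s: float, avg_response_s: float) -> int:
    """Approximate URLs Googlebot can fetch sequentially within its time budget."""
    return round(time_budget_s / avg_response_s)

TIME_BUDGET_S = 600  # hypothetical: 10 minutes of Googlebot time per visit

for label, avg_response_s in [("slow server", 1.2), ("fast server", 0.2)]:
    print(f"{label}: ~{urls_crawled(TIME_BUDGET_S, avg_response_s)} URLs per visit")
# slow server: ~500 URLs per visit
# fast server: ~3000 URLs per visit
```

Same time budget, roughly six times more URLs crawled per visit, simply because the server answers faster.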

You have to understand that there’s one Googlebot budget for each site, and it covers everything: your HTML, text, images, JavaScript, and PDFs. You should be especially careful with JavaScript because Google needs to render everything to see the final version of your page (server-side rendering vs. client-side rendering)!

If you overload your site with JavaScript, you’ll force Googlebot to spend its time budget on rendering rather than on your actual content such as text, images, etc.

If you use a lot of JavaScript, compare the crawl budget spent on JS vs HTML.
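
One way to approximate that comparison is from your server logs: split Googlebot requests by resource type. Here is a minimal sketch, again assuming a combined-format access log at a hypothetical path; if your log format records response time, sum that per type instead of counting requests.

```python
# Minimal sketch: split Googlebot requests by resource type to see where the
# crawl budget goes. Assumes a combined-format access log at a hypothetical
# path; if your log format records response time, sum that per type instead.
import re
from collections import Counter

LOG_PATH = "access.log"  # hypothetical path
LINE_RE = re.compile(r'"\w+ (?P<path>\S+) HTTP/[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"')

def resource_type(path: str) -> str:
    """Rough bucketing by file extension; anything else is treated as a page."""
    path = path.split("?", 1)[0].lower()
    if path.endswith(".js"):
        return "JavaScript"
    if path.endswith(".css"):
        return "CSS"
    if path.endswith((".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg")):
        return "Image"
    if path.endswith(".pdf"):
        return "PDF"
    return "HTML/other"

hits = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = LINE_RE.search(line)
        if match and "Googlebot" in match.group("ua"):
            hits[resource_type(match.group("path"))] += 1

for rtype, count in hits.most_common():
    print(f"{rtype}: {count} Googlebot requests")
```

If JavaScript and other assets dominate the count on a content site, rendering is likely eating into the time budget that could otherwise be spent fetching your actual pages.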

Google’s discovery, crawling and indexing

We’ve reached a point where Google has to reconsider its traditional processes of discovery, crawling, and indexing because too much is too much! My wild guess is that Google recently updated the detection algorithms it uses before crawling and indexing.

Today, if Google does not deem discovered pages worthy of crawling and indexing, hundreds of thousands (sometimes millions) of them might be purged, and this will be visible in your Search Console data.

I believe Google’s new approach is to flag programmatic content, group those pages into clusters, and act directly at the cluster level.

Don’t forget that content quality is a metric that affects overall site quality

If your content is flagged as programmatically generated by AI models, you might see a violent drop in “Discovered — currently not indexed” web pages.

If you see this happening to an established brand, I suggest being very careful because it could be the start of an upcoming site-wide penalty.

Because content quality impacts the overall website quality, trashy content could lead to a low site quality score and increase the risk of future penalties at a domain level. If that happens, even high-quality human-written content might suffer from unfair penalties. Not a game established brands should play!

As I explained earlier, this never happens in real time because Google usually takes 2 to 6 months (sometimes even longer on large websites / enterprise SEO) before reaching a conclusion about the overall website quality.

The site quality score is a lagging indicator: what you do today will be reflected in the quality score months later, impacting the Googlebot time budget allocated to your website.

My conclusion and summary

If Google discovers a lot of your pages but refuses to crawl them, it is usually because of:

  • Technical issues (at the server or web server level, rarely front-end) 
  • Content quality issues (spammy content or unhelpful content)

We’ve reached a point where this flood of programmatic, AI-generated content has lowered the bar in terms of quality but raised the bar in terms of quantity.

This unexpected change in the ability to produce content faster than ever before has forced Google to update its traditional discovery and crawling approach.

For Google, each millisecond spent crawling content means money spent.

Google’s budget isn’t limitless because its goal is to maximize profit. Google’s approach is simple: “We crawl what we believe should be indexed and ranked.”

Finally, remember that Alphabet Inc makes most of its money not because Google ranks high-quality content at the top of SERPs but because it sells contextual ads that compete with the highest-ranking organic results!

Google’s organic SERPs are seeing more low-quality programmatic content. If this situation were to continue, it would negatively impact Google’s primary revenue source: targeted ads generated according to user search intent.

Follow the money to understand Google’s new behavior regarding discovered pages and crawling!

PS: If you want to see what real programmatic SEO is, read the long article I wrote last year.
