How Bingbot Works: Discovering, Crawling, Extracting & Indexing

Originally Published on SearchEngineJournal April 15, 2020 (Jason Barnard)

Learn how Bingbot works during the discovery, crawling, extracting and indexing stages from Fabrice Canel, Bing’s Principal Program Manager.

Here’s a recap of my interview with “Bingbot boss” Fabrice Canel (officially: Bing’s Principal Program Manager).

Canel is in charge of discovering all the content on the web, selecting the best, processing it and storing it – a phenomenal responsibility, as it turns out (read on).

It Seems Safe to Assume That Googlebot Functions in Much the Same Way

Bingbot and Googlebot don’t function in exactly the same way down to the tiniest detail. But they are close enough in these respects:

The process is exactly the same: discover, crawl, extract, index.
The content they are indexing is exactly the same.
The problems they face are exactly the same.
The technology they use is the same.

Only the details of exactly how they achieve each step will differ.

But Canel confirms that they are collaborating on Chromium and standardizing the crawling and rendering.

All of that makes anything Canel shares about how Bingbot discovers, crawls, extracts and indexes very insightful and super-helpful.

Discovering, Crawling, Extracting & Indexing Is the Bedrock of Any Search Engine

Obvious statement, I know.

But for me, what stands out is the extent to which this process underpins absolutely everything that follows.

Not only does a great deal of content get excluded before even being considered by the ranking algos, but badly-organized content has a significant handicap both in the way it is indexed and also in the way algos treat it.

Great organization of content in logical, simple blocks gives an enormous advantage all the way through the process – right up to selection, position and how it displays in the SERPs.

Well-structured and well-presented content rises to the top in a mechanical manner that is simple to grasp and deeply encouraging.

Discovering & Crawling

Every day, Bingbot finds 70 billion URLs that it has never seen before.

And every day it has to follow all the links it finds, and crawl and fetch every resulting page, because until it has fetched a page it has no idea whether the content is useful.

Pre-Filtering Content

And there’s the first interesting point Canel shares.

The filtering starts here.

Pages that are deemed to have absolutely no potential for being useful in satisfying a user’s search query in Bing results are not retained.

So a page that looks like spam or duplicate or thin never even makes it into the index.

But more than rejecting spammy pages, Bingbot tries to get ahead of the game by predicting which links are likely to take it to useless content.

To predict whether any given link leads to content that is likely to be valuable or not, it looks at signals such as:

URL structure.
Length of URL.
Number of variables.
Inbound link quality.
And so on.
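
To make the first three signals concrete, here is a minimal Python sketch that pulls them out of a URL. The function name, the feature names and the example URLs are all hypothetical, and nothing here reflects how Bingbot actually weighs these signals (Canel does not go into that); inbound link quality is left out because it cannot be read from the URL itself.

```python
from urllib.parse import urlsplit, parse_qsl

def url_signals(url: str) -> dict:
    """Extract simple, illustrative signals from a URL (hypothetical feature set)."""
    parts = urlsplit(url)
    # Path segments give a rough idea of URL structure and depth.
    segments = [s for s in parts.path.split("/") if s]
    # Query-string parameters are the "variables" in the URL.
    params = parse_qsl(parts.query, keep_blank_values=True)
    return {
        "length": len(url),
        "path_depth": len(segments),
        "num_variables": len(params),
    }

# Hypothetical URLs: one short and descriptive, one long and parameter-heavy.
clean = url_signals("https://example.com/blog/how-bingbot-works")
messy = url_signals("https://example.com/index.php?id=12&sess=abc&sort=asc&page=4")

print(clean)
print(messy)
```

A real system would presumably feed many more features than these into a trained model. The sketch only shows that such signals are cheap to compute before a page is ever fetched.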

A link that leads to useless content is referred to as a “dead” link.

As machine learning improves, fewer of these dead links will be followed, fewer useless pages will slip through this early filter and the index will improve.

The algos will have to deal with less “chaff”, meaning it is easier for them to identify the best content and put that in front of Bing’s clients.

Importantly, Bing has a heavy focus on:

Reducing crawling, rendering, and indexing of chaff (saving money).
Reducing carbon emissions (Canel insists heavily on this).
Improving the performance of the ranking algorithms.
Generating better results.

Links Remain Key to Discovery

The biggest signal that a page is not valuable is that there are no inbound links.

Every page needs at least one inbound link – obviously, that link does not need to be from a third party – it can be an internal link.

But, Once Discovered, They Are Not Needed Since Bingbot Has a ‘Memory’

Bingbot retains every URL in memory and comes back and recrawls intermittently, even if all links to it have been removed.

This explains why Bingbot (and Googlebot) come back and check deleted pages that have no inbound links, even months after the page and all references to it have been removed.

I have had this exact situation on my site – old pages that I deleted 5 months ago coming back to haunt me (and Bing and Google!).

Why?

Because Bing considers that any URL may suddenly come back to life and become valuable – for example:

Parked domains that become active.
Domains that change ownership and spark into life.
Broken links on a site that are corrected by the owner.

URL Lifecycles Are a ‘Thing’ at Bing

There is a limit: what Canel calls the “lifecycle.”

Once that lifecycle is completed, the URL will no longer be crawled from memory – it can be revived via the discovery of an inbound link, a reference in an RSS feed or a sitemap, or submission through their API.

Canel is insistent that RSS feeds and sitemaps are vital tools that help us to help Bingbot and Googlebot not only discover new and revived content but also crawl “known” content efficiently.

Better still, use the indexing API, since that is much more efficient both in helping them discover content and in reducing wasted / redundant crawling, thus reducing carbon emissions.

He speaks more about that in this episode of the podcast.
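
As an illustration of the sitemap route, here is a minimal Python sketch that builds a sitemap in the standard sitemaps.org XML format using only the standard library. The URLs and dates are hypothetical, and the helper name build_sitemap is my own. The indexing API is deliberately not shown, since its details are not covered in the interview.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """Build a sitemap from an iterable of (url, lastmod 'YYYY-MM-DD') tuples."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # lastmod tells a crawler whether "known" content has changed.
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical pages for illustration only.
sitemap_xml = build_sitemap([
    ("https://example.com/", "2020-04-15"),
    ("https://example.com/blog/how-bingbot-works", "2020-04-10"),
])
print(sitemap_xml)
```

Keeping lastmod accurate is the part that helps with efficient recrawling: a bot can skip pages that have not changed since its last visit.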

Extracting

I’m a fan of HTML5.

Turns out that, although theoretically super-useful because it identifies the role specific elements of a page play, HTML5 is rarely implemented well.

So, although it should give structure and semantics that help bots extract information from a page, more often than not, it doesn’t.

John Mueller from Google suggested that strict HTML5 wasn’t necessarily very useful to bots for exactly that reason.

Canel is categorical that any standardized structure is helpful.

Using heading tags correctly to identify the topic, sub-topics, and sub-sub-topics is the least you can do.

Using tables and lists is also simple yet powerful.

Sections, asides, headers, footers and other semantic HTML5 tags DO help Bingbot (and almost certainly Googlebot) and are well worth implementing if you can.
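
One way to check your own heading structure is to extract the outline and look for skipped levels. The sketch below uses Python’s standard html.parser; the class and function names are hypothetical, the sample HTML is invented, and it is a self-check for site owners rather than a description of what Bingbot does.

```python
from html.parser import HTMLParser

class OutlineParser(HTMLParser):
    """Collects (level, text) for h1..h6 headings in document order."""
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.outline = []
        self._level = None   # level of the heading currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._level = int(tag[1])
            self._text = []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._level is not None:
            self.outline.append((self._level, "".join(self._text).strip()))
            self._level = None

def skipped_levels(outline):
    """Return headings that jump more than one level deeper than the previous one."""
    problems, previous = [], 0
    for level, text in outline:
        if level > previous + 1:
            problems.append((level, text))
        previous = level
    return problems

# Invented sample: the h4 skips the h3 level.
html = """
<h1>How Bingbot Works</h1>
<h2>Discovering</h2>
<h4>Pre-Filtering Content</h4>
<h2>Extracting</h2>
"""
parser = OutlineParser()
parser.feed(html)
outline = parser.outline
problems = skipped_levels(outline)
print(outline)
print(problems)
```

Here the h4 is flagged because it follows an h2 directly, which breaks the topic, sub-topic, sub-sub-topic hierarchy.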

Quick word on HTML tables.

They are a very powerful way to structure data – just stop using them for design.

Over 80% of tables on the web are used for design, but tables are for presenting data, not for design… and that is very confusing for a machine. (Canel uses the term distracting, which I love because it makes the Bot more human.)

Do Bingbot a favor and use a table to present data such as the planets in the solar system.

Use DIV and CSS to position content within the layout of the page.
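
As a small illustration of a table used purely for data, here is a Python sketch that renders the planets as an HTML table with a caption and proper header cells. The helper name data_table and the choice of columns are my own assumptions; positioning and styling would be left entirely to CSS.

```python
from html import escape

# Planet names in order from the Sun, with their broad type.
PLANETS = [
    ("Mercury", "terrestrial"), ("Venus", "terrestrial"),
    ("Earth", "terrestrial"), ("Mars", "terrestrial"),
    ("Jupiter", "gas giant"), ("Saturn", "gas giant"),
    ("Uranus", "ice giant"), ("Neptune", "ice giant"),
]

def data_table(caption, headers, rows):
    """Render rows as a semantic data table: caption, thead with th cells, tbody."""
    head = "".join(f'<th scope="col">{escape(h)}</th>' for h in headers)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(cell))}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return (
        f"<table><caption>{escape(caption)}</caption>"
        f"<thead><tr>{head}</tr></thead><tbody>{body}</tbody></table>"
    )

rows = [(i, name, kind) for i, (name, kind) in enumerate(PLANETS, start=1)]
table_html = data_table(
    "Planets of the solar system",
    ["Order from the Sun", "Planet", "Type"],
    rows,
)
print(table_html)
```

The markup carries no layout information at all, only the data and its headers, which is the distinction Canel is asking for.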

But any systemization of structure is worth considering.

If you build a bespoke CMS, use HTML5 to help bots “digest.”

Otherwise, any off-the-shelf CMS helps make extracting easier for the bots.

With standard CMS systems, they see the same overall structure time and again, and that repetition is exactly what machine learning can get to grips with best.

So it is well worth considering building your site with a popular CMS such as Joomla, Typo3, or WordPress.

From the point of view of helping bots extract content from your pages, WordPress is obviously the best candidate since over 30% of sites are built using WordPress.

The Bot sees the same basic structure on one in three sites it visits.

And that leads nicely onto …

Bots & Machine Learning

It is important to remember that machine learning drives every single step in the discovery-crawling-extraction-indexing process. So machine learning is the key.

A deep understanding of the pages (Canel’s term) and an intelligent, evolving system for extracting is key for Bing, for Google, and for website owners.

In order to best extract and index your content, a bot needs patterns in the underlying HTML code.

So a big advantage for us all is to work hard to ensure that our own links, site structure, page structure, and HTML are all consistent… and if possible, consistent with standards that also apply outside our own site.

But… All Sites Will Be the Same

It might seem that building a site with the same structure as multiple other sites across the web means they will all blend into each other. That isn’t the case.

Design is independent of HTML structure. And that is exactly the point of HTML5 – to disassociate the design from the semantics. This article covers that point.

Structure is not going to be exactly the same (very small sites with just half a dozen pages excepted).

And even if it is, in truth, why would that matter?

Content you create is unique (one would hope). As such, even when talking about the same topic, no two brands will say the same thing.

So, if you use WordPress, and choose a popular theme you will tick all the boxes for the bots… and yet your design, structure, and content will still be unique for your audience.

You win on both fronts.

In short, unless you are a major company with a large budget, sticking to a popular template on a common CMS will often be a good choice since, because they are common, these will be natively understood by all search engines.

Your content is unique, and you can make the visual presentation completely unique using simple CSS.

Just remember to stick to CSS standards and don’t mess with the CMS core or underlying HTML so as not to confuse Bingbot and Googlebot.

Google & Bing Collaborate

Both bots use Chromium. It is important to remember that Chromium is an open-source browser that underpins not only Chrome but also Opera… and some other browsers.

In this context, the important part is that Bingbot not only switched to the Chromium version of Edge in late 2019, but also followed Googlebot in becoming evergreen.

More than that, Canel says Bing and Google are now working closely together on Chromium. It is strange to imagine. And easy to forget.

Canel suggests that it is in both companies’ interest to collaborate – they are trying to crawl the exact same content with the same goal.

Given the scale (and cost), they have every interest in standardizing (that word just keeps coming back!).

They cannot expect website owners to develop differently for different bots. And now, after all these years, that appears to be a reality.

Two major crawlers, both using the same browser and both evergreen. Did developing websites just get a lot easier?

Bingbot’s adoption of Edge will make life easier for the SEO community since we’ll only have to test rendering once.

If a page renders fine in Edge, it will render fine in Chrome, it will render fine for Googlebot and it will render fine for Bingbot. And that is wonderful news for us all.

For info, since January 15, 2020, the publicly distributed version of Microsoft’s browser Edge is built on Chromium.

So, not only are our browsers now mostly built on the same basic code, both major search engine bots are, too.

Extracting for Rich Elements

The growth of rich elements/Darwinism in search was the starting point of this series.

And one thing that I really wanted to understand is how that works from an indexing point of view.

How do Bing and Google maintain at scale an indexing system that serves all these SERP features?

Both bots have become very good at identifying the parts / chunks / blocks of a page, and figuring out what role they play (header, footer, aside, menu, user comments, etc.).

They can accurately and reliably extract specific, precise information from the middle of a page, even in cases where the HTML is badly organized (but that’s not an excuse to be lazy).

Once again, machine learning is essential.

It is the key to their ability to do this. And that is what underpins the phenomenal growth in rich elements we have seen these last few years.

It can be useful to take a step back and look at the anatomy of the SERPs today compared to a decade ago.

Rich elements have taken a major place in modern SERPs – to the point at which it is hard to remember the days when we had SERPs with just 10 blue links… featureless SERPs.

Indexing / Storing

The way Bingbot stores the information is absolutely crucial to all of the ranking teams.

Every algo relies on the quality of Bingbot’s indexing to provide information they can leverage into the results.

The key is annotation.

Canel’s team annotates the data they store.

They add a rich descriptive layer to the HTML.
They label the parts: heading, paragraph, media, table, aside, footer, etc.

And there is the (very simple) trick that allows them to extract content in an appropriate, often rich, format from the middle of a page and place it in the SERP.
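
To give a feel for what labeling the parts of a page might look like, here is a toy Python sketch that walks some HTML and attaches one of the labels listed above to each block. It is emphatically not Bing’s implementation: the tag-to-label mapping, the class name and the sample page are all assumptions, and the real annotation layer is driven by machine learning rather than a lookup table.

```python
from html.parser import HTMLParser

# Hypothetical mapping from HTML tags to block labels. The labels come from
# the list above; the mapping itself is an assumption for illustration.
LABELS = {
    "h1": "heading", "h2": "heading", "h3": "heading",
    "p": "paragraph", "img": "media", "video": "media",
    "table": "table", "aside": "aside", "footer": "footer",
}

class BlockAnnotator(HTMLParser):
    """Records a labeled block for each known tag, with its text."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # list of {"label": ..., "text": ...}
        self._open = []    # stack of (tag, index into self.blocks)

    def handle_starttag(self, tag, attrs):
        label = LABELS.get(tag)
        if label is None:
            return
        self.blocks.append({"label": label, "text": ""})
        if tag != "img":  # img is a void element and never gets an end tag
            self._open.append((tag, len(self.blocks) - 1))

    def handle_endtag(self, tag):
        if self._open and self._open[-1][0] == tag:
            self._open.pop()

    def handle_data(self, data):
        # Text is attributed to the innermost open labeled block.
        if self._open and data.strip():
            block = self.blocks[self._open[-1][1]]
            block["text"] = (block["text"] + " " + data.strip()).strip()

# Invented, well-formed sample page.
page = """
<h1>The Solar System</h1>
<p>Eight planets orbit the Sun.</p>
<img src="planets.png" alt="The planets">
<table><tr><th>Planet</th></tr><tr><td>Mars</td></tr></table>
<aside><p>Pluto was reclassified in 2006.</p></aside>
<footer>Example footer</footer>
"""
annotator = BlockAnnotator()
annotator.feed(page)
labels = [b["label"] for b in annotator.blocks]
print(labels)
```

Note that the sketch only works because the sample HTML is clean and every block is closed properly. That is the practical point: the more predictable the markup, the simpler and more reliable this kind of labeling becomes.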

Standards Are the Key to Effective Labeling

Handy hint: from what Canel said earlier, if your HTML follows a known system (such as rigorously correct HTML5 or Gutenberg blocks in WordPress), then that labeling will be more accurate, more granular and more “useable” to the different rich elements.

And, because your content is more easily understood and more easily accessed and extracted from the index, that gives your content a decided advantage right out of the starting gate.

Rich Annotations

Canel uses the term “rich” and talks about “adding a lot of features” which strongly implies that this labeling/annotation is extensive.

Bingbot has an enormous influence on how content is perceived by the ranking algorithms.

Their annotation makes all the difference in the world to how your content is perceived, selected and displayed by the different SERP feature algos.

If your content is inadequately annotated by Bingbot when it is indexed, you have a very serious handicap when it comes to appearing in a SERP – whether it be blue links, featured snippets, news, images, videos…

So, structuring your content at block level is essential.

Using a standardized, logical system and maintaining it throughout your site is the only way to get Bingbot to annotate your content in usable blocks when it stores the page in the database…

And that is the bedrock of whether a chunk of content lives or dies in the SERPs – both in terms of being considered as a potential candidate, but also how and when it is displayed.

Every Result Be It Blue Link or Rich Element Relies on the Same Database

The entire system of ranking and displaying results, whatever the content format or SERP feature, depends on Canel’s team’s understanding of the internet, processing of the internet, and storing of the internet.

There are not multiple discovery, selection, processing or indexing systems for the featured snippet / Q&A, videos and images, news carousels, etc.

Everything is combined together and every team extracts what it needs from that one single source.

The ability of each candidate set to select, analyze and present its list of candidates to the whole page team depends on the annotations Bingbot adds to the pages.

Darwinism in Search Just Got More Interesting

Yes, the ranking algos are Darwinistic as Gary Illyes described, but content in some pages has a seriously heavy advantage from the get-go.

Add Handles to Give Your Content an Unfair Advantage

My understanding is that the “rich layer of annotations” Canel talks about are the handles Cindy Krum uses in her Fraggles theory.

If we add easy-to-identify handles in our own HTML, then the annotations become: more accurate, more granular, and significantly more helpful to the algorithms for the different candidate sets.

HTML “handles” on your content will give it a head start in life in the Darwinistic world of SERPs.
