The Knowledge Panel Course: The Google Knowledge Extraction Algorithm
Script from the lesson The Knowledge Panel Course
Jason Barnard speaking: Hi and welcome. The Knowledge Extraction Algorithm is part of the crawling and indexing process. We can consider that it is part of Googlebot.
Jason Barnard speaking: Googlebot discovers, crawls, and indexes content. Before it adds content to the index, it breaks that content down into its constituent parts and annotates them. It attempts to identify headings, menus, images, videos, Schema Markup, and so on so that it can label what type of information it is. When it annotates, it adds a confidence score that indicates its level of confidence that the annotation is correct.
Jason Barnard speaking: All the other algorithms that use the Web Index rely heavily on these annotations to extract the information that is relevant to them. Without these annotations, the other algorithms simply cannot see the information. That makes this the single most important algorithm to focus on: if it can’t annotate your content, your information is, to all intents and purposes, invisible to the other algorithms.
Jason Barnard speaking: Some web content is structured in a manner that suits the needs of the Knowledge Graph, Schema.org Markup, for example. This explicit and standardised structure is very easy for the algorithms to annotate with a high level of confidence.
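As an illustration, a minimal Schema.org block for a person entity might look like the sketch below. Every name and URL here is an invented placeholder, not a recommendation of specific values.

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Person",
  "name": "Jane Example",
  "url": "https://janeexample.com/",
  "jobTitle": "Musician",
  "sameAs": [
    "https://en.wikipedia.org/wiki/Jane_Example",
    "https://www.linkedin.com/in/janeexample"
  ]
}
</script>
```

The explicit `@type` and property names are precisely the kind of standardised structure that an algorithm can annotate with high confidence.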
Jason Barnard speaking: When a web page implements strict semantic HTML5 correctly, with headings, HTML tables for data only, lists, et cetera, the data can also be considered structured. And in this case, it is relatively easy for the algorithms to annotate with a high level of confidence.
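To illustrate, here is a hypothetical fragment using strict semantic HTML5: a real heading element, a data table with header cells, and a list, rather than styled divs. The content is invented for the example.

```html
<h2>Discography</h2>
<table>
  <thead>
    <tr><th>Album</th><th>Year</th></tr>
  </thead>
  <tbody>
    <tr><td>First Album</td><td>1998</td></tr>
    <tr><td>Second Album</td><td>2001</td></tr>
  </tbody>
</table>

<h2>Roles</h2>
<ul>
  <li>Guitarist</li>
  <li>Songwriter</li>
</ul>
```

Because each element declares its own role, the machine does not have to guess which text is a heading, which is tabular data, and which is a list of equivalent items.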
Jason Barnard speaking: When semantic HTML5 is only partially or inconsistently used, the data becomes semi-structured, which helps with annotation and structured indexing but generally leaves the Knowledge Extraction Algorithm with some guesswork. Annotations will be tagged with lower confidence scores in these cases.
Jason Barnard speaking: But the vast majority of information Googlebot collects is unstructured, which means that the Knowledge Extraction Algorithm needs to make a guess and try to create structure before annotating and adding the information to the Web Index. Annotations will be tagged with very low confidence scores, which will penalise them in terms of findability and prioritisation for the other algorithms.
Jason Barnard speaking: In all cases, Google will use more than just the HTML tagging and Schema. It analyses each chunk of written content using, amongst other techniques, natural language processing, image analysis, and video analysis.
Jason Barnard speaking: Websites are messier and less well organised than most of us tend to imagine. Although we tend to think that we organise sites, web pages, and content systematically and logically, that is rarely the case within one site run by a few people, let alone across billions of sites run by hundreds of millions of people with different approaches, different systems, and personal quirks. So, the algorithm is faced with annotating a huge, inconsistent mess.
Jason Barnard speaking: Structured data is therefore, obviously, a huge help to the Knowledge Extraction Algorithm, and it gives you an advantage. Having a messy, unstructured page with inconsistent headings is a huge disadvantage.
Jason Barnard speaking: It’s important to note that the Knowledge Extraction Algorithm is looking at the individual elements of a page: the headings, the menu, the text, the images, the videos, HTML tables, lists, Schema Markup, and so on. It is attempting to break the page content down into manageable chunks and accurately annotate each chunk. The aim is for the annotated chunks in the index to provide easy-to-identify pieces that are meaningful and helpful to all the other algorithms: the algorithms for the different results such as the blue links, the images, the videos, the news, and, of course, the other Knowledge Algorithms.
Jason Barnard speaking: In a web page that is spectacularly badly organised, the Knowledge Extraction Algorithm will either not annotate at all or perhaps get the annotation wrong. In any case, if the Knowledge Extraction Algorithm is not confident in its understanding of the information, its role in the page, and its structure, it will attribute a low confidence score to the annotation. This is a major disadvantage for that information when being considered by the other Knowledge Algorithms.
Jason Barnard speaking: And a huge point here: without structured annotation, the other Knowledge Algorithms cannot use the data at all. With lower confidence scores on an annotation, the other Knowledge Algorithms are less likely to use that information.
Jason Barnard speaking: So, you need to make every effort to package your content well for the Knowledge Extraction Algorithm. If you don’t, your content might not even get into the game.
Jason Barnard speaking: Giving your content explicit structure using Schema.org Markup, HTML tables, headings, sections, et cetera becomes fundamentally important to feeding the Knowledge Graph with the facts about and around your entity. So, you need to think about what you can do to help the bot understand what role each piece of content plays on a page and, importantly, how far you can push your efforts in order to encourage the Knowledge Extraction Algorithm to attribute a high confidence score to the annotations it gives to important information about the entity.
Jason Barnard speaking: For example, the H1 describes the main topic of the page. On the Entity Home, the topic is the entity itself, so the H1 must contain the entity name, obviously. All H2s are subtopics. On the Entity Home, the subtopics are the different aspects of the entity. The H3s are sub-subtopics, and so on. An ordered list provides steps or assigns relative importance to each point. An unordered list is neutral. Alt tags, captions, titles, and other image optimisations are a huge help to Googlebot too.
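Putting that hierarchy together, an Entity Home for a hypothetical person might be structured like this. All names, dates, and file paths are invented for the example.

```html
<h1>Jane Example</h1>

<h2>Biography</h2>
<h3>Early career</h3>
<p>Jane Example began performing in 1995.</p>

<h2>Discography</h2>
<ol>
  <li>First Album (1998)</li>
  <li>Second Album (2001)</li>
</ol>

<img src="jane-example-stage.jpg"
     alt="Jane Example performing on stage in 2020">
```

The H1 names the entity, each H2 names an aspect of the entity, the ordered list implies chronology, and the alt text tells the machine what the image shows.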
Jason Barnard speaking: In short, use semantic HTML5. There are links to two articles about best practices in the additional materials of this lesson. Some of the techniques these articles explain in detail are: using headings and subheadings correctly, using HTML tables and lists where appropriate, using em tags rather than bold, and much, much more. The most important thing they teach is that you need to disassociate the design from the semantics. Read the articles and upgrade your entity-focused content to HTML5.
Jason Barnard speaking: Any opportunity to provide Googlebot with helpful clues is worth taking. Don’t shy away from becoming granular. Look at your HTML in detail and get rid of every piece that is liable to confuse the machine. Use strict validation to be sure.
Jason Barnard speaking: Potentially add descriptive class names to your CSS. That can help.
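For instance, a descriptive class name can act as one more weak clue about a block’s role. The class names and content below are purely illustrative.

```html
<!-- Opaque class name: no clue about the content's role -->
<div class="box-17">Jane Example is a musician.</div>

<!-- Descriptive class name: one more hint for the machine -->
<div class="entity-biography">Jane Example is a musician.</div>
```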
Jason Barnard speaking: Other ways to provide clues and help Googlebot are adding human-corrected subtitles to videos or adding chapter markers with titles, and then adding a transcript to the page using the chapter marker titles as subheadings. That will help it understand the video.
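Sketched in HTML, that could mean a captions track on the video plus a transcript whose subheadings mirror the chapter titles. The file names and chapter titles here are invented for the example.

```html
<video controls>
  <source src="interview.mp4" type="video/mp4">
  <track kind="captions" src="interview.en.vtt"
         srclang="en" label="English">
</video>

<h2>Transcript</h2>
<h3>Chapter 1: Introducing the entity</h3>
<p>Transcript text for the first chapter.</p>
<h3>Chapter 2: Career highlights</h3>
<p>Transcript text for the second chapter.</p>
```

The captions file and the on-page transcript corroborate each other, and the matching subheadings tie each chunk of text to a specific chapter of the video.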
Jason Barnard speaking: And don’t forget the basics either. Write clearly using semantic triples, create context clouds, and use simple vocabulary. There’s a PDF download in the additional materials of this lesson that will help you with that.
Jason Barnard speaking: Use helpful anchor texts on links. This is a huge help since Googlebot is able to digest just one page at a time. The anchor texts allow it to guess what is on the other side, in this case, information about the entity. I talk about this more in the lesson about joining the dots in a non-geeky manner.
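As a quick illustrative contrast (the URL and entity name are placeholders):

```html
<!-- Vague: gives the bot no clue about the target page -->
<a href="https://janeexample.com/about">Click here</a>

<!-- Descriptive: tells the bot the target is about the entity -->
<a href="https://janeexample.com/about">Jane Example’s biography</a>
```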
Jason Barnard speaking: Always use Schema Markup when possible. Schema Markup repeats the contents of the page in a structured format that I call Google’s native language.
Jason Barnard speaking: Imagine that Google’s Knowledge Algorithm has made a good guess at the contents of the page and has 40% confidence in its educated guess. By providing corroboration in Schema Markup on the page itself using Google’s native language as I said, it can attribute a higher confidence score, let’s say 80%, to all the annotations the Schema Markup supports.
Jason Barnard speaking: Notably, in the context of Knowledge Panels and the Knowledge Vault, it can attribute significantly higher confidence scores to annotations of information that is important to the understanding of the entity for the Knowledge Panel and the Knowledge Vault algorithms.
Jason Barnard speaking: Now, all of this lesson appears to be focused on the Entity Home. I recommend that you super optimise the Entity Home, since this is the fulcrum for you and for Google’s Knowledge Algorithms.
Jason Barnard speaking: But why stop there? Each and every one of these techniques is well worth taking on all the other pages about the entity you control, as well as second and third party sources. Take every single opportunity where you can help Google’s Knowledge Extraction Algorithm correctly and confidently annotate information about the entity. It’s a no-brainer.
Jason Barnard speaking: Second party corroborations might have limited opportunities for this type of optimisation, but take those that you can. Look at which of these techniques you can implement on the entity’s social profiles and on human-curated databases such as IMDb, Wikipedia, Wikidata, Crunchbase, LinkedIn, and so on. And then reach out to third party sites, articles about the entity, reviews about the entity, and so on, and see what you can do there.
Jason Barnard speaking: Obviously, you have no control. But that also means that as long as the source is relevant and authoritative, the information carries more weight in Google’s algorithms. So, reach out to the person who wrote the information or controls the webpage and ask them to implement techniques that help the Knowledge Extraction Algorithm.
Jason Barnard speaking: As you know, the more confidence Google has in the annotations, the better. And authoritative third party sources carry the most weight. That’s the key.
Jason Barnard speaking: Thank you very much, and I’ll see you soon.