What is Semantic HTML?
Semantics is the study of meaning: how meaning is created and applied to signs. “Why does X mean X?” is a question of semantics.
HTML is the markup language that we use to write web pages. It’s understood by standard web browsers, as well as dozens of other types of “user agents” (including mobile phones, search engine spiders, aural browsers, etc.).
HTML consists of two types of things:
- Tags
- Text content
A few tags can be content of their own (like images, Flash movies, or metadata), but most HTML tags are used to apply structure to content.
Semantic HTML, or “semantically-correct HTML”, is HTML where the tags used to structure content are selected and applied appropriately to the meaning of the content.
So, if you want your HTML to be semantically correct…
The <p></p> paragraph tag pair should only be used to indicate a paragraph (which is a structural concept). It should never be used to add space to a web page. Never, ever, use a series of empty <p> tags to create space!
The HTML tags <b></b> (for bold) and <i></i> (for italic) should not be used simply to make text bold or italic, because that is formatting and says nothing about the meaning or structure of the content. To mark importance or emphasis, use <strong></strong> (strong importance) and <em></em> (emphasis) instead. Most browsers render these as bold and italic by default (though they don’t have to), and they add meaning to the structure of the content.
Always separate style from content
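As a rough illustration of the three rules above, the page below contrasts markup chosen for its visual side effects with a semantic equivalent. The class name `notice`, the spacing value and the text are invented for the example; they are assumptions, not something this article prescribes.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Opening hours</title>
  <style>
    /* Hypothetical class name and value, for illustration only */
    .notice { margin-bottom: 3em; }
  </style>
</head>
<body>
  <!-- Avoid: empty paragraphs used as spacing, <b> and <i> used purely for looks -->
  <p>Our opening hours have <b>changed</b>.</p>
  <p></p>
  <p></p>
  <p>Please <i>check before you travel</i>.</p>

  <!-- Prefer: tags chosen for meaning, spacing handled by a style rule -->
  <p class="notice">Our opening hours have <strong>changed</strong>.</p>
  <p>Please <em>check before you travel</em>.</p>
</body>
</html>
```

In a real site the style rule would normally live in a separate style sheet file rather than in the page itself; it is inlined in the head here only to keep the example in one piece.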
Why semantically correct HTML is better
Writing semantic HTML brings a wide range of benefits:
- Ease of use
- Search Engine Optimisation
Ease of use
First of all, semantic HTML is clean HTML. It’s much easier to read and edit markup that’s not littered with extra tags and inline styling. Clean markup also saves time and money when other people have to interact with it – say, a web developer who has to implement your page template in a content management system or any other web application.
A corollary benefit is that your HTML files are also smaller, so they load quicker.
Unless you’ve had to interact with HTML markup through media other than your web browser, it may not be obvious that your web pages have a life outside the browser window, but they very often do. Web pages can be consumed by humans and machines in lots of different ways!
When you separate visual aspects (i.e. style) from the actual meaning of a document, you end up with a document that always means the same thing. The way it’s presented or consumed can vary. One common technique web designers use is to apply different style sheets for different media. For example, you can apply a certain stylesheet only when a document is printed to paper, another one when it’s viewed on screen, and yet another when it’s accessed by a text-to-speech aural browser.
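One way to do this is the `media` attribute on `<link>` elements in the document head. The sketch below assumes two style sheet files, `screen.css` and `print.css`; the file names are placeholders.

```html
<head>
  <meta charset="utf-8">
  <title>Example page</title>
  <!-- Applied only when the page is shown on a screen -->
  <link rel="stylesheet" href="screen.css" media="screen">
  <!-- Applied only when the page is printed -->
  <link rel="stylesheet" href="print.css" media="print">
</head>
```

The markup in the body stays the same in both cases; only the presentation changes. Support for speech-oriented media types varies between user agents, so it is left out of this sketch.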
A text-to-speech reader can also recognise the tags <strong> and <em>, but it treats text marked with those tags very differently to the way a visual browser does. The TTS reader can adjust vocal tone or volume, rather than contrast or text style, which conveys the same meaning through a different medium.
Search Engine Optimisation
Search engine spiders and crawlers, like Googlebot, represent another genus of user agents. They also consume web page content, in an attempt to discern the meaning within.
When a crawler finds a web page, it stores its assessment of what the page is about in an indexed database to use when matching people’s search queries. The big question is: how do search engines match search terms to known pages to create a prioritised list?
Of course, they all do it a bit differently, but one of the keys to Search Engine Optimisation is to use plain old common sense. If you were a search engine, how would you do it? If you work through the problems a search engine faces, a few things soon become clear, and they are often most easily expressed with the prefix “all other things being equal…”.
Let’s say you have two web pages, each with exactly the same text content (10 kilobytes).
One of the pages has an additional 5KB of HTML markup, neatly annotating the semantic meaning in the content.
The second page has 30KB of additional markup, with inline styles, lots of nested <div> tags, and decorative imagery.
Now, the more graphically intense page might look better to human visitors (might!), but if each page contains the search term “bluebottle” five times, which would you (pretending to be a search engine) judge to be more relevant to someone searching for “bluebottle”?
Clearly, it’s the first, more lightweight page, for a few possible reasons:
- The keyword density of the lightweight page is greater. It features the search term five times in 15KB of content and markup, whereas the second page features it five times in 40KB. Whatever the additional markup is for (the search engine might not be able to tell), it doesn’t seem to be about “bluebottle”.
- Each occurrence of the search term is likely to be closer to the start of the document in the lightweight page than it is in the 40KB page. All other things being equal, the earlier a search term appears within a document, the more likely it is that the document is about that term, or that the term is prominent in the document’s content.
- Assuming that the first document is neatly marked up with semantically correct HTML, it’s more likely that the search term will be placed inside a higher-value tag (such as a heading, or link) than in a more graphical page (which might use an image as a link, perhaps without a proper alt attribute).
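To make the comparison concrete, here is a hedged sketch of how the same piece of content might appear in each of the two pages. Both fragments are invented for illustration (the URL, image file name and class names are assumptions), and real search engines weigh many other signals.

```html
<!-- Lightweight, semantic version: the term sits in a heading and in link text -->
<h1>Bluebottle flies</h1>
<p>The <a href="/insects/bluebottle">bluebottle</a> is a common fly.</p>

<!-- Heavier, presentational version: nested divs, inline styles,
     and an image used as a link with no alt attribute -->
<div class="outer">
  <div class="inner">
    <div style="font-size: 24px; font-weight: bold; color: #003399;">Bluebottle flies</div>
    <div style="margin-top: 10px;">
      The <a href="/insects/bluebottle"><img src="bluebottle-button.gif"></a> is a common fly.
    </div>
  </div>
</div>
```

In the first fragment a program can tell that “Bluebottle flies” is the main heading and that the link is about “bluebottle”. In the second, the heading is just another <div>, and the link gives a crawler no text to read at all.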
When your markup (content, with meaning) is separated from your styles (style sheets for different media), obviously the content can be understood more easily by all user agents. That means not only user agents you already know about, but ones you don’t yet know about (like automated crawlers that create custom RSS news feeds on a certain topic, or image- or video-specific search engines), as well as others that have not yet been invented!
The last couple of years have seen mixing and mashing content emerge as a major feature of new web sites and applications. This can happen without the knowledge of the original site owner, but in most cases this freedom of content to move around the web, adapting to various media, is beneficial to the original creator.
Often in these situations, the content taken from a web page is formatted differently on the new remixed page, which makes it all the more important to remove any style content from the markup itself. (Note that inline styles, applied directly within HTML tags, generally override styles implemented through separate stylesheets, and so they would have to be stripped off programmatically.)
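The override behaviour in that note can be seen in a small test page. This is a minimal sketch; the colours and text are arbitrary, and the <style> element stands in for a separate style sheet file.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Inline style demo</title>
  <style>
    /* Stands in for a rule from a separate style sheet */
    p { color: green; }
  </style>
</head>
<body>
  <!-- Picks up the style sheet rule: rendered green -->
  <p>Styled by the style sheet.</p>
  <!-- The inline declaration wins over the rule above: rendered red -->
  <p style="color: red;">Styled inline.</p>
</body>
</html>
```

A site reusing the second paragraph would have to strip the style attribute before its own colours could apply, short of resorting to `!important` rules.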
Clearly, it’s easier to grab and re-use content from any source, and apply it to any medium, when it does not contain any hard-coded style information, and also when it does contain semantic markup that can help a computer program understand the meaning and structure of the content.