If you have read the previous article, you should know what web applications are and that they are created using the HTML language. In this article I’m going to concentrate more on what HTML is and what is important in it.
A web page must be created using HTML (Hyper Text Markup Language) as that is the language which is understood by Web Browsers. A HyperText is a document which has links to other parts of the same document or other documents. Although this may seem normal now, it was quite a revolutionary idea, because until HyperText’s invention, text documents were read in a linear fashion.
HTML is the most used format of Web pages (documents provided by the WWW service). It is standardized by W3C – WWW Consortium, the latest HTML version is HTML5 finally standardized in 2014.
HTML is not a programming language, there are no variables, conditions, assignments, etc. This makes learning HTML much easier than learning a programming language (although it may look cryptic from the beginning). The HTML language is designed to describe the structure of a text (or mostly text) document. The HTML language is not designed to describe how a text document will look, it only describes structural elements of the document, such as:
An HTML document is parsed (together with styles) by an interpreter built in the web browser. Then it is rendered by the browser rendering engine on the user’s screen. During rendering, the browser figures out where each part of the document should be displayed, how big it should be and how it should look and then draws some lines and points on the screen.
The HTML language consists of the following parts:
Comments are a part of the HTML code which is not interpreted (rendered):
Comments can be used to take notes or mark spots in the HTML document. Note however that they are still visible in the page source code in a web browser (use the View source function), so do not put any secrets in HTML comments.
HTML elements represent parts of a text document. A p
element
represents a paragraph in a text, a table
element represents a table, etc.
HTML elements are organized into a hierarchial structure. A HTML
document is a text document in which this structure is recorded.
Elements are written down using tags:
The above contains:
<
and >
) – p
,<p>
,</p>
,This is an introduction
,id='intro'
.Each HTML element can have attributes, which describe additional properties of the element. Attributes are entered in the start tag in arbitrary order, delimited by space. The above example contains one attribute with:
id
,intro
.Apart from elements, HTML can also contain entities which are placeholders for
special characters – for example >
represents the character >
(greater than).
Entities begin with an ampersand &
and end with a semicolon ;
, they may be encoded as either:
>
,>
(decimal) or >
(hexadecimal).There are two primary reasons for using entities. One reason is that you need to write a character which has a special meaning in HTML. e.g. writing this is incorrect:
Because the <
character denotes the start of a tag in HTML, you should write:
The web browser will take care of rendering the entity as the correct character.
Another reason for using entities is that you need to write a
Unicode character which you cannot type on the keyboard.
For example ❤
will give you a heart.
An HTML document should begin with the <!DOCTYPE html>
header. This denotes that
it is an HTML5 version document. White-space is mostly ignored (there are
few exceptions like pre
and textarea
elements). Which means that HTML
documents are usually formatted to underline the elements hierarchy:
The formatting is fully optional, so the above document is equal to the below one:
Let us have a look at an HTML document example:
If you display the above HTML source document in your browser, you will see a simple page:
At the beginning I said, that a HTML document is a hierarchial structure
(tree structure) of HTML elements.
In the above example, you can see that <html>
is
a root element, other elements <body>
and <head>
are its children.
The element <head>
is a sibling to <body>
. The <body>
element
has children <div>
, <div>
and <footer>
. It is very important to recognize this
hierarchical structure shown in the image below:
Notice that the hierarchical structure of HTML means that the following piece of code is not a valid HTML document:
It is important to understand that an HTML document as this hierarchical structure. It will become more and more important when dealing with stylesheets or a Javascript code.
There are over 100 elements available in the HTML language. The tag names of HTML elements may sometimes seem cryptic, so here is a list of commonly used ones with a little explanation of their names (sorted alphabetically):
<a>
– Anchor (i.e. a link within a text) (line element),<em>
– Emphasis (line element),<strong>
– Strong emphasis (line element),<p>
– Paragraph (block element),<table>
– well, a table (block element),<tr>
– Table Row,<th>
– Table Head (table header cell),<td>
– Table Data (ordinary table cell),<img>
– Image (line element),<ol>
– Ordered List (block element),<ul>
– Unordered List (enumeration) (block element),<li>
– List Item (block element),<div>
– Division (a generic paragraph) (block element),<span>
– Span (meaning range) A generic part of a line (line element).There are many more HTML elements, but don’t worry, you’ll learn them along the way. There are two basic types of HTML elements:
Block elements
represent generic paragraphs, i.e. they are blocks
of a text which span the entire line of the document (in other words – they always start a new text line).
For example a table cannot be inserted in the middle of a line. Inline elements
are parts of a line. Inline elements may be inserted in inline elements
or block elements. Block elements may be inserted only inside other block elements
The exception to this rule is the a
(link) element, which is
a line element, but block elements may be inserted into it. Be careful not to insert any
other active (clickable) element (e.g. input
) as a child of a
). Such a combination of
elements is forbidden.
Any element can have attributes. Attributes contain some additional information about the element. A typical example would be the Anchor element, which (apart from the link text) must contain additional information – the link target.
Attributes must be written in the opening tag, they are written in the form name='value'
, each
attribute can be specified only once. In the above example the attribute name is href
(meaning
‘hypertext reference’)
and the value is the actual URL link (https://youtube.com/).
The order of attributes in a tag is arbitrary.
You can use either single quotes '
or double quotes "
around an attribute value, there is
no difference in HTML.
Another example would be an image which requires two attributes – src
for the image source
and alt
for an alternative text (used when the image cannot be displayed):
The image element can have no other content apart the image itself. Instead of writing
<img src...></img>
it is shortened to the start tag only (see below for lengthy explanation).
There is a number of common attributes, it’s really good to know about:
id
— A unique identifier of the HTML element within the document (used,
e.g. in forms and CSS styles);class
— An arbitrary class of the HTML element, used mainly for CSS styles);title
— An arbitrary text which is shown when a user hovers over a HTML element;style
— A definition of an inline style;There are some oddities when writing down HTML elements:
Empty elements are elements that have no body – for example the img
element has no body, because its
content is defined by the image source. For empty elements the end tag is omitted:
<img src='http://example.com/image.png'>
(valid)<img src='http://example.com/image.png' />
(also valid)<img src='http://example.com/image.png'></img>
(also valid but not recommended)Boolean (true/false) attributes are true when they are present in the tag. They either have no value at all, or they must have their value equal to their name. If you want to set a boolean attribute to false, simple remove it:
<input type='text' required>
(valid, true)<input type='text' required='required'>
(valid, true)<input type='text' required='1'>
(invalid, still true)<input type='text' required='0'>
(completely wrong)<input type='text' required='false'>
(completely wrong)<input type='text'>
(valid, false)Data attributes start with the prefix data-
. Otherwise the name of the attribute
is arbitrary. For example you may write:
Data attributes are useful for passing arbitrary values to JavaScript. However their use would fall into the advanced category, so you probably won’t use them soon.
The head
element contains various information about the page itself, this is commonly
called metadata (data about data in the page). The head
section is not part of the
page itself, so do not confuse this with the actual page header (with logo, menu, etc.) –
that goes to the body
element. The head
element usually contains the following
children:
title
— Set a page title, used for example in the title of the browser window (required).meta
— Set the page metadata (e.g. encoding (see below), page description, author, keywords).style
— CSS styles in page.link
— A definition of related files (external style, fonts, etc.).script
— A JavaScript code or link to a JavaScript code.It is highly recommended to set page encoding with <meta charset='utf-8' />
. Although web
browsers mostly default to utf-8 encoding, you want to make sure that the page
is displayed correctly (even if the user e.g. saves the page to her local drive).
It is also important to save the source code in correct encoding (utf-8)
during its creation. It has to be the same encoding which you declare in the <head>
section. It is not enough to
put <meta charset='utf-8'>
into <head>
and store the source code in
ASCII encoding. The web browser may display some
characters incorrectly until you set your editor to save the file in the correct encoding.
Because web browsers try to be smart (and because end-users want to see the page content), they will try to display HTML pages even if they contain errors. Such pages may or may not render unexpectedly. Errors in HTML pages are tricky, because you might not notice them until there are too many of them and the browser rendering engine can’t take it any more.
To prevent this, make sure that your HTML pages are valid by checking them with the HTML Validator. This checks the document against all the rules outlined in the HTML specification. A valid HTML document has a really good chance that it will display the same way in all available web browsers. To ease the validation process, you should probably install some browser extension.
The standardization organization for all web related standards is W3C – WWW Consortium. The Current HTML version is HTML5 finally standardized in 2014 (with the minor revision 5.1 standardized in 2016). The previous version – HTML 4.01 is still used, you can recognize an HTML4 document by the following (or similar) beginning:
HTML4 is a subset of (sometimes called ‘an application of’) the SGML language. The subset (application) is outlined in a DTD (Document Type Definition) (can you spot it above?). HTML is in fact a quite complicated language, because there is great freedom how things can be written, for example the two documents below are equivalent and valid:
As you can see, in HTML lots of the code is optional, this is intentional so that the language behaves smartly (e.g. when you end a list, it’s clear that the list item is also ended). But this variability made implementations of web browsers quite complicated and in the end also proved to be somewhat confusing for HTML developers as well and ultimately lead to the development of the XHTML language, but before I get to that, I need to tell you about XML.
Parallel to HTML an XML (Extensible Markup Language) was born. The XML language is a generic language for data description, so its companions are e.g. CSV – Comma-Separated Values for tabular data or JSON – JavaScript Object Notation for any structured data. XML has nothing to do with HTML or the structure of text documents. It does not define any interpretation nor any rendering of the data.
However XML looks similar to HTML4 because they both have the same ancestor (SGML). It contains elements, attributes, entities and comments just like HTML does. But it has simplified writing rules:
<?xml version="1.0"?>
<element />
XML is a widely used language with many XML Applications (here ‘application’ means the interpretation of an XML document, not a program), which are again defined using DTD (Document Type Definition). An example of a random XML document:
HTML (at version 4) was complex to write and even more complex to be processed by web browsers. To some extend it felt like it was blocking further web development, because each browser implementation was complicated and different. Therefore people made efforts to simplify it. And one of them was the XHTML language. The XHTML language is an XML application of the HTML language. Which means that it has a simple parser (as XML does), is faster and contains the same elements as HTML.
Unfortunately a number of errors have arisen with this concept. E.g. the elements were not exactly the same; the existing pages were broken in the XML parser (because it is not able to recover in case of an error and simply fails to process the page). Therefore the agile browsers started to interpret XHTML documents with a HTML parser (to which they seemed malformed as well). Even greater confusion occurred, because it was unclear how the browsers interpret different documents. To end it all, W3C finally agreed to start all over with HTML5.
HTML5 is a brand new language which looks like HTML (yes, 3rd one).
It has no relation to SGML, so the <!DOCTYPE html>
is shortened and
no longer references a DTD. It is a new language without a lot of the
historic burden (but some of it is still there) with a complicated implementation of
the interpreter in a web browser. But it is also backward compatible (allows both the HTML and
XHTML syntax) and finally it is current – developed from 2007, standardized in 2014 with
a minor revision 5.1 standardized in 2016.
HTML5 also brought in new features:
article
, nav
, footer
, menu
, figure
, …),font
, center
, big
, …),The HTML5 standard describes the HTML language itself. The most important part of that is the allowed hierarchy of elements and their attributes. Apart from that, the standard describes main representations of HTML:
The three representations are not entirely equivalent
but for most practical work you don’t need to worry too much about this. I suggest
you use the HTML markup language over the XHTML markup language. If you use the XHTML langauge, you
have to set MIME type to application/xhtml+xml
and be prepared that a malformed document won’t be displayed.
There are dozens (maybe hundreds) of web browsers available for browsing the web. Although there are some clearly major browsers, it would still be very difficult to test your page against all available browsers to verify that the page is rendered correctly.
Source: StatCounter Global Stats - Browser Market Share
Luckily there are in fact not so many browsers as it seems. The important (from the HTML point of view) part of a web browser is its rendering engine (also called layout engine) and there are only few of those (sorted by current market-share):
There are other rendering engines with either negligible market-share or with use only in legacy applications. This leaves you with three to five browsers to test your web pages against (in case you are interested in full compatibility). The most important being Blink, Gecko and EdgeHTML. You can leave out WebKit, because it shares much of its code base with Blink and you may add Trident to verify compatibility with the Internet Explorer.
In this article I have described the HTML language principles. Contrary to popular belief, HTML is not just a text document with some tags. It is a document with a hierarchial structure of document elements. You should be aware of the basic properties of the HTML language and its relations to other languages and standards. You should be also aware of the features available in HTML5. Feel free to go through the corresponding exercises to get yourself acquainted with HTML.