In the previous articles, you have learned about the principles of relational databases, the basics of SQL and aggregation in SQL. In all those articles, I used the example database from the project. There are a number of questions you should be asking yourself by now, such as:
In this article, I’ll show you how to design your own database structure. The approach I will describe has no formal methodology. It is a kind of common-sense approach which is suitable for small or middle sized databases. A middle sized database is something you can still keep in your head – for most people it is around one hundred tables.
Database (or data) modeling is a process in which you analyze the application requirements (usually retrieved from your customers) and design a database schema. The database schema describes the structure of data stored in your database. Before you move on to the application implementation, you should validate whether the schema follows certain good practices (normal forms) and fulfills the customer requirements.
The outcome of database modeling is a database model. There are many approaches that can be used, for a relational database, the best and most common option is an Entity-Relationship model. This model is written down using ERD (Enity-Relationship Diagram). An alternative to the ER model is e.g. an Object Oriented Model.
E-R Modeling works with the following concepts:
If you still remember the principles of relational databases, the above should be very familiar to you. It uses the same words and same concepts as relational algebra – apart from relation and algebra :) E-R Modeling is a tool to create diagrams, which nicely convert to relational databases. If you find E-R modelling similar to object oriented modeling, that is good too, because they both are tools to solve the same problem – describe and design the structure of data.
Do not confuse relation and relationship! While relationship is colloquially abbreviated to relation, in databases, they have to be distinguished. Can you describe the difference?
A relation is a set of entities, usually written down as a table. A relationship is a connection between entities.
The most common representation of an E-R model is an ERD (E-R Diagram). There are multiple variants - e.g. Classic (conceptual), Crow’s foot, … Probably the most common one is Crow’s foot (which I will show you later). I will Start with the Classic / Conceptual diagram, because it is very useful at the beginning of database design and it is especially good for beginners – it requires only brain, paper and a pen. This means, when you get more proficient, you can safely abandon it.
The Conceptual diagram uses the following notation:
Let’s look at the project assignment:
Create a web application for recording persons and contacts. The main goal of the application is to record persons (friend, acquaintances), their addresses, relationships and meetings. Each person can have a name, nickname, age, location and contacts. Each person can have any number of contacts (mobile, Skype, Jabber, ….). A person can have more contacts of the same type (e.g. two emails). Each person can have any number of relationships with other persons in the database. Each relationship should be of a type (friend, fiend, acquaintance, spouse, …) and description. The contact and relationship types are recorded in the database and can be modified by the end-user. The application also records meetings between persons. Each meeting can be joined by any number of persons. Each meeting should have a place and date. The application must allow user friendly entering and modifying the data. Take advantage of the proposed schema, create a database and implement the entire application.
Now start with identifying entities and their attributes. The simplest approach is to underline every noun in the requirements description (I will ignore general terms: data, database, goal, and application):
Create a web application for recording persons and contacts. The main goal of the application is to record persons ( friend, acquaintances), their addresses, relationships and meetings. Each person can have a name, nickname, age, location and contacts. Each person can have any number of contacts ( mobile, Skype, Jabber, ….). A person can have more contacts of the same type (e.g. two emails). Each person can have any number of relationships with other persons in the database. Each relationship should be of a type ( friend, fiend, acquaintance, spouse, …) and description. The contact and relationship types are recorded in the database and can be modified by the end-user. The application also records meetings between persons. Each meeting can be joined by any number of persons. Each meeting should have a place and date. The application must allow user friendly entering and modifying the data.
Now you take the nouns aside and you should obtain the following list:
Now you can directly see entities and their attributes. If you notice an attribute being another entity (a person can have a contact), you have just stumbled upon relationship. From here you should be able to draw the conceptual ERD. Really try to do it yourself before you look at the solution.
The above diagram is one of an infinite number of solutions. The requirements are always incomplete therefore you have to apply common sense and your own judgement to fill in the blanks (for example, what are the address attributes?, what other attributes does a contact needs to have?). Therefore your solution is probably different from mine and it is perfectly ok. There is no single right solution or single point of truth.
Also bear in mind that we are making a model of the data the application will use. The most important feature of any model is that it is simplified. Therefore the task here is not to enumerate all attributes a person can have. The task is to enumerate all attributes of a person which will be useful in the application. Also keep in mind that the database model is bound to change. Therefore if you are not sure about something, don’t worry too much about it. There is always the possibility (and often the need) to change it.
When you have the entities, attributes and entity relationships, you can move to the next step. This is evaluation of relationship cardinality. Relationship cardinality (or relationship degree or cardinality ratio) describes how many entities can relate to the other entity in the relationship. Relationship cardinality is always evaluated two-way.
Relationship cardinality is usually (but not necessarily) evaluated together with Optionality. Optionality describes if one entity that depends on another one can be either mandatory or optional. I.e. if an entity can exist without being related to the other entity in the relationship.
For example, if you have entities person
and car
and the relationship
drives
, you can ask the following questions:
The answers above lead to the relationship DRIVES(PERSON:(0,1), CAR:(0,1))
. Now let’s
consider the relationship owns
.
The answers above lead to the relationship OWNS(PERSON:(0,N), CAR:(1,1))
. There are three types of
relationship cardinalities:
DRIVES(PERSON:(0,1), CAR:(0,1))
OWNS(PERSON:(0,N), CAR:(1,1))
USES(PERSON:(0,M), CAR:(0,N))
If the relationship allows multiple entities, we do not count them precisely. For the
database design, it is important to know only whether there can be one or more. Now
you probably think – it is possible that two people own a single car or
car does not have to be owned by anyone or something like that. The truth is that
every real relation in the world is M:N or 0..M:0..N
. Because you can always find
some kind of exception to the above answers. Consider for example the drives
relationship.
Is it possible that one car is driven by two people? Is it possible that a single
person can drive multiple cars?
Yes, for example in the driving school or consider for example Abby and Brittany Hensel or remote controlled cars.
Remember that we are making a model of the real-world. Therefore you should
always select the lowest cardinality ratio acceptable for the application.
This is extremely important. The database model must be a usable simplification
of the real world. If it were not a simplification, the result application would be
infinitely complex (and therefore infinitely expensive). Therefore in the above case
you would probably consider drives
a 1:1 relationship.
This poses the question – how to deal with the exceptions in the database model. The answer to this question is very difficult and depends highly on the application itself. Sometimes a field allowing the user to enter an arbitrary note is sufficient. Sometimes it is not.
Once you have the E-R model – the entities and their attributes and the relationships and their cardinalities – you have to convert it to the actual database schema. The E-R conceptual model is very close to the relational database model (as you have probably noticed by now):
A relationship in the ER model can be represented in two ways in a relational database schema:
It is obviously easier to implement it using only a column and use the table only if
necessary. The relationship must be represented by a table either when the
relationship cardinality is of the type M:N or when the relationship itself has attributes.
An example of a relationship with attributes is: TENANCY(LANDLORD, TENANT, BEGIN_DATE, END_DATE, RATE)
.
All of the attributes BEGIN_DATE
, END_DATE
and RATE
are attributes of the tenancy relationship.
They are not properties of the landlord neither of the tenant.
For example if you have a relationship between person
and location
, you have multiple options
(depending on the application requirements):
person
table:
PERSON(ID_PERSON, FIRST_NAME, NICKNAME, CITY, STREET)
location
table and link it to person
:
PERSON(ID_PERSON, FIRST_NAME, NICKNAME, ID_LOCATION)
LOCATION(ID_LOCATION, CITY, STREET)
PERSON(ID_PERSON, FIRST_NAME, NICKNAME, ID_LOCATION)
LOCATION(ID_LOCATION, CITY, STREET)
RESIDENCE(ID_PERSON, ID_LOCATION, BEGIN_DATE)
A table which exists to represent a M:N relationship is called an association table. Typically,
the association table has a compound primary key – id_person
+ id_location
in the above
residence
table. In the sample database,
there is a person_meeting
association table.
Sometimes this is also referred to as a weak entity. The
person_meeting
(or residence
) entities have no primary key of their own. They are only
identified by foreign keys to other entities.
By converting the relationships, your database schema will gain some new columns and tables. Then you should check if the database is normalized. Normalization is conversion to normal forms (NF). There are more normal forms, but first three are most important:
Normal forms are only recommendations (violation must be justified). Normal forms are nested – if a relation is in 3. NF, it must be in 2. NF and 1. NF too. If a database is normalized, it has some positive properties. If a database is not normalized, you must verify those properties by yourself (or live without them).
The first normal form requires that a single value (table cell) of each attribute is
atomic – which means indivisible. That means that it contains only a single value
and that value cannot be split into something else. Consider this relationship
table which
records relations between two persons:
1_person_name | 2_person_name | 1_person_address | 2_person_address | begin_date |
---|---|---|---|---|
Karl Oshiro | Marcel Miranda | Mozartstraße 33, Linz | Fountain Rd 27, Stirling | 2016-01-06 |
Remona Deen | Karl Oshiro | Old Rd 182, Muir of Ord | Mozartstraße 33, Linz | 2015-12-20 |
Tuan Brauer | Marcel Miranda | Davenport St., 12, Bolton | Fountain Rd 27, Stirling | 2015-10-09 |
This definition of the table relationship
is not in 1. NF because the first four columns are not atomic.
For example, it is unclear what first and last name is. It is also unclear what the city and street is. Nothing
prevents the end-user from entering any string into any of the fields – in some countries and languages,
the street number is written first, in some it is written last. The same goes with the name – what is
the family name in Park Geun-hye ? These ambiguities make it impossible to reliably sort
by the last name or city, or to reliably search for persons living only in a certain city etc.
The (hopefully) obvious solution is to split the value into multiple columns (first_name
, last_name
,
city
, street_name
, street_number
). Normalization to 1st NF therefore adds columns to your
database schema. While this may sound easy, you will probably run into situations where it is not
entirely easy to decide what is atomic and what is not.
Consider for example the column begin_date
, it is a date value, which consists of a year, month and day.
Is this atomic? Yes and no. If you consider it being a date value – i.e. a reference to a point in
time – it is atomic. If you consider it a string of three values (in some ambiguous format), then
it is not atomic. So under normal circumstances, you would consider date as being atomic and use
a single column for it. Is it useful to represent date as three separate values? Sometimes. For example, when
one of the parts of the date is optional – something has to be done on 10th day of every month; something
was manufactured on January 2016, etc. When one part of the date is missing, it ceases to be a
reference to a point in time and starts to be a collection of (three) values.
You have to keep in mind that you are making a model of the data and therefore atomicity is evaluated to the application needs. One might for example argue that first_name ‘Karl’ is a string of four characters and therefore it is not atomic. However, from the application point of view, it is a single value – first name, which cannot be split into meaningful (for the application) parts.
While all this may seem like nitpicking, it is very important. All these reflections and considerations help you understand the nature of the problem you’re solving in your application. Thus they help you design the database structure correctly.
Below is a modified relationship
table recording relationships between persons. The table is
in 1st NF, the column ssn
contains Social Security Number.
The key of the table is a combination of 1_ssn
and 2_ssn
because that identifies the
relationship between two persons.
1_name | 1_surname | 1_ssn | 2_name | 2_surname | 2_ssn | begin_date |
---|---|---|---|---|---|---|
Karl | Oshiro | 123 | Marcel | Miranda | 987 | 2016-01-06 |
Remona | Deen | 456 | Karl | Oshiro | 123 | 2015-12-20 |
Tuan | Brauer | 789 | Marcel | Miranda | 987 | 2015-10-09 |
The above table is not in 2nd NF. A table is in 2NF if
it contains no partial functional dependencies of non-key attributes on the key.
The column begin_date
has full functional dependency
on the key, because both parts of the key (1_ssn
and 2_ssn
) are necessary to identify the begin_date
of a relationship. On the other hand, the columns 1_name
, 1_surname
,
2_name
and 2_surname
have only partial functional dependency on the
key because they depend only on either 1_ssn
or 2_ssn
. I.e. changing
the value of 2_ssn
in the first row would have no effect on Karl Oshiro.
Therefore the table is not in 2nd NF. A table which is not in 2nd NF contains
a lot of redundant data.
If a table is not in 2nd NF, it suffers with Insert / Update anomaly and Delete anomaly. Delete anomaly would occur if you deleted the last relationship from the table. With deletion of the relationship you would also loose all records about the Tuan Brauer person. Similarly, an insert anomaly occurs when you want to insert a new person in the database. You would have to fabricate a relationship for the person.
2nd NF applies only to a table which has a compound key. If a table has no
compound key, then no partial dependencies can occur and the table is in 2nd NF.
Although in practice, most tables have a compound key. Violation of 2nd NF comes
from an incorrect splitting of entities in the analysis phase. While it may
seem obvious in the above case, where I have mixed person
and relationship
into a
single table, there are cases when the differences may be much more subtle.
However, the insert / update / delete anomaly is something you will definitely notice
and the only way to avoid it is to redesign the database.
Redesigning the database means splitting the above table into two tables – relationship
:
1_ssn | 2_ssn | begin_date |
---|---|---|
123 | 987 | 2016-01-06 |
456 | 123 | 2015-12-20 |
789 | 987 | 2015-10-09 |
And person
:
ssn | name | surname |
---|---|---|
123 | Karl | Oshiro |
456 | Remona | Deen |
789 | Tuan | Brauer |
987 | Marcel | Miranda |
The above two tables are in 2nd NF (and also in 1st NF). No data is redundant and inserting a person does not require inserting a relationship. Similarly, deleting a relationship does not delete a person. To convert a table into 2nd NF, you have to split it into two tables. Splitting must not cause information loss – it must be possible to reconstruct the original information by joining the tables.
The definition of 3rd NF requires that no non-key attribute transitively depends on the key. Every
non-key attribute depends on a key (that is a definition of the key).
I.e. if you change the value of id
in the below table, all the other columns will change too
(because it will be a different address).
id | street | num | city | zip |
---|---|---|---|---|
1 | Mozartstraße | 33 | Linz | 4020 |
2 | Old Rd | 182 | Muir of Ord | IV6 7UJ |
3 | Davenport Street | 12 | Bolton | BL1 2LT |
4 | Malt Street | 62 | Bolton | BL1 2LT |
The above table is not in 3rd NF because it contains transitive dependencies on the key.
The attribute zip
depends on the attribute city
– if you change the value of
city
, the zip
code changes too. Therefore the attribute zip
depends on the key
(which is id
) transitively.
A table which is not in 3rd NF contains redundant information, which can
be seen in the last two rows. Redundant data in tables leads to
inconsistency.
To convert the table into 3rd NF, means to split it into two tables addresses
:
id | street | num | zip |
---|---|---|---|
1 | Mozartstraße | 33 | 4020 |
2 | Old Rd | 182 | IV6 7UJ |
3 | Davenport Street | 12 | BL1 2LT |
4 | Malt Street | 62 | BL1 2LT |
And zip_codes
:
zip | city |
---|---|
4020 | Linz |
IV6 7UJ | Muir of Ord |
BL1 2LT | Bolton |
The above database definition is in 3rd NF. Converting to 3rd NF produces so called
reference tables. These are
(usually two-column) tables which contain lists
of allowed values for something. In the sample database, there are the reference tables
contact_type
and relation_type
.
However, the above example with addresses is not entirely perfect. Addresses do have very high
variability over the world. While the above may be acceptable for continental Europe, there
may be countries where there are no streets, or street numbers or places without a zip code or
whatever. The normalization of addresses is often not very strict. Sometimes, you may run into
databases which simply have address_line_1
and address_line_2
fields. Which is obviously
not atomic, but it may be much more practical in some cases (if you use the
address only as a ‘single value’ printed on a package or letter).
This leads us to the concept of denormalization. It is perfectly OK to use a relational database which is not normalized, provided that you know why and understand the consequences.
One example of this is intentional avoidance of normalization like in the case of addresses. The risk of inconsistency is minimal, because the end-user always enters the entire address (including the city and zip code) and therefore you may consider the entire address to be a single value. The inconsistency is likely to happen in case someone decided to renumber zip codes, which happens, but so rarely, that you can deal with it manually when it happens. All this means that it may be simply impractical to normalize addresses fully, because the positives are not going to outweigh the increased implementation complexity.
Another example of using a database which is not normalized are analytical databases. These databases are build from normalized OLTP databases and intentionally store data in joined tables to make select queries easier and faster. Sometimes such tables may be used in OLTP databases too, but they should always be built from the base normalized tables.
When creating a database schema, we go through the following steps:
By following the above steps, you should transition from a conceptual model (which contains entities and relationships between them):
To a logical model model, from which the database schema can be created (a physical model):
The above diagram is also an ERD in so called Crow’s foot notation. This notation displays relations (tables) and their definitions directly. With proper design software, the SQL queries to create tables in the database can be generated automatically. You can then start using the database.
Table and column names are usually typed in lowercase. Using uppercase letters in a table or
column names cause the need to enclose their names in double quotes when typing SQL queries in
PostgreSQL. For example the following query SELECT * FROM UserAddress
won’t work (even if such
a table exists), you have to use double quotes: SELECT * FROM "UserAddress"
because PostgreSQL
converts the able name to lowercase without quotes. An underscore character is used to divide words,
because the minus sign is reserved for subtraction. Therefore the easiest way is to use user_address
as a table name in this case. You can name your tables in singular or plural,
both ways are used.
In this article I have described the process of designing a database given textual requirements for an application. Keep in mind that the above description is applicable to small or middle-sized databases. For huge databases you need to use a proper methodology. Also keep in mind that this is a creative process with no ‘one true solution’ so you have to decide and choose one of the many different options. ERD and database normalization are tools that help you along the way, but ultimately you must be able to defend your own design decisions.