intro

module1 : Your first database

Introduction

This is an introduction to relational databases with practical case we are going to investigate to get our hands dirty and not just rely on the theortical side. In the resources section below I mentioned some of the cool resources to get started learning and working with sql and relational databases.

our case:

Why to use relational database?

  • real life entites becomes table: A database models real-life entites like professors, students, universities by storying them on tables

  • reduces redundancy: each table contains data from a single entity type, this reduces redundnacy by storing entites only once

  • data integrity by relationships: the database can be used to model relationships between entities, for example: a professor can work on a mutliple universities, a company can have mutiple workers we can define how excatly data relates to each others

what you're going to learn:

  1. creating and manipulating with the data

  2. creating a relational database from scratch

  3. learning the following three concepts, which help preserve data quality on databases:

    1. constraints

    2. keys

    3. referential integrity

Query information_schema with SELECT

Information_schema: is a meta-database that holds information about your current database, it has multiple tables you can query with the select * from syntax:

  • tables: information about all tables in your current database

  • columns: information about all columns in all of the tables in your current database

  • and many more.

public schema holds information about user-defined tables and databases

tables: the core of relational databases:

tables is the most important concept of relational database, In this section we are going to see how deal with redundancy in tables let's see what's inside the first three records of 'university_professors' tables

Image

Here we will begin to notice that professor 'karl' is repeated on the first three column and his university 'ETH Lausanne' also repeated couple of times because he only works for this university, However he has a affliations with at least three differenet organization as seen So there is a certain redundancy in this table, and the reason for this redundancy is that table contains entites of at least different types let's look at them

image
  1. professors highlighted in green

  2. universities highlighted in blue

  3. organization highlighted in brown

let's look at the database once again, the graphic shown here is called entity-relationship diagram

image

squares denotes entity types, while circles denote attributes(columns) as we said before that the table 'university_professor' contains at least three different entities, so we need to model them now, each one at it's own table

image

Now after we've created the entity-relationship diagram, the next step is to actually make those tables and connected them together But before we actually do this, we need a refresher of the syntax to do so:

CREAT TABLE syntax

Your first task:

  1. create table professors table with two text column: firstname, lastname

  2. create table university table with three text column: university_shortname, university, university_city

solving the task:

oops we forget to add university_shortname to professor table! don't worry to do so you need to alter the table then add the column the syntax to do so:

Let's to it

updating the tables

only the university_professor holds data. Now we've 5 different tables so far, it's time to migrate data from the university_professors table to the 4-other tables

If we have another look at the data on the university_professor table, we see:

image

but if we see the unique organizations stored on the university_professor table:

image

So before coping the data from the original table, we need first to make sure we copy the distinct data to reduce redundancy, in order to this we use the ""INSERT INTO SELECT DISTINCT" patterns the syntax as follow:

so out query will be like this:

In the last section when we create the tables, in the affliations table, instead of typing 'organization' we've typed 'organisation, so we need to fix this now, to do so the syntax as follow:

also we don't need the university_shortname column, so are going to delete(drop) it from affliations table:

but why did we delete the university_shortname at the first place? querying the university_professors table for uniquefirstname, lastname and university_shortname results of 551 unique records

image

and querying only the firstname and lastname results of 551 unique recordss image

so the firstname and lastname uniquely identify a professor

Now after migrating the data from the university_professor table to other table, it is not needed anymore, so we are going to delete it.

module2 : Enforce data consistency with attribute constraints

Better data quality with constraints

You are going to learn:

  1. what are constraints? and why to use them?

  2. making our data consistent using those constrains

Althought we have setup a simple database, we haven't used most of the features database provides the idea of database it push data into certain structure, a pre-defined model where you enforce data types, relationships, and other rules these rules are called 'Integrity constrains'

Integrity constrains:

Integrity constraints can be divided into three types:

  1. Attribute constraints: e.g data type

  2. key constraints: e.g primary data type

  3. Referential integrity constraints: e.g foreign keys

In this section we are going to explore the first type, and in the next two sections, we will look at the other two

why constraints?

  • Constrains enforce a certain data form: things can get messy when people are in charge of entering data, so we need to make sure that data entered are consistent and have a rigid form!

  • Help enforce data qualiy, and data quality is important for both users and developers

here are the common data types on postgreSQL that work as attributes constraints: image

Data types also restrict possbile sql operations on stored data: for example

this will produce error, exactly as almost programming language will do

the solution for this is to use CASTing as follows:

This is on-the-fly type conversion

working with data type:

  • Enforced on columns (i.e. attributes)

  • Define the so-called 'domain' values of a column: meaning what form these values can take and what not

  • Define what operations are possible -Enforece consistent storage of values

PostgreSQL datatypes(some):

  • text : character strings of any length

  • varchar [ (x) ] : a maximum of n characters

  • char [ (x) ] : a

  • fxed-length string of n characters

  • boolean : can only take three states, e.g. TRUE , FALSE and NULL (unknown)

specifying types while creation:

specifying types after creation:

Now let's get back to our database tables, we need to do some Constrains

Task:

  1. in professor table specify a fixed-length character type with the correct length for university_shortname.

  2. change the type of firstname to varchar(64)

solution

Convert types USING a function

If you don't want to reserve too much space for a certain varchar column, you can truncate the values before converting its type.

The not-null and unique constraints

those are the most important two attributes constraints

Not-null constraints:

  • Disallow any NULL values

  • Must hold true for the current state and the future state so you can't insert null on that columns

what does null means?

Nothing xD, HAHAHAHA actually it does represent 'non exising value':

  • Does not exist or

  • unknown

adding or removing not-null constraints

  1. While creation

  1. After creation:

Unique constraint:

  • Disallow duplicate values on a column

  • Must hold true for the current and future

what does 'holding true' means? meaning that you couldn't apply unique constrains if the column does hold actually duplicates, because think of it, how the DBMS know now what to remove from what to persist?

Adding or removing unique constraints:

TASK:

  1. add not-null constraints on professors' firstname and lastname

  2. make universities' shortname unique

  3. Make organizations.organization unique

solution:

module3: Uniquely identify records with key constraints

keys and superkeys

let's take a look at our current database: image

After completing this section, our database will look like this: image

what is that underlyed id? where did it come from? and why? we will examine all those question right now

what is key:

  • Attribute(s) that identify a record uniquely: Normally, a table, as whole, only contain unique records, meaning that the combination of all attributes is a key in itself, However it's not called a key, but a superkey if attributes from that combination can be removed, and attributes still uniquely identify records If all possible attributes have been removed but the records are still uniquely identifed by the remaing attributes, we speak of minimal superkey

  • As long as attributes can be removed: superkey

  • if no more attributes can be removed: minimal superkey or key: this is a actual key, so the key is always minimal

Let's look at an example: a database from the textbook: Fundementals of Database Systems

image

the table shows 6 different cars, so the combination of all attributes is attributes

if we remove the year from the the superkey, the six records are still unique, so it's still a super key

Actually there are alot of superkeys in this example

However there are only 4 minimal superkeys

remember that superkeys are minimal when no attributes can be removed without losing the uniqueness property, this is trivival for k1 to k3 because they only consist of single attribute

also if we remove year from k4, 'make' would contain duplicates, and would, therefore be no longer suited as key.

there four minimal superkeys are also called candidate keys why candidate keys? In the end, there can only be one key for the table, which has to be chosen from the candidate keys

Identify keys with SELECT COUNT DISTINCT:

How to know if a given column could be qualified to be a key? There are some steps you can follow to know:

  1. Count the distinct records for all possible combinations of columns. If the resulting number x equals the number of all rows in the table for a combination, you have discovered a superkey.

  2. Then remove one column after another until you can no longer remove columns without seeing the number x decrease. If that is the case, you have discovered a (candidate) key.

The table professors has 551 rows. It has only one possible candidate key, which is a combination of two attributes, so let's try to find that candidate key

output: 551, let's try to remove firstname

output: 546, so let's try to remove lastname and use firstname instead

output: 479, so let's try again removing the university_shortname

output: 551 , that's it Indeed, the only combination that uniquely identifies professors is {firstname, lastname}. {firstname, lastname, university_shortname} is a superkey, and all other combinations give duplicate values. Hopefully, the concept of superkeys and keys is let's move to primary key!

primar key:

primary key is one of the most important concepts in database design, almost every database shoud have primary key chosen by you from the set of candidate keys. the main purpose as already explained, is uniquely identifying records in a table this makes it easier to reference these records from other tables. you may have seen that primary keys need to be defined on columns that don't accept duplicat or null values

the syntax of specifying primary key:

  1. upon creation

  1. to exising table:

so back to your database tables, we now need to make a primary key for every table we have, so let's do it:

Surrogate keys: (_id)

surrogate keys are sort of an artificial primry key. they are not based on a native column on your data, but on a column that just exists for the sake of having a primary key.

  • Primary keys should be build from as few columns as possible

  • Primary keys should never change over time.

let's look at this table: image in this table, the 'license_no' column would be suited as the primary key - the license_no is unlikely to change over time, not like the color column, for example, which might change if the car is repainted. so there's no need for surrogate key here but if we have this table: image the only sensible primary key here for this table is the combination of {make, model} but this are two column for a primary key we can add a surrogate key column, called "id" to solve this problem

Adding a surrogate key with serial data type

using serial data type in PostgreSQL which allow us to add auto incrementing number column

Another type of surrogate key

back to our database tables, we need to add surrogate key to professors table, because it's attributes can't make a suitable primary key, we could have two professor with the name working for different universites which does make duplicates

an example of combining two table together:

module4: Glue together tables with foreign keys

Model 1:N relationships with foreign keys

  • a foreign key(FK) points to the primary key(PK) of another table

  • Domain of FK must be equal to domain of PK: there are some resitrictions to the foreign keys: first the domain and the datatyp must be the same as one of the primary key. seconldy

  • Each value of FK must exist in PK pf the other table(FK constraint or 'referential integrity'): only foreign keys values are allowed that exist as values un the primary key of the referenced table

  • FKs are not actual keys: because duplicates are allowe

our database model will be like this: image

A more than just one professor can work into one univeristy not multiple universities at the same times

example of the syntax used to make foriegn keys:

  1. upon creation

  1. after creation

now let's implement this on our database tables

the result is image

Model more complex relationships

how to implement many to many relationship?

  • create ordinary table that contain two foriegn keys that points to both connected entites

in our tables: we have this relationship now image as professor can work for multiple organizations and an organization could have mutliple professors working for it

in this case we need to add additional attributes such as function

notice that: no primary keyis defined here, because theoritcaly a professor can have mutliple functions in one organization

now we will be populating the professor id column on affliations from professors.id column directly here is the syntax:

now let's implement this:

let's drop firstname and lastname from affliations, we don't need it anymore

Referntial integrity

round up

count afflitions per university

Join all the tables together

In this last exercise, your task is to find the university city of the professor with the most affiliations in the sector "Media & communication".

For this,

you need to join all the tables, group by some column, and then use selection criteria to get only the rows in the correct sector. Let's do this in three steps!

Last updated