Improving Query Performance

Introduction, Review and The Order of Things

the order of the syntax, as written in the query, is different from the processing order in the database

SELECT Country, Place, Magnitude
FROM Earthquakes
WHERE Magnitude >= 9
ORDER BY Magnitude DESC;

Syntax order:

  1. SELECT

  2. FROM

  3. WHERE

  4. ORDER BY Processing Order:

  5. FROM

  6. WHERE

  7. SELECT

  8. ORDER BY

the table of processing order:

1. FROM
2. ON
3. JOIN
4. WHERE 
5. GROUP BY
6. HAVING

7. SELECT 
8. DISTINCT
9. ORDER BY
10. TOP

thosw 2 queries will produce the same results, but it's not the correct way to use having

in this case, where is more effient because rows are filtered first before grouping and summing

  • tip: Don't use HAVING to filter individual or ungrouped rows

  • tip: Use WHERE to filter individual rows and HAVING for a numeric filter on grouped rows

because a numeric column that is using a HAVING filtering condition must be enclosed in an aggregate function?

  • note: HAVING is always after GROUP BY

Filtering and Data Interrogation

Interrogation after SELECT

Here we're going to discuss the effect of Interrogation over performance processing order after select

SELECT * is very bad for performance, always select the columns you need.

SELECT * in joins returns duplicates of joining columns

and output so you need to explicitly state the columns using aliases

_> actually there's no top of bottom, with ordering we can sort the data desc or asc then choose the top, so we could the choice the top bottoms or the top tops!

  • tip: ORDER BY works well for interrogating data but slows performance in a query

Managing duplicates

DISTINCT() and UNION() and their potential effects on performance

-> are there any differences between those 2 queries?

-> the answer is 'No', they're pretty similar

_> if we now for sure there's on duplicates in the table, then use UNION ALL because It didn't use a sorting algorithms that slow the performance!

Sub-queries and presence or absence

There are 3 places to use sub-query with [1] FROM the result of this query show seven customers had to wait 35 days or more for their order to be shipped [2] WHERE the result of this query show three customers having made orders with a freight weight of more than 800 kilograms [3] SELECT

Types of Sub-queries

[1] Uncorrleated

[2] Correlated

  • tip: correlated sub-queries can be quite inefficent because the sub-query executes for each row in the outer query

  • tip: correlated sub-queries can be replicated using inner join

consider using inner join, which only makes one pass through the data!

Practice

You want a query that returns the region and countries that have experienced earthquakes centered at a depth of 400km or deeper. Your query will use the Earthquakes table in the sub-query, and Nations table in the outer query.

A friend is working on a project looking at earthquake hazards around the world. She requires a table that lists all countries, their continent and the average magnitude earthquake by country. This query will need to access data from the Nations and Earthquakes tables

You want to find out the 2017 population of the biggest city for every country in the world. You can get this information from the Earthquakes database with the Nations table as the outer query and Cities table in the sub-query.

create this using: [1] correlated sub-query

[2] inner join

Presence and Absence

Checking the presence or absence of data is common task of a data scientist.

INTERSECT and EXCEPT

INTERSECT is one of the easier and more intuitive methods used to check if data in one table is present in another

EXCEPT does the opposite of INTERSECT. It is used to check if data, present in one table, is absent in another.

Practice

You want to know which, if any, country capitals are listed as the nearest city to recorded earthquakes. You can get this information by comparing the Nations table with the Earthquakes table.

You want to know which countries have no recorded earthquakes. You can get this information by comparing the Nations table with the Earthquakes table.

Alternative methods

EXISTS filter the outer query when there is a match of data between the outer query and a sub-query It either evaluates for true or false on a match,

IN works the similar way with sub-queries, however we need to specifiy the columns, to match on, in the WHERE filter condition of the outer query, and the SELECT statement of the sub-query

EXISTS VS IN They are both the same, but with small difference the sub-query with EXISTS looks like a corrleated sub-query which we said previously can be inefficent (technically it's a correlated sub-query )

BUT,

  • EXISTS willl stop searching the sub-query when the condition is TRUE

  • IN collects all the results from a sub-query before passing to the outer query! which is potentially slower

  • tip: consider using EXISTS instead of IN with a sub-query

NOT EXIST and NOT IN are the opposite, they check the presence of data in one table that is absent in another!

NOT IN and NULLS there's one major issue using NOT IN with null values

if the columns in the sub-query being evaluated for a non-match, contain null values, no result are returned

to handle this behaviour, the sub-query needs IS NOT NULL

  • tip: consider using not exists instead of not in because of the nullish value misbehaviour!

Practice

You want to know which country capitals have never been the closest city to recorded earthquakes. You decide to use NOT IN to compare Capital from the Nations table, in the outer query, with NearestPop, from the Earthquakes table, in a sub-query.

Alternative methods 2

inner join and left outer join

An exclusive LEFT OUTER JOIN can be used to check for the presence of data in one table that is absent in another table.

Adv and disadv of all methods

Query performance tuning

in this section we're going to discuss:

  • how to quantify and measure performance

  • some performance techniques to use for performance tunning

Time statistics

statistics time -> from slides

  • cpu time: slides

  • elapsed time; total duration of the query from Execution to returning the complete results back to us

Elapsed time vs cpu time

Page read statistics

Another way of measureing performace is by measuring the amount of

the output will be look like

we've read 32 data pages from memory to complete the query!

Awesome! Logical reads are a measure of the number of the 8-kilobyte pages read from memory to process and return the results of your query. In general, the more pages that need to be read the slower your query will run.

indexes

Indexes are structures added to the table to improve speed of accessing data from a table.

used to locate data quickly without having to scan the entire table. this makes them particulary useful for improving performance of Queries with filter conditions they are applied to a table columns and can be addd at any time.

the two most common table indexes are Clustered and Nonclustered

Clusterd and Nonclustered Indexes

  1. Clusterd Index: A good analogy for a clustered index is a dictionary where words are stored "alphabetically".

Clustered indexes reducee the number of data page reads by a query which helps speed up search operations.

A table can only contain one Clustered index.

  1. Non Clustered Index: A good analogy is a text book with an index at the back

Data in the book is unordered and the index at the back indicated the page number containing a search condition.

Another layer in the index structure contains ordered pointers to the data pages.

A table can contain more than one Nonclustered index.

It's common used to improve table insert and update operations.

Clustered Index: B-tree structure

A clustred index creates what is called a B-tree structure on a table.

  • Root node

  • branches the nodes

  • leaves

[-] The root node: the root node contains ordered pointers to branch nodes which in turn contains orderd pointers to page nodes.

the page node level contains all the 8 kilobyte data pages from the table with the data physically ordered by the columns(s) with the clustred index.

Great job! A clustered index will reduce the number of logical reads because the index will direct the query to the table data pages that meet the filter condition. Without a clustered index, all pages are scanned.

Execution plans

Last updated