Improving Query Performance
Introduction, Review and The Order of Things
the order of the syntax, as written in the query, is different from the processing order in the database
SELECT Country, Place, Magnitude
FROM Earthquakes
WHERE Magnitude >= 9
ORDER BY Magnitude DESC;Syntax order:
SELECT
FROM
WHERE
ORDER BY Processing Order:
FROM
WHERE
SELECT
ORDER BY
the table of processing order:
1. FROM
2. ON
3. JOIN
4. WHERE
5. GROUP BY
6. HAVING
7. SELECT
8. DISTINCT
9. ORDER BY
10. TOPthosw 2 queries will produce the same results, but it's not the correct way to use having
in this case, where is more effient because rows are filtered first before grouping and summing
tip: Don't use HAVING to filter individual or ungrouped rows
tip: Use WHERE to filter individual rows and HAVING for a numeric filter on grouped rows
because a numeric column that is using a HAVING filtering condition must be enclosed in an aggregate function?
note: HAVING is always after GROUP BY
Filtering and Data Interrogation
Interrogation after SELECT
Here we're going to discuss the effect of Interrogation over performance processing order after select
SELECT * is very bad for performance, always select the columns you need.
SELECT * in joins returns duplicates of joining columns
and output so you need to explicitly state the columns using aliases
_> actually there's no top of bottom, with ordering we can sort the data desc or asc then choose the top, so we could the choice the top bottoms or the top tops!
tip:
ORDER BYworks well for interrogating data but slows performance in a query
Managing duplicates
DISTINCT() and UNION() and their potential effects on performance
-> are there any differences between those 2 queries?
-> the answer is 'No', they're pretty similar
_> if we now for sure there's on duplicates in the table, then use UNION ALL because It didn't use a sorting algorithms that slow the performance!
Sub-queries and presence or absence
There are 3 places to use sub-query with [1] FROM the result of this query show seven customers had to wait 35 days or more for their order to be shipped [2] WHERE the result of this query show three customers having made orders with a freight weight of more than 800 kilograms [3] SELECT
Types of Sub-queries
[1] Uncorrleated
[2] Correlated
tip: correlated sub-queries can be quite inefficent because the sub-query executes for each row in the outer query
tip: correlated sub-queries can be replicated using inner join
consider using inner join, which only makes one pass through the data!
Practice
You want a query that returns the region and countries that have experienced earthquakes centered at a depth of 400km or deeper. Your query will use the Earthquakes table in the sub-query, and Nations table in the outer query.
A friend is working on a project looking at earthquake hazards around the world. She requires a table that lists all countries, their continent and the average magnitude earthquake by country. This query will need to access data from the Nations and Earthquakes tables
You want to find out the 2017 population of the biggest city for every country in the world. You can get this information from the Earthquakes database with the Nations table as the outer query and Cities table in the sub-query.
create this using: [1] correlated sub-query
[2] inner join
Presence and Absence
Checking the presence or absence of data is common task of a data scientist.
INTERSECT and EXCEPT
INTERSECT is one of the easier and more intuitive methods used to check if data in one table is present in another
EXCEPT does the opposite of INTERSECT. It is used to check if data, present in one table, is absent in another.
Practice
You want to know which, if any, country capitals are listed as the nearest city to recorded earthquakes. You can get this information by comparing the Nations table with the Earthquakes table.
You want to know which countries have no recorded earthquakes. You can get this information by comparing the Nations table with the Earthquakes table.
Alternative methods
EXISTS filter the outer query when there is a match of data between the outer query and a sub-query It either evaluates for true or false on a match,
IN works the similar way with sub-queries, however we need to specifiy the columns, to match on, in the WHERE filter condition of the outer query, and the SELECT statement of the sub-query
EXISTS VS IN They are both the same, but with small difference the sub-query with EXISTS looks like a corrleated sub-query which we said previously can be inefficent (technically it's a correlated sub-query )
BUT,
EXISTSwilll stop searching the sub-query when the condition is TRUEINcollects all the results from a sub-query before passing to the outer query! which is potentially slowertip: consider using
EXISTSinstead ofINwith a sub-query
NOT EXIST and NOT IN are the opposite, they check the presence of data in one table that is absent in another!
NOT IN and NULLS there's one major issue using NOT IN with null values
if the columns in the sub-query being evaluated for a non-match, contain null values, no result are returned
to handle this behaviour, the sub-query needs IS NOT NULL
tip: consider using
not existsinstead ofnot inbecause of the nullish value misbehaviour!
Practice
You want to know which country capitals have never been the closest city to recorded earthquakes. You decide to use NOT IN to compare Capital from the Nations table, in the outer query, with NearestPop, from the Earthquakes table, in a sub-query.
Alternative methods 2
inner join and left outer join
An exclusive LEFT OUTER JOIN can be used to check for the presence of data in one table that is absent in another table.
Adv and disadv of all methods
Query performance tuning
in this section we're going to discuss:
how to quantify and measure performance
some performance techniques to use for performance tunning
Time statistics
statistics time -> from slides
cpu time: slides
elapsed time; total duration of the query from Execution to returning the complete results back to us
Elapsed time vs cpu time
Page read statistics
Another way of measureing performace is by measuring the amount of
the output will be look like
we've read 32 data pages from memory to complete the query!
Awesome! Logical reads are a measure of the number of the 8-kilobyte pages read from memory to process and return the results of your query. In general, the more pages that need to be read the slower your query will run.
indexes
Indexes are structures added to the table to improve speed of accessing data from a table.
used to locate data quickly without having to scan the entire table. this makes them particulary useful for improving performance of Queries with filter conditions they are applied to a table columns and can be addd at any time.
the two most common table indexes are Clustered and Nonclustered
Clusterd and Nonclustered Indexes
Clusterd Index: A good analogy for a clustered index is a dictionary where words are stored "alphabetically".
Clustered indexes reducee the number of data page reads by a query which helps speed up search operations.
A table can only contain one Clustered index.
Non Clustered Index: A good analogy is a text book with an index at the back
Data in the book is unordered and the index at the back indicated the page number containing a search condition.
Another layer in the index structure contains ordered pointers to the data pages.
A table can contain more than one Nonclustered index.
It's common used to improve table insert and update operations.
Clustered Index: B-tree structure
A clustred index creates what is called a B-tree structure on a table.
Root node
branches the nodes
leaves
[-] The root node: the root node contains ordered pointers to branch nodes which in turn contains orderd pointers to page nodes.
the page node level contains all the 8 kilobyte data pages from the table with the data physically ordered by the columns(s) with the clustred index.
Great job! A clustered index will reduce the number of logical reads because the index will direct the query to the table data pages that meet the filter condition. Without a clustered index, all pages are scanned.
Execution plans
Last updated