How to learn Analytic or Windows functions in MySQL for data science – Part 1?
Data Science is in too much Hype these days. Every organization is now a day looking for data scientists. The Data science role includes the collection, retrieval, analyzing, and representation of data in pictorial format, including statistical information to the users.
The need for data scientists has been rise due to an increase in the use of the internet. Every layman use data these days. Which causes an increase in data in zillions and trillions.
So, here are some most useful and important Nonaggregate Windows or Analytic functions MySql functions.
- Nonaggregate Windows functions
- FIRST_VALUE() will show the first value from the first row of the window frame.
- LAST_VALUE() will show the last value from the last row of the window frame.
- NTH_VALUE() will show the Nth-row value from the window frame.
- CUME_DIST() Cumulative distribution value
- DENSE_RANK() Rank of current row within its partition, without gaps
- LAG() Value of argument from row lagging current row within partition
- LEAD() Value of argument from row leading current row within partition
- NTILE() Bucket number of current row within its partition.
- PERCENT_RANK() Percentage rank value
- RANK() Rank of current row within its partition, with gaps
- ROW_NUMBER() Number of current row within its partition
Windows function Introduction
As we have seen the aggregate grouped functions of MySQL. There are some windows functions available in MySQL to perform calculations on set of rows, e.g SUM() or MIN() or MAX(). Although this grouped aggregate function converts the result of multiple rows in a single row, a window aggregate function applies aggregation on every row from a query, keeping every row intact.
See the below example to check the difference between the two SELECTs statement:
In the first select statement, we have a grouped aggregate but no GROUP BY clause, we have an implicit group containing all rows. The values of order_amount get summed up for the group, and we get a value of 16761.5003 as a result row.
In the second select statement, as you can see, every row of the table appeared in the output, but every record contained the value of the sum of all rows.
The vital difference is the usage of the OVER () syntax after the SUM(order_amount)function. The keyword OVER() shows that it is a window function, as opposed to a grouped aggregate function.
There is empty parentheses after OVER is a window specification.
In this simple example, it is empty. This means default to aggregating the window function over all rows in the result set, so as for the grouped aggregate, we get the value 16761.5003 returned from the window function calls.
Example 2 :
See the difference between both queries with the help of AVG() aggregate function.
In the first select statement, we have a grouped aggregate but no GROUP BY clause, we have an implicit group containing all rows. The values of order_amount get average up for the group, and we get a value of 1676.15003357 as a result row.
In the second select statement, as you can see, every row of the table appeared in the output, but every record has contained the value of the AVERAGE of all rows.
In this sense, a window function can be thought of as just another SQL function, except that its value is based on the value of other rows in addition to the values of the for which it is called, i.e. they function as a window into other rows.
Window functions come in three flavors:
- Aggregate functions i.e SUM(), MIN(), MAX(), COUNT() and AVG().
- Nonaggregate Ranking functions i.e RANK(), DNSE_RANK(), ROW_NUMBER(), NTILE() AND PERCENT_RANK().
- Nonaggregate Value Window functions i.e LAG(), LEAD(), FIRST_VALUE(), LAST_VALUE(), NTH_VALUE(), CUME_DIST().
In this article, we will learn the above nonaggregate window functions. In this for each row from a query, perform a calculation using rows related to that row.
Each window operation in the query is signified by the inclusion of an OVER clause. It helps in how to divide query rows into groups for processing by the window function
Some window functions permit a null_treatment clause that specifies how to handle NULL values when calculating results. This clause is optional. It is part of the SQL standard, but the MySQL implementation permits only RESPECT NULLS (which is also the default). This means that NULL values are considered when calculating results. IGNORE NULLS is parsed but produces an error.
FIRST_VALUE(expression) [null_treatment] over_clause
Returns the data/value of the expression from the first row of the window frame.
Each window operation in the query is signified by the inclusion of an OVER clause. It helps in how to divide query rows into groups for processing by the window function
The following query demonstrates how to use FIRST_VALUE() windows function
Example 1:
mysql> select client_firstname, client_city, order_amount, first_value(order_amount) over() from tbl_toysorderdetails ;
The above example will select the first value i.e 280.400 from the window frame. We have used the over clause with it because it will not work without this clause.
Example 2:
The above example will first apply for the ORDER BY DESC clause then it is limited to fetch 5 rows only. It will sort the rows on the basis of order_amount in DESCENDING format. After this FIRST VALUE function will work. That’s why it has selected the first value i.e 4578.8901 from the window frame.
Example 3:
mysql> select client_firstname as 'Name', order_amount as 'Amount', first_value(order_amount) over(), min(order_amount) over() from tbl_toysorderdetails order by order_amount;
In the above query, we have sorted the rows in descending format for order_amount. Then we have can see both the first_value() and min() function has returned the same value.
LAST_VALUE(expression) [null_treatment] over_clause
Returns the data/value of the expression from the last row of the window frame.
Each window operation in the query is signified by the inclusion of an OVER clause. It helps in how to divide query rows into groups for processing by the window function
The following query demonstrates how to use LAST_VALUE() windows function
Example 1:
mysql> select client_firstname, client_city, order_amount, last_value(order_amount) over() from tbl_toysorderdetails ;
The above example will select the last value i.e 4500.6001 from the window frame. We have used the over clause with it because it will not work without this clause.
Example 2:
mysql> select client_firstname as 'Name', order_amount as 'Amount', last_value(order_amount) over(), max(order_amount) over() from tbl_toysorderdetails order by order_amount;
The above example will first apply to the ORDER BY clause. It will sort the rows. After this LAST VALUE function will work. You can see both functions i.e. last_value() and max() have selected the same value i.e 4578.8901 from the window frame.
Example 3 : Using PARTITION with an OVER() clause – Getting last value by CITY.
mysql> select client_firstname as 'Name', order_amount as 'Amount', client_city as 'City', last_value(order_amount) over(partition by client_city) as 'last value' from tbl_toysorderdetails order by client_city;
In the above example, we can see the rows have been sorted on the basis of client_city. And rows have been partition on its basis. OVER() clause is now contained the PARTITION clause in it, to partition the rows on a city basis. This will help in getting the last_value on the basis of the city.
As you can see in the City
Banaras, Kolkata, Mumbai has one row. So, the last_value is 560.8000, 4500.6001,4578.8901.
Delhi has 3 rows. So, the last_value is 2400.5601.
Faridabad, Lucknow has two rows with the last values 2578.8000, 160.0000 respectively.
NTH_VALUE(expression, N) [from_first_last][null_treatment] over_clause
Returns the data/value of expression from the N-th row of the window frame. N should be a literal positive integer value. If no row is found with the number then it will return NULL value.
Each window operation in the query is signified by the inclusion of an OVER clause. It helps in how to divide query rows into groups for processing by the window function
from_first_last is part of the MySQL standard, but the MySQL implementation permits only FROM FIRST (which is also the default). It refers that all calculation starts from the first row of the window. See example 1.
FROM LAST
is parsed, but this will gives an error. So, to get the same effect as FROM LAST or we can say, to start calculations from the last row of the window. We should use ORDER BY to sort the rows in reverse order. See example 2.
In MySQL 8.0.22 and later, We are not allowed to use NULL
for the row argument of this function.
Example 1: Fetching 5th and 7th order amount value using FROM FIRST.
If you have to fetch the 5th or 7th highest order_amount from the table. You can use the NTH_VALUE(column_name,5) and NTH_VALUE(column_name,7). Below query will show you the same.
mysql> select client_firstname, client_city, order_amount, nth_value(order_amount,5) over() as ‘fifth’, nth_value(order_amount,7) over() as ‘seventh’ from tbl_toysorderdetails order by order_amount;
Example 2: Fetching 5th and 7th order amount value FROM LAST, USING order by order_amount DESC.
mysql> select client_firstname, client_city, order_amount, nth_value(order_amount,5)
over() as ‘fifth’, nth_value(order_amount,7) over() as ‘seventh’ from tbl_toysorderdetails order by order_amount DESC;
Example 3:
The following sql demonstrates FIRST_VALUE(), LAST_VALUE(), and two instances of NTH_VALUE():
mysql> select client_firstname, client_city, order_amount, first_value(order_amount) over w as ‘first’, last_value(order_amount) over w as ‘last’, nth_value(order_amount,2) over w as ‘second’, nth_value(order_amount,3) over w as ‘third’ from tbl_toysorderdetails window w as (partition by client_city order by client_city);
You can see in the above example we have partitioned the rows on the basis of client city. As, We are getting rows sorted on the basis of city .
We have orders from Banaras, Delhi, Faridabad, Lucknow, etc. cities.
For City Banaras, Kolkata, and Mumbai, there is one order. So, the First value as ‘first’, Last value as ‘last’ column has the SAME value. But for the second and third columns, we have NULL values.
For City Delhi,
- The first value is 280.400.
- The last value is 2400.5601.
- For the second column, we have 500.6000.
- For the third column, we have 2400.5601.
For City Faridabad,
- The first value is 400.6000.
- The second and last value is 2578.8000.
- The third value is NULL.
For City Lucknow,
- The first value is 800.2500.
- The second and last value is 160.0000.
- The third value is NULL.
CUME_DIST() over_clause
CUME_DIST() is a window functions. It gives value to rows fluctuate from 0 to 1 i.e cumulative distribuion value within a group of values. returned cumulative distribution is between 0 < cume_DIST() <= 1.
ORDER BY clause is a must in this, to sort the rows in your expected way or format. The partition can be used to divide rows by specific needs.
Approximate formula used by the CUME_DIST()
function is written as:
ROW_NUMBER() / total_rows
Example 1 : CUME_DIST() with row_number()
mysql> select order_amount, client_city,
-> row_number() over (order by order_amount) as ‘row_number()’,
-> CUME_DIST() over (order by order_amount) as ‘cume_dist()’
-> from tbl_toysorderdetails;
The above example will calculate the cumulative distribution for all the rows. In this example, the order_amount is sorted in ascending order from 160.0000 to 4578.8901.
Note : ROW_NUMBER()
function is showing the row numbers for each row.
So, how the CUME_DIST() function performs calculation?
This function will count the total numbers of rows in the resultset.
for the first row, CUME_DIST() function divides 1 by the total number of rows which is 10: 1/10. the result is 0.1 or 10%.
for the first row, CUME_DIST() function divides 2 by the total number of rows which is 10: 2/10. the result is 0.2 or 20%.
The same things will be used in other rows to calculate CUME_DIST().
Example 2 : CUME_DIST() with Partition by Client City
mysql> select order_amount, client_city, row_number() over (PARTITION BY client_city) as ‘row_number()’ , CUME_DIST() over (PARTITION by client_city order by order_amount) as ‘cume_dist()’ from tbl_toysorderdetails;
mysql> select order_amount, client_city, row_number() over (order by
client_city) as ‘row_number()’ , CUME_DIST() over (order by client_city) as ‘cume_dist()’ from tbl_toysorderdetails;
Endnotes
Hope this article helps you bring out more from your dataset. And if you have any favorite MySql function that you find useful or use quite often, do comment below and share your experience!
Do see some important function here for data analysis in Mysql.
Some Useful aggregate and mathematical functions in MYSQL for data science
How to learn Strings and datetime analytic functions in MySql for data science?
See all MySQL String functions MySQL 8 String Functions.
See all PHP string functions PHP String Functions.