07 September 2015

36356 views

46 0

Introduction to Analytic Functions (Part 1)

Analytic functions have been part of Oracle for a very long time now – ever since 8i back in 1999. Analytic functions are an ANSI/ISO standard, and so you’ll find that they are similarly-implemented across a number of compliant databases. (This SQL Server article on “window functions” from sister site, Simple Talk, could very well have been talking about Oracle.) Analytic

Analytic functions extend the power of SQL, enabling you carry out rather complex analysis of your data alongside your plain queries. There is nothing that analytic functions do that you could not achieve with complex self-joins and subqueries or PL/SQL. But you’ll find that often with a single line of an analytic function you can save yourself a hundred lines of code and a migraine.

But I need to be clearer. And a good place to start is with aggregate functions. Aggregate functions – like MAX, AVG, COUNT, SUM and the ilk – are great for presenting a summary view of your data in a single row or, if paired with a GROUP BY clause, in a relatively small number of records. But if all we had was aggregate functions we would constantly be forced to make a choice: view the raw data, or view the aggregated summaries?

Analytic functions, on the other hand, allow us to have our cake and eat it too. Using analytic functions we can, over multiple rows, view our data and pair it with a running analysis. To use a sports analogy, the difference between analytic functions and aggregate functions is the difference between watching the game or simply reading the match report.

But that’s enough about eating cake and watching football; I should give you an example.

Here’s an ordinary, everyday aggregate function that tells us the average salary at our small company:

1 2	SELECT AVG(sal) FROM emp;

AVG(SAL)

2073.21

However, if we asked an aggregate function to show us how the average salary has changed over the years with each new member of staff it would probably burst into tears and run off to suck its thumb in the corner. We need an analytic function for that.

SELECT ename, job, hiredate, sal,

AVG(sal) OVER (ORDER BY hiredate) AS running_average

FROM emp

ORDER BY hiredate;

Name	JOB	HIREDATE	SALARY	RUNNING_AVERAGE
SMITH	CLERK	12/17/1980	800	800
ALLEN	SALESMAN	2/20/1981	1600	1200
WARD	SALESMAN	2/22/1981	1250	1216.67
JONES	MANAGER	4/2/1981	2975	1656.25
BLAKE	MANAGER	5/1/1981	2850	1895
CLARK	MANAGER	6/9/1981	2450	1987.5
TURNER	SALESMAN	9/8/1981	1500	1917.86
MARTIN	SALESMAN	9/28/1981	1250	1834.38
KING	PRESIDENT	11/17/1981	5000	2186.11
FORD	ANALYST	12/3/1981	3000	2147.73
JAMES	CLERK	12/3/1981	950	2147.73
MILLER	CLERK	1/23/1982	1300	2077.08
SCOTT	ANALYST	12/9/1982	3000	2148.08
ADAMS	CLERK	1/12/1983	1100	2073.21

Let’s dissect our analytic function and see what it’s made of. Here’s the general syntax:

1	FUNCTION(args) OVER ([PARTITION BY <…>] [ORDER BY <…>][window_clause])

The OVER keyword is mandatory; however, the other ingredients of the analytic function soup – PARTITION BY, ORDER BY, and the window clause – are powerful but (in many cases) optional. They tell the analytic function how to slice and dice the results. If, in our query above, we left the ORDER BY out, we’d get a very different result.

SELECT ename, job, hiredate, sal,

AVG(sal) OVER () "average"

FROM emp

ORDER BY hiredate;

Name	JOB	HIREDATE	SALARY	AVERAGE
SMITH	CLERK	12/17/1980	800	2073.21
ALLEN	SALESMAN	2/20/1981	1600	2073.21
WARD	SALESMAN	2/22/1981	1250	2073.21
JONES	MANAGER	4/2/1981	2975	2073.21
BLAKE	MANAGER	5/1/1981	2850	2073.21
CLARK	MANAGER	6/9/1981	2450	2073.21
TURNER	SALESMAN	9/8/1981	1500	2073.21
MARTIN	SALESMAN	9/28/1981	1250	2073.21
KING	PRESIDENT	11/17/1981	5000	2073.21
FORD	ANALYST	12/3/1981	3000	2073.21
JAMES	CLERK	12/3/1981	950	2073.21
MILLER	CLERK	1/23/1982	1300	2073.21
SCOTT	ANALYST	12/9/1982	3000	2073.21
ADAMS	CLERK	1/12/1983	1100	2073.21

Rather than give us the running average over time that we’re looking for, it has returned the current average against each row. And while that might be exactly what you want in some circumstances, it does illustrate the importance of those three options. So let’s take a look at them.

ORDER BY

You will have noticed that we had 2 order by statements in our query. The order by used by the analytic function is completely separate from that of the main query, and can be slightly – or totally – different.

The syntax is, as you may expect:

1	ORDER BY <expression> [ASC\|DESC] NULLS [FIRST\|LAST]

As I suggested earlier, the analytical ORDER BY clause is often optional, but not always so. For certain functions it is crucial. Take the LEAD and LAG functions, for example. Unlike AVG from our previous examples, these functions aren’t aggregate functions moonlighting as analytic functions. They are purely analytic functions, and cannot be used in any other context.

The LEAD function returns the value based on a row at an offset from the current row. Or, in plainspeak, it enables you return a value based on a record one (or more) rows after the current one.

The LAG function peers in the opposite direction from the LEAD function. And since their very existence is based on the values of other rows, the ORDER BY clause is mandatory when using them.

If, for example, you wish to view not only staff salaries, but also the salary of the person hired after them.

SELECT ename, job, hiredate, sal,

LEAD (sal) OVER (order by hiredate ) "Next Salary"

FROM emp

ORDER BY hiredate;

ENAME	JOB	HIREDATE	SAL	Next Salary
SMITH	CLERK	12/17/1980	800	1600
ALLEN	SALESMAN	02/20/1981	1600	1250
WARD	SALESMAN	02/22/1981	1250	2975
JONES	MANAGER	04/02/1981	2975	2850
BLAKE	MANAGER	05/01/1981	2850	2450

Other functions for which the order by clause is mandatory include: FIRST, FIRST VALUE, LAST, LAST VALUE, ROW_NUMBER, RANK and DENSE_RANK. We’ll talk about them in the next in this series of articles.

PARTITION BY

Aaargh, I’ve just realised that I’m doing this all wrong, in totally the wrong order. That’s because, even though the ORDER BY and the PARTITION BY clauses are (often) optional, if they are used together the partition by must come before the order by.

Someone must have bought a new thesaurus on the day they named the partition by clause. Because, more or less, it’s just a way to group the data used by the analytic function, to carve up the recordset.

If, for example, you weren’t interested in the average salary across the company, but wanted to compare each employee’s salary to the average salary within her department, then you’ll have call to partition the data used by your analytic function.

1 2	SELECT ename, job, hiredate, sal, deptno, AVG(sal) OVER (PARTITION BY deptno) AS "Dept Average" FROM emp;

ENAME	JOB	HIREDATE	SAL	DEPTNO	Dept Average
CLARK	MANAGER	06/09/1981	2450	10	2916.67
MILLER	CLERK	01/23/1982	1300	10	2916.67
KING	PRESIDENT	11/17/1981	5000	10	2916.67
FORD	ANALYST	12/03/1981	3000	20	2175
SCOTT	ANALYST	12/09/1982	3000	20	2175
JONES	MANAGER	04/02/1981	2975	20	2175
SMITH	CLERK	12/17/1980	800	20	2175
ADAMS	CLERK	01/12/1983	1100	20	2175

However, we can slice and dice the data even further by combining the partition by clause with the order by clause. And so, to find out how the average salary changed by department with each new hire, we’d need a query along these lines:

SELECT ename, job, hiredate, sal, deptno,

AVG(sal) OVER (PARTITION BY deptno ORDER BY hiredate) AS "Dept Average"

FROM emp

ORDER BY hiredate;

ENAME	JOB	HIREDATE	SAL	DEPTNO	Dept Average
CLARK	MANAGER	06/09/1981	2450	10	2450
KING	PRESIDENT	11/17/1981	5000	10	3725
MILLER	CLERK	01/23/1982	1300	10	2916.67
SMITH	CLERK	12/17/1980	800	20	800
JONES	MANAGER	04/02/1981	2975	20	1887.5
FORD	ANALYST	12/03/1981	3000	20	2258.33
SCOTT	ANALYST	12/09/1982	3000	20	2443.75
ADAMS	CLERK	01/12/1983	1100	20	2175

A second example might make the power of the partition by-order by pairing more obvious. If, for example, you wanted to list all the employees in the company alongside the name of the person hired in their department right before them, you might use the following:

SELECT ename, hiredate, deptno,

LAG (ename) OVER (PARTITION BY deptno ORDER BY hiredate) AS "Previous Hire"

FROM emp

ORDER BY deptno, hiredate;

ENAME	HIREDATE	DEPTNO	Previous Hire
CLARK	06/09/1981	10
KING	11/17/1981	10	CLARK
MILLER	01/23/1982	10	KING
SMITH	12/17/1980	20
JONES	04/02/1981	20	SMITH
FORD	12/03/1981	20	JONES
SCOTT	12/09/1982	20	FORD
ADAMS	01/12/1983	20	SCOTT

Windowing

Windowing is a way to further sub-partition the recordset operated upon by an analytic function. Look at it this way: without the partition by clause, we now know that the analytic function will be based upon the complete recordset returned by our query; with the partition by clause, it is only based on a subset of that data. Well, with the windowing clause we can tell it to only look at a subset of that subset.

I know what I’ll do. I’ll give you an example that shows what the windowing clause does. And then afterwards we can talk a little more about its syntax and rules.

Let’s say we wanted to know the sum of each employee’s salary and those of everyone hired within his department as long as they were hired no more than 60 days earlier than him. I don’t know why we’d want to know that, but let’s pretend that we do. ‘Cos that’s where windowing comes in.

SELECT ename, hiredate, sal, deptno,

SUM(sal) OVER () AS "Total Salary",

SUM(sal) OVER (PARTITION BY deptno) AS "Total salary by dept",

SUM(sal) OVER (PARTITION BY deptno ORDER BY hiredate) AS "Total dept salary by date",

SUM(sal) OVER (PARTITION BY deptno ORDER BY hiredate RANGE 60 PRECEDING) AS "Sum dept salary over 60 days"

FROM emp

ORDER BY deptno, hiredate;

ENAME	HIREDATE	SAL	DEPTNO	Total Salary	Total salary by dept	Total dept salary by date	Sum dept salary over 60 days
CLARK	06/09/1981	2450	10	29025	8750	2450	2450
KING	11/17/1981	5000	10	29025	8750	7450	5000
MILLER	01/23/1982	1300	10	29025	8750	8750	1300
SMITH	12/17/1980	800	20	29025	10875	800	800
JONES	04/02/1981	2975	20	29025	10875	3775	2975
FORD	12/03/1981	3000	20	29025	10875	6775	3000
SCOTT	12/09/1982	3000	20	29025	10875	9775	3000
ADAMS	01/12/1983	1100	20	29025	10875	10875	4100
ALLEN	02/20/1981	1600	30	29025	9400	1600	1600
WARD	02/22/1981	1250	30	29025	9400	2850	2850
BLAKE	05/01/1981	2850	30	29025	9400	5700	2850
TURNER	09/08/1981	1500	30	29025	9400	7200	1500
MARTIN	09/28/1981	1250	30	29025	9400	8450	2750
JAMES	12/03/1981	950	30	29025	9400	9400	950

Windows are dynamic in nature, and are always relative to the current row. In our example above, the window is always 60 days prior to the current hire date.

However, RANGE … PRECEDING is not the only way we can employ windows. Indeed the available syntax options can seem as jumbled as a tangle of spaghetti as this image from the Oracle Docs site shows.

However, here’s what you need to know.

Your window can be based upon ROWS or a RANGE. If your window is based on ROWS, this means that it is determined by the rows in your query’s partition; RANGE means it is based on a logical offset. In our example, the range was 60 days.
UNBOUNDED PRECEDING says your window starts with the very first row of your partition. UNBOUNDED FOLLOWING says it goes all the way to the end of the partition.
CURRENT ROW means … well, it’s pretty obvious, right? I won’t insult your intelligence by telling you.

Having said all of that, what’s actually interesting is the value_expr. Outside of UNBOUNDED PRECEDING|FOLLOWING and CURRENT ROW, this is the main way we express the size of our window.

If your window is based on ROWS, the value_expr must be a positive number; for RANGE, it can also be an interval literal. When using a RANGE, as in our example, you can only have one expression in your ORDER BY as your value_expr will reference it. (In our example, RANGE 60 PRECEDING knew our window was 60 days wide because our partition was ordered by a date, hiredate.)

Restrictions

There are a few things that analytic functions cannot do.

Analytic functions are computed after the main query’s FROM, WHERE and GROUP BY clauses. This means, regrettably, that they cannot be part of your WHERE clause, and must live in the SELECT. Obviously, there is nothing to stop you from plugging them into your WHERE clause as part of a subquery or a WITH statement, but that’s the only way around this restriction.
You cannot nest analytic functions within another analytic function. So stop trying to be too clever.
You cannot use column aliases or column positions in an analytic function’s order by clause.

Conclusion

Could you survive without analytic functions? Yes, totally. It’s the same way you could survive without chocolate or wine. You could – but why would you want to?

As I said at the start of this article, you could achieve every result that an analytic function will give you with the clever use of self-joins and subqueries. But it’ll be cumbersome, difficult and inefficient.

In the next part of this series, we will talk about the various analytic functions and how to use them.

Introduction to Analytic Functions (Part 1)

ORDER BY

PARTITION BY

Windowing

Restrictions

Conclusion

Subscribe for more articles

Rate this article

David Njoku

Related articles

A Guide to INSERT, UPDATE, and DELETE Statements in Oracle