Introduction to Analytic Functions (Part 1)

Analytic functions have been part of Oracle for a very long time now – ever since 8i back in 1999. Analytic functions are an ANSI/ISO standard, and so you’ll find that they are similarly-implemented across a number of compliant databases. (This SQL Server article on “window functions” from sister site, Simple Talk, could very well have been talking about Oracle.) Analytic

Analytic functions have been part of Oracle for a very long time now – ever since 8i back in 1999. Analytic functions are an ANSI/ISO standard, and so you’ll find that they are similarly-implemented across a number of compliant databases. (This SQL Server article on “window functions” from sister site, Simple Talk, could very well have been talking about Oracle.)

Analytic functions extend the power of SQL, enabling you carry out rather complex analysis of your data alongside your plain queries. There is nothing that analytic functions do that you could not achieve with complex self-joins and subqueries or PL/SQL. But you’ll find that often with a single line of an analytic function you can save yourself a hundred lines of code and a migraine.

But I need to be clearer. And a good place to start is with aggregate functions. Aggregate functions – like MAX, AVG, COUNT, SUM and the ilk – are great for presenting a summary view of your data in a single row or, if paired with a GROUP BY clause, in a relatively small number of records. But if all we had was aggregate functions we would constantly be forced to make a choice: view the raw data, or view the aggregated summaries?

Analytic functions, on the other hand, allow us to have our cake and eat it too. Using analytic functions we can, over multiple rows, view our data and pair it with a running analysis. To use a sports analogy, the difference between analytic functions and aggregate functions is the difference between watching the game or simply reading the match report.

But that’s enough about eating cake and watching football; I should give you an example.

Here’s an ordinary, everyday aggregate function that tells us the average salary at our small company:

AVG(SAL)
2073.21

However, if we asked an aggregate function to show us how the average salary has changed over the years with each new member of staff it would probably burst into tears and run off to suck its thumb in the corner. We need an analytic function for that.

Name JOB HIREDATE SALARY RUNNING_AVERAGE
SMITH CLERK 12/17/1980 800 800
ALLEN SALESMAN 2/20/1981 1600 1200
WARD SALESMAN 2/22/1981 1250 1216.67
JONES MANAGER 4/2/1981 2975 1656.25
BLAKE MANAGER 5/1/1981 2850 1895
CLARK MANAGER 6/9/1981 2450 1987.5
TURNER SALESMAN 9/8/1981 1500 1917.86
MARTIN SALESMAN 9/28/1981 1250 1834.38
KING PRESIDENT 11/17/1981 5000 2186.11
FORD ANALYST 12/3/1981 3000 2147.73
JAMES CLERK 12/3/1981 950 2147.73
MILLER CLERK 1/23/1982 1300 2077.08
SCOTT ANALYST 12/9/1982 3000 2148.08
ADAMS CLERK 1/12/1983 1100 2073.21

Let’s dissect our analytic function and see what it’s made of. Here’s the general syntax:

The OVER keyword is mandatory; however, the other ingredients of the analytic function soup – PARTITION BY, ORDER BY, and the window clause – are powerful but (in many cases) optional. They tell the analytic function how to slice and dice the results. If, in our query above, we left the ORDER BY out, we’d get a very different result.

Name JOB HIREDATE SALARY AVERAGE
SMITH CLERK 12/17/1980 800 2073.21
ALLEN SALESMAN 2/20/1981 1600 2073.21
WARD SALESMAN 2/22/1981 1250 2073.21
JONES MANAGER 4/2/1981 2975 2073.21
BLAKE MANAGER 5/1/1981 2850 2073.21
CLARK MANAGER 6/9/1981 2450 2073.21
TURNER SALESMAN 9/8/1981 1500 2073.21
MARTIN SALESMAN 9/28/1981 1250 2073.21
KING PRESIDENT 11/17/1981 5000 2073.21
FORD ANALYST 12/3/1981 3000 2073.21
JAMES CLERK 12/3/1981 950 2073.21
MILLER CLERK 1/23/1982 1300 2073.21
SCOTT ANALYST 12/9/1982 3000 2073.21
ADAMS CLERK 1/12/1983 1100 2073.21

Rather than give us the running average over time that we’re looking for, it has returned the current average against each row. And while that might be exactly what you want in some circumstances, it does illustrate the importance of those three options. So let’s take a look at them.

ORDER BY

You will have noticed that we had 2 order by statements in our query. The order by used by the analytic function is completely separate from that of the main query, and can be slightly – or totally – different.

The syntax is, as you may expect:

As I suggested earlier, the analytical ORDER BY clause is often optional, but not always so. For certain functions it is crucial. Take the LEAD and LAG functions, for example. Unlike AVG from our previous examples, these functions aren’t aggregate functions moonlighting as analytic functions. They are purely analytic functions, and cannot be used in any other context.

The LEAD function returns the value based on a row at an offset from the current row. Or, in plainspeak, it enables you return a value based on a record one (or more) rows after the current one.

The LAG function peers in the opposite direction from the LEAD function. And since their very existence is based on the values of other rows, the ORDER BY clause is mandatory when using them.

If, for example, you wish to view not only staff salaries, but also the salary of the person hired after them.

ENAME JOB HIREDATE SAL Next Salary
SMITH CLERK 12/17/1980 800 1600
ALLEN SALESMAN 02/20/1981 1600 1250
WARD SALESMAN 02/22/1981 1250 2975
JONES MANAGER 04/02/1981 2975 2850
BLAKE MANAGER 05/01/1981 2850 2450

Other functions for which the order by clause is mandatory include: FIRST, FIRST VALUE, LAST, LAST VALUE, ROW_NUMBER, RANK and DENSE_RANK. We’ll talk about them in the next in this series of articles.

PARTITION BY

Aaargh, I’ve just realised that I’m doing this all wrong, in totally the wrong order. That’s because, even though the ORDER BY and the PARTITION BY clauses are (often) optional, if they are used together the partition by must come before the order by.

Someone must have bought a new thesaurus on the day they named the partition by clause. Because, more or less, it’s just a way to group the data used by the analytic function, to carve up the recordset.

If, for example, you weren’t interested in the average salary across the company, but wanted to compare each employee’s salary to the average salary within her department, then you’ll have call to partition the data used by your analytic function.

ENAME JOB HIREDATE SAL DEPTNO Dept Average
CLARK MANAGER 06/09/1981 2450 10 2916.67
MILLER CLERK 01/23/1982 1300 10 2916.67
KING PRESIDENT 11/17/1981 5000 10 2916.67
FORD ANALYST 12/03/1981 3000 20 2175
SCOTT ANALYST 12/09/1982 3000 20 2175
JONES MANAGER 04/02/1981 2975 20 2175
SMITH CLERK 12/17/1980 800 20 2175
ADAMS CLERK 01/12/1983 1100 20 2175

However, we can slice and dice the data even further by combining the partition by clause with the order by clause. And so, to find out how the average salary changed by department with each new hire, we’d need a query along these lines:

ENAME JOB HIREDATE SAL DEPTNO Dept Average
CLARK MANAGER 06/09/1981 2450 10 2450
KING PRESIDENT 11/17/1981 5000 10 3725
MILLER CLERK 01/23/1982 1300 10 2916.67
SMITH CLERK 12/17/1980 800 20 800
JONES MANAGER 04/02/1981 2975 20 1887.5
FORD ANALYST 12/03/1981 3000 20 2258.33
SCOTT ANALYST 12/09/1982 3000 20 2443.75
ADAMS CLERK 01/12/1983 1100 20 2175

A second example might make the power of the partition by-order by pairing more obvious. If, for example, you wanted to list all the employees in the company alongside the name of the person hired in their department right before them, you might use the following:

ENAME HIREDATE DEPTNO Previous Hire
CLARK 06/09/1981 10
KING 11/17/1981 10 CLARK
MILLER 01/23/1982 10 KING
SMITH 12/17/1980 20
JONES 04/02/1981 20 SMITH
FORD 12/03/1981 20 JONES
SCOTT 12/09/1982 20 FORD
ADAMS 01/12/1983 20 SCOTT
Windowing

Windowing is a way to further sub-partition the recordset operated upon by an analytic function. Look at it this way: without the partition by clause, we now know that the analytic function will be based upon the complete recordset returned by our query; with the partition by clause, it is only based on a subset of that data. Well, with the windowing clause we can tell it to only look at a subset of that subset.

I know what I’ll do. I’ll give you an example that shows what the windowing clause does. And then afterwards we can talk a little more about its syntax and rules.

Let’s say we wanted to know the sum of each employee’s salary and those of everyone hired within his department as long as they were hired no more than 60 days earlier than him. I don’t know why we’d want to know that, but let’s pretend that we do. ‘Cos that’s where windowing comes in.

ENAME HIREDATE SAL DEPTNO Total Salary Total salary by dept Total dept salary by date Sum dept salary over 60 days
CLARK 06/09/1981 2450 10 29025 8750 2450 2450
KING 11/17/1981 5000 10 29025 8750 7450 5000
MILLER 01/23/1982 1300 10 29025 8750 8750 1300
SMITH 12/17/1980 800 20 29025 10875 800 800
JONES 04/02/1981 2975 20 29025 10875 3775 2975
FORD 12/03/1981 3000 20 29025 10875 6775 3000
SCOTT 12/09/1982 3000 20 29025 10875 9775 3000
ADAMS 01/12/1983 1100 20 29025 10875 10875 4100
ALLEN 02/20/1981 1600 30 29025 9400 1600 1600
WARD 02/22/1981 1250 30 29025 9400 2850 2850
BLAKE 05/01/1981 2850 30 29025 9400 5700 2850
TURNER 09/08/1981 1500 30 29025 9400 7200 1500
MARTIN 09/28/1981 1250 30 29025 9400 8450 2750
JAMES 12/03/1981 950 30 29025 9400 9400 950

Windows are dynamic in nature, and are always relative to the current row. In our example above, the window is always 60 days prior to the current hire date.

However, RANGE … PRECEDING is not the only way we can employ windows. Indeed the available syntax options can seem as jumbled as a tangle of spaghetti as this image from the Oracle Docs site shows.

Windowing Clause

However, here’s what you need to know.

  • Your window can be based upon ROWS or a RANGE. If your window is based on ROWS, this means that it is determined by the rows in your query’s partition; RANGE means it is based on a logical offset. In our example, the range was 60 days.
  • UNBOUNDED PRECEDING says your window starts with the very first row of your partition. UNBOUNDED FOLLOWING says it goes all the way to the end of the partition.
  • CURRENT ROW means … well, it’s pretty obvious, right? I won’t insult your intelligence by telling you.

Having said all of that, what’s actually interesting is the value_expr. Outside of UNBOUNDED PRECEDING|FOLLOWING and CURRENT ROW, this is the main way we express the size of our window.

If your window is based on ROWS, the value_expr must be a positive number; for RANGE, it can also be an interval literal. When using a RANGE, as in our example, you can only have one expression in your ORDER BY as your value_expr will reference it. (In our example, RANGE 60 PRECEDING knew our window was 60 days wide because our partition was ordered by a date, hiredate.)

Restrictions

There are a few things that analytic functions cannot do.

  1. Analytic functions are computed after the main query’s FROM, WHERE and GROUP BY clauses. This means, regrettably, that they cannot be part of your WHERE clause, and must live in the SELECT. Obviously, there is nothing to stop you from plugging them into your WHERE clause as part of a subquery or a WITH statement, but that’s the only way around this restriction.
  2. You cannot nest analytic functions within another analytic function. So stop trying to be too clever.
  3. You cannot use column aliases or column positions in an analytic function’s order by clause.
Conclusion

Could you survive without analytic functions? Yes, totally. It’s the same way you could survive without chocolate or wine. You could – but why would you want to?

As I said at the start of this article, you could achieve every result that an analytic function will give you with the clever use of self-joins and subqueries. But it’ll be cumbersome, difficult and inefficient.

In the next part of this series, we will talk about the various analytic functions and how to use them.