The pitfalls of O(N)
This article implements the Beginner pattern.
O(N) is a lump term for asymptotic notations (such as o(N) and T(n)), used to characterize the time necessary to run an algorithm (or the space occupied by a data structure) with respect to an independent variable N representing the scale of the data.
Depending on your field, you may encounter every day algorithmic problems, or other optimizations: design and related maintainability, or user experience. Computational complexity is a very little aspect that is focused on because it's easy to measure, but not all that counts can be counted.
O(N) is interesting to study theoretical properties of algorithm, and can explain why the default sorting algorithms proposed by languages and libraries are mergesort and quicksort. However, it is far from a sufficient metric to evaluate performance and in this article I will list some of the pitfalls of this notation.
Asymptotic notations are produced by taking the limit of some expressions as N goes to infinity, to eliminate lower-order terms. However, there is no infinite N in real systems, and the number of items an algorithm treats is often a number in the dozens or hundreds.
When N is small or even bounded by a number in the hundreds, O(N) considerations aren't usually important, unless you're considering operations research problems such as the traveling salesman, whose complexity raises with O(N!) for naive algorithms.
Moreover, since N is not infinite, there may be constants attached to the complexity expressions that are more important (up to a point) with respect to the variable terms. An estimation of 10000*N is heavier than 10*N^2 until N reaches 1000 or more.
Whenever we are treating larger amounts of data, we are usually putting them into a relational or NoSQL database for persistence and optimization (saving main memory while sacrificing access speed.)
The database itself, when indexes are used correctly, already has its own data structure and is capable of performing optimizations. For example, sorting speed consideration for a ORDER BY query on an indexed field are not important as the index is already a ordered data structure; only the incorrect use of indexes can get us to an operation that is more than O(N) in time.
O(n) calculations should always be accompanied by a probability distribution, since an implementation is going to encounter not only one input of a particular size in its lifecycle, but a large number of them that may correspond to different performances. O(n) may refer to:
- the worst case (the single input that corresponds to the biggest time or space usage)
- the medium case (more significant)
- the best case (important when it happens relatively often, such as in the case of almost-sorted data)
A famous example is the simplex algorithm, the most diffused solution to linear programming problems. It is a non-polynomial algorithm in the worst case, yet it beats other polynomial algorithm that solve the same problems almost every time. The reason is the worst case for the simplex algorithm is a pathological case that has to be built in order to fool the algorithm; this pathological cases have no impact or real occurrence whatsoever: the irony of a non-polynomial algorithm being the most performant one in all practical cases.
So the worst and best cases, along with the spectrum between them, should be weighed by probabilities just like we weigh potential changes in design. We usually don't value database or SQL abstraction layers in some products because we estimate a change in the database as being not possible; the same goes for certain kinds of inputs to our implementations of known algorithms.
In everyday work, profiling is the only way to find out where performance problems, not focusing on implementing efficient algorithms. Even while building more complex performance models such as queuing network, the parameters of the model must be measured on real machines.
- What is the iowait of your machines under a reproducible workload? Are network calls?
- Is there a part of it that is CPU bound?
- Have you considered the overhead of your language?
It is hypocrite to propose both to measure O(n) complexity for performance's sake and to allow RMI or other fake-transparent network calls on the other side. Often performance problems come from these forms of technical debt and complex solutions, not from problematic algorithms.
Thus, when you are choosing an algorithm consider its theoretical properties; but shorten the feedback loop and experiment with your implementation on real data as soon as possible to contain the risk of performance problems down the way.