
SQL Can Be Slow -- Why Do People Doubt This?

10.01.2010
Here's a typical problem that results from "SQL Hegemony" -- all data must be in a database, and all access must be via SQL. This can also be called the "SQL Fetish" school of programming.


War Story. On a Data Warehousing project, we had to load and process the organizational hierarchy. SQL doesn't do hierarchies well because they can (and should) involve a join of indefinite depth. One of the DBAs wanted to use a "pure SQL" traversal of the hierarchy.

My opinion was that it was a waste of code. We were writing Java programs. We could -- trivially -- fetch the entire tree into Java objects and work with the hierarchy as a hierarchy.
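The alternative is simple enough to sketch. Here is a minimal Python version of the idea (the project used Java, but the article's own benchmarks are in Python; the ORG_UNIT table and its ID and PARENT_ID columns are hypothetical stand-ins): one query fetches every parent-child edge, and the hierarchy is traversed entirely in memory.

from collections import defaultdict

def load_hierarchy(db):
    # One pass over the table collects every parent -> children edge.
    children = defaultdict(list)
    cursor = db.cursor()
    cursor.execute("SELECT ID, PARENT_ID FROM ORG_UNIT")
    for node_id, parent_id in cursor.fetchall():
        children[parent_id].append(node_id)
    cursor.close()
    return children  # children[None] holds the root node(s)

def descendants(children, node_id):
    # Walk the in-memory tree; no join of indefinite depth required.
    for child in children[node_id]:
        yield child
        for grandchild in descendants(children, child):
            yield grandchild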


The DBA finally "won" because of the SQL Hegemony argument -- all access must be in SQL, right? I say "won" because we eventually had to throw all the SQL away and use flat files. A "pure SQL" data warehouse is generally unacceptably slow for loading. Data mart subsets can be done in pure SQL, but loads can't.

Recent Events. "a table called [LOTSADATA] and it has 14.7 million rows. One of the columns in the [LOTSADATA] table is BRAND" for which they need to do a select distinct. "The disadvantage of [SELECT DISTINCT] is that the Oracle database engine will do a sort, which is an insanely expensive operation."

Question: Are there alternative approaches to obtaining the unique brands in a table?"


Response 1. Duh. Of course there are alternatives. What are you, stupid? You have programming languages. Use them.

Response 2. You're kidding, right? Why ask me? Why not just run it? How hard can it be to benchmark this? What are you, stupid? Seriously.

Response 3. Oh. SQL Hegemony. Folks are actually arguing about the cost of a query and -- it appears -- no one can actually write the eight lines of code required to demonstrate that SELECT ALL is faster than SELECT DISTINCT.

[Sorry for calling you stupid. You're paralyzed by fear, not stupidity. What if SQL isn't the perfect end-all, do-all language? If SQL isn't perfect for all data processing, what other lies have we been living? Is this the end of organized data processing? The collapse of western civilization?

Indeed, I'm repeatedly shocked that the question even comes up. And I'm more shocked that the "appeal to authority" argument has to be used. It's trivial to measure. It appears that it's easier to ask me than to gather data.]

Edit. SQL Hegemony? Yes. Rather than run a demonstration program, written in Java or C# or Python, they argued about the SQL. Doing this with minimalist SQL didn't seem to make anyone's radar. Why not? SQL Hegemony. Rather than consider real alternatives, everyone was reduced to looking for sneaky SQL tricks.

Benchmarking. Here is what I did. It's 5 lines of code for each case. [How hard can this be? Apparently, SQL hegemony makes it impossible for some organizations to do even this.]

# 'db' is an open SQLite3 connection to the test database;
# a reconstruction of the setup appears after the notes below.

def select_distinct():
    # Let the database engine eliminate the duplicates.
    q1 = db.cursor()
    q1.execute("SELECT DISTINCT BRAND FROM LOTSADATA")
    print(q1.fetchall())
    q1.close()

def select_all():
    # Fetch every row; eliminate the duplicates in Python with a set.
    q2 = db.cursor()
    q2.execute("SELECT ALL BRAND FROM LOTSADATA")
    print(set(q2.fetchall()))
    q2.close()

Notes.

  • I only simulated 100,000 rows. [I don't have the patience to wait for 15 million rows to be created, loaded and queried.]
  • The table only had four columns.
  • I used SQLite3, which is mostly in-memory and runs much, much faster than Oracle.
  • The select all is not a specious result based on cache being filled; the results are repeatable in any ordering of the queries.
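For completeness, here is a reconstruction of the setup and timing harness, which the article doesn't show, assuming an in-memory SQLite3 database, a four-column table, and invented brand values:

import random
import sqlite3
import timeit

# Hypothetical fixture matching the notes above: 100,000 rows, four
# columns, SQLite3 in memory. The actual setup code wasn't published.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE LOTSADATA (ID INTEGER, BRAND TEXT, C3 TEXT, C4 TEXT)")
brands = ["Brand%02d" % i for i in range(20)]
db.executemany(
    "INSERT INTO LOTSADATA VALUES (?, ?, ?, ?)",
    [(i, random.choice(brands), "filler", "filler") for i in range(100000)]
)
db.commit()

print(timeit.timeit(select_distinct, number=1))
print(timeit.timeit(select_all, number=1))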

Results.

select_distinct    0.417096
select_all         0.162827

For this data, the SQL SELECT DISTINCT took roughly 2.5x as long as the SELECT ALL. It's just that simple.

Want more speed? Use array fetch features to get more rows in bigger buffers.
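For instance, here is a sketch of the same query pulled in batches through the DB-API's arraysize and fetchmany; the batch size is an arbitrary guess:

def select_all_batched():
    # arraysize hints the driver to buffer this many rows per
    # fetchmany() call, cutting per-row round-trip overhead.
    q = db.cursor()
    q.arraysize = 5000
    q.execute("SELECT ALL BRAND FROM LOTSADATA")
    brands = set()
    while True:
        batch = q.fetchmany()
        if not batch:
            break
        brands.update(batch)
    q.close()
    return brands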

Consequences.

This is not rocket science. SQL can be Slow. Don't Argue: Benchmark. Your Mileage May Vary.

SQL databases do locking, transaction management, backup and recovery, and a bunch of other things well. SQL databases are helpful and necessary. However, SQL isn't always fast.

SQL means Slow Query Language. You've been told.
Published at DZone with permission of Steven Lott, author and DZone MVB. (source)



Comments

Dragan Sahpaski replied on Fri, 2010/10/01 - 8:46am

Don't blame SQL and relational databases because your company staff are not creative. Also, SQL is a language. You probably mean that relational databases can be slow, not SQL. SQL can also be implemented in a non-relational database environment.

Craig Ringer replied on Fri, 2010/10/01 - 9:11am

"I used SQLite3"

... which is a remarkably primitive database. Very good for embedded use on smaller data sets, or when you're willing to do more processing in your code and rely on the database only for simpler work. It's a pretty dumb engine, though, so it doesn't do much for your argument when you've been talking about it being faster to do things in $procedural-language rather than Oracle. It also doesn't help that you failed to demonstrate your faster method for picking out distinct results, though there are indeed faster methods than a sort (and Oracle should use them if tuned correctly with sacrificed goats and chicken gut divination; PostgreSQL does).

As it happens, I agree with you that there are some tasks for which SQL databases are far from ideal. Data warehousing can be a good example, though it's possible to produce column-store based databases that still use SQL and these perform extremely well. A row store SQL database is certainly not great for data warehousing and data mining. (That said, you could've used WITH RECURSIVE, or the Oracle-specific START WITH ... CONNECT BY rather than a multi-self-join for your recursive query. I assume it was a long time ago, before these options were available?).

Anyway: there are jobs that SQL is a poor fit for, I just don't think your argument does a very good job of showing this. So I'll show why your example doesn't hold up on a real database with real data.

I don't have the $lots required for Oracle, and prefer to use PostgreSQL in any case. I'd like to try your test on a somewhat smarter database, so I'll use PostgreSQL running on my local win7 box. You didn't specify the column types or the nature of the dummy data used, nor did you provide the code you used to generate it. On the other hand, for the argument it doesn't matter much, as your argument isn't all that specific to the characteristics of the data. I'll use a 220MB, 420000 row table I have on hand on an idle dev machine. Picking an un-indexed column with a reasonable variety of data in it, I'll compare the time for the database to generate a distinct subset vs the time it takes for a simple seqscan, as per your example.

=> \timing on
=> \o outfile
=> select distinct page_number from booking;
Time: 458.107 ms
=> select page_number from booking;
Time: 450.470 ms

Indeed, the DISTINCT is marginally, trivially slower. Of course, I picked the slowest of three times for DISTINCT and the fastest of three for the straight select; in fact, DISTINCT fared much better with an average of 416ms while the straight seqscan got 460ms.

Once you factor in the extra data transfer of the seqscan, and the time required to identify distinct values ... good luck. PostgreSQL is doing a hash aggregate over a sequential scan of the data, not a sort-and-filter, and you are *not* going to beat it in your code when talking to the database over JDBC/ODBC or even a native protocol.

So, yes, a badly tuned SQL database or a really basic one performs badly. A properly set up and configured database, not so much. That's not to say that there aren't some jobs SQL is bad for or slow for - there are - just that your argument goes nowhere toward demonstrating that.

Benjamin Keil replied on Fri, 2010/10/01 - 12:16pm

There are several ways to model hierarchies in SQL that don't require "a join of indefinite depth."

First, there's Joe Celko's famous "Nested Set" model.  Each node has a left-value and a right-value.  A node N contains another node M just in case N.left < M.left and N.right > M.right.
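Concretely, a minimal descendant query under that model (the org_tree table and its columns are invented for the sketch):

def nested_set_descendants(db, node_id):
    # Every node whose (lft, rgt) interval lies strictly inside the
    # given node's interval is a descendant.
    q = db.cursor()
    q.execute(
        "SELECT m.id FROM org_tree n JOIN org_tree m "
        "ON m.lft > n.lft AND m.rgt < n.rgt "
        "WHERE n.id = ?",
        (node_id,)
    )
    rows = [r[0] for r in q.fetchall()]
    q.close()
    return rows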

Another way to go is to concatenate a node's identifier with its ancestors.  So X has sub-nodes XY and XZ, and XY dominates XYQ.

And then some databases support the "WITH RECURSION" query modifier that takes care of the multiple-join for you. 
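For what it's worth, a sketch of that recursive form in the article's own SQLite style; the standard SQL:1999 spelling is WITH RECURSIVE, which recent SQLite versions accept (the org_unit table and its columns are invented):

def recursive_cte_descendants(db, node_id):
    # The database walks the hierarchy; no client-side loop of joins.
    q = db.cursor()
    q.execute(
        "WITH RECURSIVE sub(id) AS ("
        "  SELECT id FROM org_unit WHERE parent_id = ?"
        "  UNION ALL"
        "  SELECT o.id FROM org_unit o JOIN sub s ON o.parent_id = s.id"
        ") SELECT id FROM sub",
        (node_id,)
    )
    rows = [r[0] for r in q.fetchall()]
    q.close()
    return rows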

Endre Varga replied on Sat, 2010/10/02 - 6:01am in response to: Benjamin Keil

Have you ever actually _used_ the models you recommend? I did. Nested Set works fine for some kinds of queries but not for others. It is pretty inflexible.

WITH RECURSION does not help much. It may reduce the code you have to write, but it will be just as slow to execute.

Craig Ringer replied on Sun, 2010/10/03 - 9:43pm in response to: Endre Varga

WITH RECURSIVE has always tended to perform better than loading the lot and processing it in memory for me. Perhaps I simply write bad code - or maybe I'm using a fairly smart database engine (PostgreSQL) with appropriate indexes that's configured and tuned to the hardware. Either way, the database is smarter than me, and that's good enough. Occasionally I get better results using a stored procedure running inside the database, where I need logic that's not easily expressed in SQL. I've never encountered a case where it's faster to fetch the data and process it in my code than it is to process it in the DB in a stored procedure. I'm sure there are cases where it's faster to slurp the data and process it yourself than it is to let a RDBMS do it - though mostly tasks where a RDBMS isn't the best choice for data storage in the first place. Some tasks aren't well suited to relational sets - like modelling trees and graphs, and storing documents. I just think the OP does an absolutely terrible job of arguing the point. Why are people saying "WITH RECURSION" anyway? SQL:1999 uses "WITH RECURSIVE". Is "WITH RECURSION" an extension of some kind? Or are they just talking about something they've never actually used?

Bill Karwin replied on Mon, 2010/10/04 - 12:09am in response to: Dragan Sahpaski

Yes, I had a very similar thought as Dragan Sahpaski as I was reading this.  The problem is not SQL.  The problem is that the development & DBA team has a culture of educated guesses, relying on "it stands to reason that..." arguments instead of "let's test whether it's true that..."  This probably affects their work in any language, not just SQL.

Jon Martin Solaas replied on Mon, 2010/10/04 - 6:08am

It would be interesting to know whether the brands column was indexed. If so, sorting and a full table scan would not be needed. There are many ways to tune Oracle... Even if relational storage is not always best for all data, it is pretty good in most cases, and having all your stuff in one place is often a good thing. Oracle has support for hierarchical retrieval of data (PRIOR), and if all the data is already in an Oracle data warehouse, then you're pretty much tied to Oracle already and can use Oracle extensions. There are of course other nice ways to store data, but actually I think this article reflects your personal dislike of SQL more than any real weaknesses of storing data in an RDBMS.

Jon Martin Solaas replied on Mon, 2010/10/04 - 6:23am

... and by the way, isn't the whole point of a data warehouse to flatten out data in a denormalized way so that the heavy queries you need to run often will execute blazingly fast? I'd say that the slow performance here is due to an ineffectively structured warehouse ...

Jean-pol Landrain replied on Mon, 2010/10/04 - 9:12am

Sorry, but I have the feeling this benchmark is a bit biased: in the second example you should also eliminate the duplicates using code in order to achieve the exact same result as the first example. I'm sure the second example would be the slowest after this process, not because the execution of custom code would be slower than Oracle/SQL, but mainly because you'd transmit much more data than needed over the network (meaning this would not necessarily be the case with an in-memory DB). Benchmarks are good, but good benchmarks are often difficult to put in place with a realistic scenario.

Cloves Almeida replied on Mon, 2010/10/04 - 8:02pm in response to: Jon Martin Solaas

My thoughts exactly. And if you get to the point where you have so much data that even other DW techniques won't cut it, throwing in more hardware is usually cheaper than spending programmer time to do what the RDBMS is supposed to.

As an alternative there are column-based RDBMS that are very suited to the massive aggregation problems of BI.

Bikes Direct replied on Thu, 2011/08/11 - 1:51pm

What is missing from the article is the Python code that processes the SELECT ALL output through a Python set to produce the distinct brands. -BikesDirect

 

Thomas Kern replied on Thu, 2012/09/06 - 11:58am

If a program needs SELECT DISTINCT that badly, why not keep a table with such results and select from it?

select * from distinct_brands; -- will do the trick

+ update that table from time to time when new data arrives.
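A minimal sketch of that refresh, assuming a one-column distinct_brands summary table:

def refresh_distinct_brands(db):
    # Rebuild the summary table; run whenever new data has arrived.
    q = db.cursor()
    q.execute("DELETE FROM distinct_brands")
    q.execute(
        "INSERT INTO distinct_brands SELECT DISTINCT BRAND FROM LOTSADATA"
    )
    db.commit()
    q.close()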

http://www.java-tips.org 
