Friday, 17 February 2012

Understanding SQL: Complex Queries


The previous article in the "Understanding SQL" series brought us from easy SELECT statements to more useful JOINs and subselects. This article will go into more advanced queries using Common Table Expressions (CTE) and aggregated results with the GROUP BY clause. The two don't need to go hand in hand, but this article will utilize them both.
If you are not very familiar with SQL and you haven't already done so, please read Understanding SQL: SELECT The Data You Want. It will give you a bit more than the basics and you should be able to understand where we take things in this article. Also, we are going to continue with the "ecommerce web site" example from that article.

What is a GROUP BY

Before we can go on, you need to know what a GROUP BY clause is. To put it simply, you can tell SQL how to "GROUP" records together. For example, if you wanted to know how many people had the same name, you could make a query:
SELECT FirstName, COUNT(*) FROM dbo.Customers GROUP BY FirstName
That would return a result set of all unique names in your Customers table, and how many times each name is found. Let's look at another query to help us understand what a GROUP BY is used for in SQL.
SELECT FirstName, MAX(ID) FROM dbo.Customers GROUP BY FirstName
This query is telling SQL that you want to know the highest ID of all users if they were grouped together by their names. So all the Bobs would get together, and whoever had the highest ID would come forward in the result set. Likewise, all the Timothys and all the Jonathans would do the same thing. Now let's continue on to more advanced queries.

On The Fly Views

Often times you may need to treat a small portion of a TABLE as if it were its own unique TABLE (or VIEW). But creating a VIEW for one query is a bit ridiculous. A good "for instance" would be if you wanted a report that displayed the latest purchase made by all customers. You wouldn't want to simply JOIN the Customers table to the Orders table as you would get all orders for every customer, not just their latest.
You have a few options here: Full JOIN then aggregate subqueries in the WHERE clause or a CTE (Common Table Expression). But the solution we are going to choose is the CTE, and I'll explain why in a bit. First of all, let's look at the first option (which is the option likely to be taken by most people). Here is what most people would do:
-- With the knowledge of a "GROUP BY" clause as explained above,
-- people go overboard and do something like this:


SELECT
   Customers.*,
   MAX(Orders.OrderTime) AS LatestOrderTime
FROM
   dbo.Customers INNER JOIN dbo.Orders
       ON Customers.ID = Orders.CustomerID
GROUP BY
   Customers.ID, Customers.FirstName, Customers.LastName, Customers.Address,
       Customers.City, Customers.State, Customers.Zip
What they don't realize is that every column that they have to add to the GROUP BY clause is going to cause more and more strain on SQL. Think about it, as each record comes in, SQL has to check to see if their is already a record with that same ID, FirstName, LastName, Address, City, State and Zip. And imagine if there were more columns in the Customers table; you'd be killing SQL for no reason.
Now like I said above, a CTE can treat a part of a table as if it were it's own unique table. Meaning, wouldn't it be nice if we had a table of only the most recent orders for every customer? Lets see how we can make a CTE that has the data we need, and include it in our query so that we don't have to GROUP BY so many fields.
-- Create a CTE (fake table or viewof the latest order IDs
WITH LatestOrders (ID) AS (SELECT MAX(ID) FROM dbo.Orders GROUP BY CustomerID)

-- Only retrieve the records that are in the "LatestOrders" CTE

SELECT
   Customers.*,
   Orders.OrderTime AS LatestOrderTime
FROM
   dbo.Customers INNER JOIN dbo.Orders
       ON Customers.ID = Orders.CustomerID
WHERE
   Orders.ID IN (SELECT ID FROM LatestOrders)
You may think that this query is more expensive (harder on the processor) than the one above using the massive GROUP BY, but it's not. In fact, it's about 100% faster and less processor intensive. If I added more records and got to have a large table, the CTE approach could be hundreds of times faster.
Something to keep in mind though is that the CTE is only available in SQL Server 2005, so if you are using SQL Server 2000 you have to do this the old way. The query will actually 'cost' the same to SQL (because it's really the same thing), but it will look a little uglier. Here it is using a subquery in the WHERE clause.
SELECT
   Customers.*,
   Orders.OrderTime AS LatestOrderTime
FROM
   dbo.Customers INNER JOIN dbo.Orders
       ON Customers.ID = Orders.CustomerID
WHERE
   Orders.ID IN (SELECT MAX(ID) FROM dbo.Orders GROUP BY CustomerID)
While this may seem easier than the CTE, you should remember that I'm keeping the query simple for this article. The CTE is a lot cleaner for another developer to come behind because you define your "fake table" at the top of your query, so people know right away what you are trying to do. Also, you give it a name such as "LatestOrders" which explains what you are using it for. Lastly, you can put it in your JOIN clause just like a real table.

Concluding Comments

One thing of note is that a CTE can only be used once in your query. So you can't declare your CTE at the top, then do multiple queries against it. However, you can make multiple CTE's and then use them together in one query. Also, everything has it's place, so if you find that you are building the same CTE multiple times for different reports / queries, then you might want to turn that into a VIEW instead.
If you've read the previous article mentioned above, and now this one, you should be able to make some useful queries (or reports). Try to build your queries logically, first explaining to yourself what you want the result to be. Then piece by piece add the code together until you reach the end result. If you find that by adding another JOIN or column to your output suddently breaks what was coming together nicely, then remove it and rethink what you did.
Here is a final example report that you should be able to build with the information in these articles. We'll make a "Customer Activity Report" that will show the latest purchase date, and how many items have been purchased total for each customer. Here's the code:
WITH LatestOrders (ID) AS (SELECT MAX(ID) FROM dbo.Orders GROUP BY CustomerID)

SELECT
   Customers.*,
   Orders.OrderTime AS LatestOrderTime,
   (SELECT COUNT(*) FROM dbo.OrderItems WHERE OrderID IN
       (SELECT ID FROM dbo.Orders WHERE CustomerID = Customers.ID))
           AS TotalItemsPurchased
FROM
   dbo.Customers INNER JOIN dbo.Orders
       ON Customers.ID = Orders.CustomerID
WHERE
   Orders.ID IN (SELECT ID FROM LatestOrders)
Your results should look something like this:
Latest order time and total items purchased per customer



Reference:
http://www.singingeels.com/Articles/Understanding_SQL_Complex_Queries.aspx

No comments:

Post a Comment