This is part 2 of our advanced practice SQL interview questions and answers. We highly suggest that you read part 1 of our Advanced SQL interview Questions before reading this, since a lot of the concepts presented in this portion are discussed in more depth in part 1.
The problem is based on the tables presented below where salespeople have orders with certain customers that are in the Customers table.
In the tables above, each order in the Orders table is associated with a given Customer through the cust_id foreign key column that references the ID column in the Customer table.
Here is the problem: find the largest order amount for each salesperson and the associated order number, along with the customer to whom that order belongs to. You can present your answer in any database’s SQL – MySQL, Microsoft SQL Server, Oracle, etc.
The answer to the problem and explanation
This question seems to be quite simple – but as you will soon find out it is deceptively complex. For each salesperson, all we need to retrieve is the largest order amount, and the associated order number. In order to retrieve that information we should be able to simply do a join between the Orders and Salesperson tables wherever the Salesperson.ID is equal to the Orders.salesperson_id (this would be our join predicate). Then, we could group the results of that join by the Orders.salesperson_id column and retrieve both the highest valued order (by using max(Amount)), and the associated Order number.
Let’s say that we choose to write our answer in MySQL. In MySQL we could legally write some code that looks like this:
SELECT Orders.Number, max(Amount) FROM Orders JOIN Salesperson ON Salesperson.ID = Orders.salesperson_id GROUP BY Orders.salesperson_id
And, if we run that code above it will return this as a result:
Number max(Amount) 30 460 10 2400 50 720 20 1800
The problem with the data returned in MySQL
But, there’s a problem with the results returned from running the SQL above, and it should be fairly obvious once you actually look at the data in the tables we have above. Here is the problem: Order number 50 does not have an amount of “720″ – that amount actually belongs to order number 60. So, what is going on here? Why do the results return an order number that is not even in the same row as the max(Amount) of 720? And why are all of the other results correct?
Understanding the group by statement is critical
Well, we will have to explain a bit more about what’s going on with the group by. If you already read Part 1 of the advanced SQL interview questions then you should understand exactly what the problem is with the SQL above, and you can safely skip down to the section that says “New approach to the problem – start with a subquery”. Unless you want to reinforce the concepts presented in part 1, in which case we highly recommend that you read this entire explanation to this rather difficult interview question.
When we group by the salesperson ID, there will be one group created for each and every salesperson ID. So, there will be 6 groups created – 1 for ID of 1, another for ID of 2, and others for ID’s 5, 7, 8, and 11. Inside those groups will be any rows that share the same salesperson ID values.
When we select the max(Amount), MySQL will simply look for the highest value for Amount within each group and return that value. And when we select Orders.Number, MySQL is not going to return every Orders.Number value from each group – it is only going to select one value from each group.
Our SQL is not specific enough
But, the question is which order number should be returned from each group? Each group can potentially have more than just one order number as long as there are more than one rows belonging to the group. And that is the exact problem – the SQL that we wrote is not specific enough, and MySQL will just arbitrarily/randomly return one of the values of the Orders.Number within each group. In this case, because order number 50 is part of the group created by the salesperson_id’s of 7, it will return 50. MySQL could just as well have returned order numbers 60 or 70 – the point is that it just randomly chooses one order number from each group. For the group created by salesperson ID of 2, the fact that the order number 10 is chosen (order number 10 corresponds to the largest order amount of 2400) is just pure coincidence – MySQL could have returned us order number 40, which is also a part of the same group as salesperson ID of 2.
Most relational database implementations would have thrown an error if we tried to run the SQL above because the results are potentially arbitrary, as we just illustrated. MySQL is the exception, because it allows us to run the SQL above error-free, but as we illustrated the data returned could potentially not make any sense. Be sure to read Part 1 of the advanced SQL interview questions for more details on why.
Well, now we know that there is definitely an issue with the SQL above, so how can we write a good query that would give us exactly what we want – along with the correct order number?
New approach to the problem – start with a subquery
Now let’s instead just try to break the problem down into more manageable pieces – starting with a simple subquery. Here is a subquery to get the highest valued order for each salesperson:
SELECT salesperson_id, MAX(Amount) AS MaxOrder FROM Orders GROUP BY salesperson_id
Running the query above will return this:
The query above gives us the salesperson_id and that salesperson’s associated highest order amount – but it still does not give us the order number associated with the highest order amount. So, how can we find the order number as well?
Clearly we need to do something else with the subquery we have above that will also give us the correct order number. What are our options? Try to come up with an answer on your own before reading on.
Well, we can do a join with the results of the subquery above. But, on what condition should our join be done and what exactly should we be joining the subquery above with?
What if we join our subquery above with data from the Orders table, where the join is done on the basis that the salesperson_id matches, AND that the value in the Order table’s Amount column is equal to the amount (MaxOrder) returned from the subquery? This way, we can match up the correct Order Number with the correct corresponding value for the maximum Order Amount for a given salesperson_id.
With that in mind, we can write this query:
select salesperson_id, Number as OrderNum, Amount from Orders JOIN ( -- this is our subquery from above: SELECT salesperson_id, MAX(Amount) AS MaxOrder FROM Orders GROUP BY salesperson_id ) as TopOrderAmountsPerSalesperson USING (salesperson_id) where Amount = MaxOrder
Running the query above returns us this:
How does the query work exactly?
How does the query above work exactly? It’s actually pretty simple. First, the subquery (which is basically a derived table here, named TopOrderAmountsPerSalesperson) returns the orders with the highest dollar amounts per salesperson, and the associated salesperson ID. So, now we have each salesperson’s highest valued order and his/her ID in a derived table. That derived table (the results from the subquery) is then joined with the entire Orders table on the condition that the salesperson ID matches and that the Amount from the Orders table matches the MaxOrder amount returned from the derived table. What’s the point of this? Well, that join will give us the correct OrderNumber since it is matching on both the salesperson ID and the amount. Even if there are 2 rows with the same exact salesperson ID and amount it will not even matter because no matter which ordernumber is associated with that row, the result set will be exactly the same.
And remember that the whole reason we are doing this is to avoid the original problem with not being able to select a non-aggregated column with a group by.
Now, retrieving the salesperson name is simple. Try to figure it out on your own.
Here is how we retrieve the salesperson name – we just use another join with the Salesperson table and select the Name:
SELECT salesperson_id, Name, Orders.Number AS OrderNumber, Orders.Amount FROM Orders JOIN Salesperson ON Salesperson.ID = Orders.salesperson_id JOIN ( SELECT salesperson_id, MAX( Amount ) AS MaxOrder FROM Orders GROUP BY salesperson_id ) AS TopOrderAmountsPerSalesperson USING ( salesperson_id ) WHERE Amount = MaxOrder
Running the query above returns this:
And, finally we have our answer! But one last thing – let’s check for corner cases. What would happen if we add one more row to the table where a given salesperson has 2 or more orders that have the same value for the highest amount? For example, let’s add this row to the Orders table:
This now means that the salesperson with an ID of 2 has 2 orders with an amount of 2400 in the Orders table. And, if we run the SQL above again, we will get this as a result (note the extra row for Bob):
Now, the question is if we only want one of Bob’s orders to show up, how can we eliminate the duplicate? Again, try to figure this out on your own before reading our answer.
Well, we could add a GROUP BY salesperson_id, Amount to the end of the query, which would create separate groups for each unique combination of the salesperson ID and the Amount. This would give us a query that looks like this:
SELECT salesperson_id, Salesperson.Name, Number AS OrderNumber, Amount FROM Orders JOIN Salesperson ON Salesperson.ID = Orders.salesperson_id JOIN ( SELECT salesperson_id, MAX( Amount ) AS MaxOrder FROM Orders GROUP BY salesperson_id ) AS TopOrderAmountsPerSalesperson USING ( salesperson_id ) WHERE Amount = MaxOrder GROUP BY salesperson_id, Amount
Now, running this query even with the duplicate row in the Orders table would return us this:
And that’s it – we are now good to go, and we have a final answer to this difficult interview question! This concludes our series of complex SQL interview questions – hopefully you found them challenging!