slowness of SQL UPDATE statements (vs SAS Data Step but any SQL advice welcome)

**JeenLeen** · 2018-12-12, 03:21 PM (ISO 8601)

There's lot of times when I need to update one data element in one table with corresponding values from another table.
I'm using SAS, so I can do this a number of ways. The default in SAS is to use a data step to merge the two tables. It'd doable, but the syntax lacks finesse to just grab 1 variable.

So I tend to prefer SQL, using an UPDATE statement.

However, I've found that, when I'm trying to update a huge table with values from another at-least-big table, it takes prohibitively too long to run. I had one example before when it ran for an hour or more before I stopped it and programmed it in data-step code, but I think I'm running into it again.

My question is: any advice on how to modify the SQL code to speed up efficiency, or is SQL just slow here?

Example: I have a table with about 6 million records (called mydata). Another with about 600,000 records. I want to pull update one variable based on a shared ID number, for the year 2016.

Code:

proc sql;
    update mydata
           set myVar=(SELECT myVar FROM otherdata WHERE otherdata.IDNum = mydata.IDNum)
           where myVar='' AND year='2016' and flag=1;
quit;

Would creating an index on the ID number be fruitful? This is a many-to-one match, with mydata having the ID show up multiple times and the other data just having it once. I think I heard that indexing on unique IDs isn't usually helpful. I could sort both by the ID, but I don't think having it sorted helps much with SQL.

I could break mydata into two tables, one for just 2016 and one with the rest of the data, then do the UPDATE, then combine the tables. I reckon the issue is that, for each 6 million records, it has to check each of the 600k records for a match. Hence doing 6 mil * 600k operations. (I'm not sure if the WHERE statement filters before or after the SET, but I think I read that SQL does all the work first and then filters out based on WHERE.)

**Douglas** · 2018-12-12, 03:48 PM (ISO 8601)

If otherdata.IDNum does not have an index already, then yes you should create one for it. That by itself will likely make it over 10,000 times faster.

**Jasdoif** · 2018-12-12, 04:17 PM (ISO 8601)

Originally Posted by JeenLeen

I think I heard that indexing on unique IDs isn't usually helpful.

...quite the opposite. A unique ID is the first thing you should have an index on, so the database engine can look at the index and get pointed directly to where the corresponding row is in the table (as opposed to scanning through all the rows in the table to look for matches). Like Douglas is saying, an index on otherdata.IDNum would negate a ton of overhead by itself.

Originally Posted by JeenLeen

(I'm not sure if the WHERE statement filters before or after the SET, but I think I read that SQL does all the work first and then filters out based on WHERE.)

While that depends on how the database engine parsing the SQL statement performs it...it's almost certainly going to filter mydata by the WHERE clause first, because a cross-table subquery is bound to be expensive and will be more expensive the more rows it has to query; it's an "ounce of prevention vs pound of cure" scenario.

**factotum** · 2018-12-13, 12:11 AM (ISO 8601)

Not familiar with SAS Data Step, but if this were Microsoft SQL Server you could do the update as a join rather than as a correlated subquery:

UPDATE mydata SET myVar = otherdata.myVar
FROM mydata
LEFT OUTER JOIN otherdata ON mydata.IDNum = otherdata.IDNum
where myVar='' AND year=2016 and flag=1;

No idea if that would be faster in this instance or would even work in the variant of SQL you're using, but worth a try, I think.

(Also, why are you using "year = '2016'" in your WHERE statement? If year is an integer, which seems likely, then you're just introducing a string conversion for no good reason).

**Manga Shoggoth** · 2018-12-13, 04:53 AM (ISO 8601)

In addition to the above...

Depending on how technical you wish to be, most database systems can tell you how they are actually running the query. Oracle has "explain plan" , for example. If you can get at that information it may show you what you need to fix the query.

**JeenLeen** · 2018-12-13, 10:13 AM (ISO 8601)

Next time I have to do this, I'll do the index thing. Thanks! I probably misread something back when studying index stuff.

EDIT: I just found out I have to re-run the data because the guy who gave me the input needs to regenerate it. So that means I get to test the index idea now. Yay.

As for Year="2016", Year is stored as a character string, so that's why that's how it is.

**veti** · 2018-12-16, 12:49 AM (ISO 8601)

Originally Posted by factotum

No idea if that would be faster in this instance or would even work in the variant of SQL you're using, but worth a try, I think.

That should work in any SQL variant, I think, it's all standard functions. And joining is nearly always dramatically faster than a subquery. Subqueries are always slow, and should only be used in the last resort.

**JeenLeen** · 2018-12-17, 09:34 AM (ISO 8601)

Originally Posted by veti

That should work in any SQL variant, I think, it's all standard functions. And joining is nearly always dramatically faster than a subquery. Subqueries are always slow, and should only be used in the last resort.

That's disappointing. I could easily do a DATA step merge or a SQL join like you mention, but what I liked about the subquery was its simplicity when typing the code. That is, it seemed really eloquent in being short and easy to comprehend what the code is supposed to do. I feel annoyed to use a more complicated syntax.

I guess I'm just being stubborn since I think my code's pretty, even if it's a nightmare on the 'what the computer actually does' end.

Though I do really appreciate this advice. It's good to realize the why behind why the code wasn't running well and to have a couple other alternatives.

**Grey_Wolf_c** · 2018-12-17, 11:55 AM (ISO 8601)

Originally Posted by JeenLeen

I guess I'm just being stubborn since I think my code's pretty, even if it's a nightmare on the 'what the computer actually does' end.

You are not. Code clarity is as important as code efficiency - sometime more. Computers continue to get faster, making inefficient code have less of an impact. But code no-one can read remains code no-one can read no matter how much faster it becomes with each upgrade.

That said, in this case, your performance is getting enough of a hit that you probably should go with the less clean code. Besides, to a DB person, the update-with-join is probably as clean as the update-with-subclause.

Grey Wolf

**Manga Shoggoth** · 2018-12-17, 01:13 PM (ISO 8601)

Originally Posted by JeenLeen

That's disappointing. I could easily do a DATA step merge or a SQL join like you mention, but what I liked about the subquery was its simplicity when typing the code. That is, it seemed really eloquent in being short and easy to comprehend what the code is supposed to do. I feel annoyed to use a more complicated syntax.

I guess I'm just being stubborn since I think my code's pretty, even if it's a nightmare on the 'what the computer actually does' end.

Though I do really appreciate this advice. It's good to realize the why behind why the code wasn't running well and to have a couple other alternatives.

Depending on the underlying DBMS, what should happen is that the query optimiser looks at the query and works out the best way of running it. If the index exists and is usable then the optimiser should promote the subquery to a join on the fly. You shouldn't have to rewrite the query, just make sure it has the indexes it needs.

If you have access to the underlying database (or the ear of a friendly Database Administrator) it is worth getting the query plan to see how the query is actually being run - that will usually point you to the best way of improving the query performance.

**factotum** · 2018-12-17, 01:50 PM (ISO 8601)

Originally Posted by Manga Shoggoth

Depending on the underlying DBMS, what should happen is that the query optimiser looks at the query and works out the best way of running it. If the index exists and is usable then the optimiser should promote the subquery to a join on the fly.

It takes time and data about the underlying tables for the query analyzer to do its job, though, and you don't really have control over either factor unless you're the actual DBA looking after the SQL server. Therefore, I think it never hurts to write your query in the most optimal way that you can in order to give the query analyzer less to do in finding the optimal way to run it.

(Oh, and I'd note that actually find the JOIN method more easy to grok than the subquery--I really don't like subqueries and always get confused over their syntax).

Thread: slowness of SQL UPDATE statements (vs SAS Data Step but any SQL advice welcome)

Thread Tools

Spoilers