Increase efficiency,
earn more money
Efficiency Techniques
in SAS
Objective: The
first heading is just a fake one to get your attention, now when I have you
attention I’ll reveal my real objective. The objective of this post is to
discuss some of the techniques which can reduce your code running time and also
save you from memory buffer full conditions in case of large data.
The Techniques:
1) DROP KEEP DATASET OPTIONS/STATEMENTS:
Data
out (drop=time);
set in;
run;
Data
out;
set in (keep=time);
run;
Now you’ll say what new is there in these we use them
daily. Yes but it is always worthwhile to give a thought on where you are using
them. You are creating a report from Raw data which you will modify to get final
data. If you use these options or statements in the last step which is creating
final data the these will be of no use in terms of efficiency. So I just want
to say that use them smartly or in other way as early as possible so that SAS
will be saved from the burden of carrying unuseful variables till the end. Of
course you need to plan before you code to identify which variables you’ll
need.
Also it is good to keep the options in set statement
instead of data statement.
2) USE COLUMN INPUT IF POSSIBLE:
Data
out;
Infile in;
input
Name $1-10
Address $20-26;
run;
Column input is the most efficient type of input.
Using column style input allows you to create a subsetted
version of an external file containing only
the variables needed for analysis, as
opposed to reading all of the variables in and dropping the superfluous ones.
3) Use subsetting IF:
Always make use of subsetting IF to eliminate
unnecessary observations. There is an important point which people generally
ignore in using subsetting IF. Always use the subsetting IF(if the condition is
based on preexisting variables in a datset) before most of the calculations, as
obviously it will save time on doing calculation in observation which will be
dropped finally thus saving precious run time when dealing with large datasets.
To prove my point I have taken an example.
I created test1 and test2 datasets with same
calculation and output except the in first one the subsetting IF is used after
calculations and in the second before calculation.
67 data test1;
68 set test;
69 k=i*5;
70 w=i*6;
71 n=i*7;
72 if i < 10000000;
73 run;
NOTE: There were 100000000 observations read from the data set WORK.TEST.
NOTE: The data set WORK.TEST1 has 9999999 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 9.43 seconds
74
75 data test2;
76 set test;
77 if i < 10000000;
78 k=i*5;
79 w=i*6;
80 n=i*7;
81 run;
NOTE: There were 100000000 observations read from the data set WORK.TEST.
NOTE: The data set WORK.TEST2 has 9999999 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 8.90 seconds
You can see the time difference yourself with just 10 million observations and just 1 small calculation, so think what wonders can this small change do when applied to huge datasets and lot of calculations.
4) Use Indexes:
Indexes can provide huge efficiency improvements
when retrieving a small subset of data from a large dataset. There are multiple
ways to create an index, some of them are below:
/*PROC datasets*/
proc
datasets lib=work memtype=data;
modify test1;
index create composit=(i k);
run;quit;
/*PROC SQL*/
proc
sql;
create index composit on
work.test2(i,k);
quit;
/*DATA Step*/
data
work.test3(index=(composit=(i k)));
set work.test3;
run;
Using index is also efficient than sorting a dataset
as an indexed dataset in not required to be sorted.
Please note that indexes has certain disadvantages
which must be kept in mind before creating indexes :
1) First one is that they take extra SPACE and so if
you have a space crunch on your server, avoid indexes.
2) As data in a table is inserted,
modified, or deleted, an index must be updated to address those changes. This
automatic feature requires additional CPU resources to process any changes to a
table.
5) Remote Processing:
SAS/CONNECT Software enables users who use a client
install of the SAS System to connect to a remote server. Thereby taking full
advantage of the server’s greater storage capacity and processing power. Once a
signon script has been executed users may communicate with the server by using
RSUBMIT blocks, files may be moved from client to sever and vice versa using
either PROC UPLOAD/DOWNLOAD or RLS (Remote Library Services).
For example:
rsubmit;
proc
upload in=sashelp out=work;
select class;
data
sample;
set class(where=(age > 30));
proc
download in=work out=sasuser;
select sample;
run;
endrsubmit;
6) Use IF-THEN-ELSE wisely:
I will explain you with examples what I mean to say
:
1) For e.g. you have 6 conditions out of which only
one can be true then use and if then else instead of a list of IF conditions as
below :
/* Bad use of IF*/
if age le 10 then pocket_money=10;
if age gt 10 and age le 20 then pocket_money=50;
if age gt 20 and age le 30 then pocket_money=50;
if age gt 30 and age le 40 then pocket_money=50;
if age gt 40 and age le 50 then pocket_money=50;
if age gt 50 and age le 60 then pocket_money=50;
/* Good use of IF ELSE */
if age le 10 then pocket_money=10;
else if age gt 10 and age le 20 then
pocket_money=50;
else if age gt 20 and age le 30 then
pocket_money=50;
else if age gt 30 and age le 40 then
pocket_money=50;
else if age gt 40 and age le 50 then
pocket_money=50;
else if age gt 50 and age le 60 then pocket_money=50;
Also best will be to list the conditions in
IF-THEN-ELSE in decreasing order of probability.
2) While using AND conditions keep the condition
first which is more likely to be false and in OR keep the condition first which
is more likely to be true as in thse cases the second condition will not be
evaluated.
7) Use FORMATS and not IF-THEN-ELSE when assigning values based on
criteria:
When you have to assign one value to a variable from
a list of values based on a criteria it is more efficient to use proc format
than using a series of IF-THEN-ELSE statements.
Example:
/*Less Efficient*/
data
out;
set in;
if oldno = 0 then newno="< 3 yrs old";
else if oldno=1 then newno="no school";
else if oldno=2 then newno="nursery school";
:
else if oldno=8 then
newno="11th grade";
else if oldno=9 then newno="12th but nongrad";
run;
/*More efficient:*/
proc
format;
value oldnof
0="< 3 yrs old"
1="no school"
2="nursery school"
3="kindergarten"
4="thru 4th grade"
5="thru 8th grade"
6="9th grade"
7="10th grade"
8="11th grade"
9="12th but nongrad";
run;
data
out;
set in;
newno=put(oldno,oldnof.);
run;
8) Use the COMPRESS= data set option when creating large SAS datasets:
Using compress option when creating large datasets
reduce space usage and Input Output time.
data
work.test1(compress=yes);
set work.test;
run;
9) Use concatenation to reduce the number of variables you need to sort
by:
Sorting large datasets by multiple variables take up
a lot of CPU time. So a tip for improving efficiency is to create a sort
variable by concatenating all the variables which are used for sort and use
that for sorting. In case you have some numeric variables convert them to
character to concatenate.
/*Less Efficient*/
proc
sort;
by a1 a2 a3
a4 a5;
run;
/*More efficient*/
data
out;
set in;
sortvar=a1||a2||a3||a4||a5;
run;
proc
sort;
by sortvar;
run;
Conclusion: These are few of the numerous small techniques you can use while doing
your day to day programming and get drastic performance improvements.
So make them a habit because in a world of ever-growing need of
computing resources efficiency never hurts..!!
Will be back with some more magic of SAS. Till then Goodbye.
Saurabh Singh Chauhan
(er.chauhansaurabh@gmail.com)
Note:
Comments and suggestions are always welcome.
Disclaimer :
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc.in the USA and other countries. ® Indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies. The contents of this post are the works of the author(s)and do not necessarily represent the opinions,recommendations, or practices of any organization whatsoever.
Disclaimer :
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc.in the USA and other countries. ® Indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies. The contents of this post are the works of the author(s)and do not necessarily represent the opinions,recommendations, or practices of any organization whatsoever.
Thank you Saurabh These techniques will be very helpfull for getting me in sas programmers industry.
ReplyDeleteThanks Ganesh...you can mail me directly if you want any other help.. cheers
DeleteHi Saurabh,
DeleteI am working with a UK Bank, in sas analytics.
Let me know your email ID and number.
thanks
Deepanker
Hi Deepanker.. My email is er.chauhansaurabh@gmail.com...please drop me an email.
DeleteOverall, your tips have merit. I suggest using Compress=yes at the option level and NOT at the dataset level. This way, all datasets will be compressed without having to repeat the compress=yes at each dataset.
ReplyDeleteA bit surprised you did not mention using a subsetting where clause to save on I/O.
You also did not mention making Permanent Format Libraries which will be more efficient than temp format libraries but that could be considered beyond intermediate level SAS programmers.
HTH,
Chuck P.
Thanks Chuck....
DeleteYes using compress at option level will save a lot of extra typing...but will it not hurt efficiency compressing every datasets in the code...i was just wondering.
Yes i also realized that i forgot the subsetting where after i pubished the post. i guess i'll edit this post when i'll have time..thanks for reminding :-)
I guess i should write a separate post on permanent format libraries..
Cheers
Saurabh
I have seen it said that 80% of processing in any typical SAS system is sorting. Use of Indexes and format lookups can avoid sorts.
ReplyDeleteIn point 7, the use of Select otherwise statements should also be considered over a sequence of else ifs
Good Effort.. There is an exponential advantage in processor time when 'where' is used instead of 'if'..
ReplyDeleteAlso inthe spirit of the title..
your sortvar can be created by sortvar =catx('',a1,a2,a3,a4,a5)
Cheers!
Zintie
Thanks..
DeleteYes you are right where is more efficient than if..
Thanks for the sortvar tip..