Monday, March 19, 2012

Efficiency Techniques in SAS


Increase efficiency, earn more money
Efficiency Techniques in SAS

Objective: The first heading is just a fake one to get your attention, now when I have you attention I’ll reveal my real objective. The objective of this post is to discuss some of the techniques which can reduce your code running time and also save you from memory buffer full conditions in case of large data.

The Techniques:

1) DROP KEEP DATASET OPTIONS/STATEMENTS:

Data out (drop=time);
set in;
run;

Data out;
set in (keep=time);
run;

Now you’ll say what new is there in these we use them daily. Yes but it is always worthwhile to give a thought on where you are using them. You are creating a report from Raw data which you will modify to get final data. If you use these options or statements in the last step which is creating final data the these will be of no use in terms of efficiency. So I just want to say that use them smartly or in other way as early as possible so that SAS will be saved from the burden of carrying unuseful variables till the end. Of course you need to plan before you code to identify which variables you’ll need.



Also it is good to keep the options in set statement instead of data statement.

2) USE COLUMN INPUT IF POSSIBLE:
Data out;
      Infile in;
      input
      Name $1-10
      Address $20-26;
run;

Column input is the most efficient type of input. Using column style input allows you to create a subsetted version of an  external file containing only the variables needed for  analysis, as opposed to reading all of the variables in and dropping the superfluous ones.

3) Use subsetting IF:

Always make use of subsetting IF to eliminate unnecessary observations. There is an important point which people generally ignore in using subsetting IF. Always use the subsetting IF(if the condition is based on preexisting variables in a datset) before most of the calculations, as obviously it will save time on doing calculation in observation which will be dropped finally thus saving precious run time when dealing with large datasets.
To prove my point I have taken an example.
I created test1 and test2 datasets with same calculation and output except the in first one the subsetting IF is used after calculations and in the second before calculation.




67   data test1;
68   set test;
69   k=i*5;
70   w=i*6;
71   n=i*7;
72   if i < 10000000;
73   run;

NOTE: There were 100000000 observations read from the data set WORK.TEST.
NOTE: The data set WORK.TEST1 has 9999999 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            9.43 seconds


74
75   data test2;
76   set test;
77   if i < 10000000;
78   k=i*5;
79   w=i*6;
80   n=i*7;
81   run;

NOTE: There were 100000000 observations read from the data set WORK.TEST.
NOTE: The data set WORK.TEST2 has 9999999 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            8.90 seconds



You can see the time difference yourself with just 10 million observations and just 1 small calculation, so think what wonders can this small change do when applied to huge datasets and lot of calculations.

4) Use Indexes:
Indexes can provide huge efficiency improvements when retrieving a small subset of data from a large dataset. There are multiple ways to create an index, some of them are below:

/*PROC datasets*/
proc datasets lib=work memtype=data;
  modify test1;
  index create composit=(i k);
run;quit;

/*PROC SQL*/
proc sql;
  create index composit on
  work.test2(i,k);
quit;

/*DATA Step*/
data work.test3(index=(composit=(i k)));
  set work.test3;
run;

Using index is also efficient than sorting a dataset as an indexed dataset in not required to be sorted.
Please note that indexes has certain disadvantages which must be kept in mind before creating indexes :

1) First one is that they take extra SPACE and so if you have a space crunch on your server, avoid indexes.

2) As data in a table is inserted, modified, or deleted, an index must be updated to address those changes. This automatic feature requires additional CPU resources to process any changes to a table.

5) Remote Processing:

SAS/CONNECT Software enables users who use a client install of the SAS System to connect to a remote server. Thereby taking full advantage of the server’s greater storage capacity and processing power. Once a signon script has been executed users may communicate with the server by using RSUBMIT blocks, files may be moved from client to sever and vice versa using either PROC UPLOAD/DOWNLOAD or RLS (Remote Library Services).

For example:
rsubmit;
      proc upload in=sashelp out=work;
      select class;

      data sample;
      set class(where=(age > 30));

      proc download in=work out=sasuser;
      select sample;
      run;
endrsubmit;

6) Use IF-THEN-ELSE wisely:

I will explain you with examples what I mean to say :

1) For e.g. you have 6 conditions out of which only one can be true then use and if then else instead of a list of IF conditions as below :

/* Bad use of IF*/
if    age le 10 then pocket_money=10;
if    age gt 10 and age le 20 then pocket_money=50;
if    age gt 20 and age le 30 then pocket_money=50;
if    age gt 30 and age le 40 then pocket_money=50;
if    age gt 40 and age le 50 then pocket_money=50;
if    age gt 50 and age le 60 then pocket_money=50;

/* Good use of IF ELSE */
if    age le 10 then pocket_money=10;
else if     age gt 10 and age le 20 then pocket_money=50;
else if     age gt 20 and age le 30 then pocket_money=50;
else if     age gt 30 and age le 40 then pocket_money=50;
else if     age gt 40 and age le 50 then pocket_money=50;
else if     age gt 50 and age le 60 then pocket_money=50;

Also best will be to list the conditions in IF-THEN-ELSE in decreasing order of probability.

2) While using AND conditions keep the condition first which is more likely to be false and in OR keep the condition first which is more likely to be true as in thse cases the second condition will not be evaluated.

7) Use FORMATS and not IF-THEN-ELSE when assigning values based on criteria:

When you have to assign one value to a variable from a list of values based on a criteria it is more efficient to use proc format than using a series of IF-THEN-ELSE statements.
Example:

/*Less Efficient*/
data out;
      set in;
      if oldno = 0 then newno="< 3 yrs old";
      else if oldno=1 then newno="no school";
      else if oldno=2 then newno="nursery school";
      :
      else if oldno=8 then newno="11th grade";
      else if oldno=9 then newno="12th but nongrad";
run;

/*More efficient:*/
proc format;
      value oldnof
        0="< 3 yrs old"
            1="no school"
            2="nursery school"
            3="kindergarten"
            4="thru 4th grade"
            5="thru 8th grade"
            6="9th grade"
            7="10th grade"
            8="11th grade"
            9="12th but nongrad";
run;
data out;
      set in;
      newno=put(oldno,oldnof.);
run;

8) Use the COMPRESS= data set option when creating large SAS datasets:

Using compress option when creating large datasets reduce space usage and Input Output time.

data work.test1(compress=yes);

set work.test;

run;


9) Use concatenation to reduce the number of variables you need to sort by:

Sorting large datasets by multiple variables take up a lot of CPU time. So a tip for improving efficiency is to create a sort variable by concatenating all the variables which are used for sort and use that for sorting. In case you have some numeric variables convert them to character to concatenate.

/*Less Efficient*/
proc sort;
      by a1 a2 a3 a4 a5;
run;

/*More efficient*/
data out;
      set in;
      sortvar=a1||a2||a3||a4||a5;
run;
proc sort;
      by sortvar;
run;

Conclusion: These are few of the numerous small techniques you can use while doing your day to day programming and get drastic performance improvements.

So make them a habit because in a world of ever-growing need of computing resources efficiency never hurts..!!

Will be back with some more magic of SAS. Till then Goodbye.


Saurabh Singh Chauhan
(er.chauhansaurabh@gmail.com)         
Note: Comments and suggestions are always welcome.

Disclaimer :
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc.in the USA and other countries. ® Indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies. 
The contents of this post are the works of the author(s)and do not necessarily represent the opinions,recommendations, or practices of any organization whatsoever.

9 comments:

  1. Thank you Saurabh These techniques will be very helpfull for getting me in sas programmers industry.

    ReplyDelete
    Replies
    1. Thanks Ganesh...you can mail me directly if you want any other help.. cheers

      Delete
    2. Hi Saurabh,

      I am working with a UK Bank, in sas analytics.

      Let me know your email ID and number.

      thanks
      Deepanker

      Delete
    3. Hi Deepanker.. My email is er.chauhansaurabh@gmail.com...please drop me an email.

      Delete
  2. Overall, your tips have merit. I suggest using Compress=yes at the option level and NOT at the dataset level. This way, all datasets will be compressed without having to repeat the compress=yes at each dataset.

    A bit surprised you did not mention using a subsetting where clause to save on I/O.

    You also did not mention making Permanent Format Libraries which will be more efficient than temp format libraries but that could be considered beyond intermediate level SAS programmers.

    HTH,
    Chuck P.

    ReplyDelete
    Replies
    1. Thanks Chuck....
      Yes using compress at option level will save a lot of extra typing...but will it not hurt efficiency compressing every datasets in the code...i was just wondering.
      Yes i also realized that i forgot the subsetting where after i pubished the post. i guess i'll edit this post when i'll have time..thanks for reminding :-)
      I guess i should write a separate post on permanent format libraries..
      Cheers
      Saurabh

      Delete
  3. I have seen it said that 80% of processing in any typical SAS system is sorting. Use of Indexes and format lookups can avoid sorts.

    In point 7, the use of Select otherwise statements should also be considered over a sequence of else ifs

    ReplyDelete
  4. Good Effort.. There is an exponential advantage in processor time when 'where' is used instead of 'if'..
    Also inthe spirit of the title..
    your sortvar can be created by sortvar =catx('',a1,a2,a3,a4,a5)

    Cheers!
    Zintie

    ReplyDelete
    Replies
    1. Thanks..
      Yes you are right where is more efficient than if..
      Thanks for the sortvar tip..

      Delete