Wednesday, May 1, 2013

How to create a Cartesian product using a datastep ?



How to create a Cartesian product using a datastep ?

Objective : Hello readers. This post is unlike all my previous posts which deal with one or the other concepts of SAS and explain them in detail. The idea for this post came up from one of the interview questions which I have been asked many times over (and I am asking it now when I am on the other side of the table :-) ).

In one of my previous posts I have compared the datastep merges with SQL joins, but no one ever tells how to do in datastep what is called a Cartesian product in SQL, and which is pretty easy to do in SQL. Interestingly Cartesian product forms tha base of SQl joins, so if we can replicate SQL joins in datastep then we should also be able to replicate Cartesian product..!!

The answer is Yes we can do it, and easily. But you’ll know it only when you see it.
So here is the short and simple way to do it, but before that let me explain for people who don’t know about Cartesian product, what it is and what is a point option(Shhh..this is the key)



Cartesian product:

This is a mathematical term which is used in set theory. A Cartesian product of set A and set B is a set of all ordered pairs of the elements of A and B.

That was complicated..!!  Thank GOD we don’t have to pass a mathematics paper anymore.

Ok, forget maths, in SAS when we do a Cartesian product of dataset A and dataset B, then we create a dataset containing all rows of A horizontally combined with all rows of B.
So if A has 3 observations and B has 4, Then the Cartesian product will have 12 observation and all the unique columns of A and B.

In SQL we do a Cartesian product similar to a join but just eliminating  giving any join names or or joining conditions. Here is the syntax:

Proc sql;
      create table cartesian1 as
            select * from cart1,cart2;
Quit;

Simple right..!!

Though it is not this simple in datastep and that is why I never got correct answer from any of the interviewees till date. Actually the reason is they have never seen it before.
Also using this technique requires a knowledge of point option in SAS datastep. Here is a brief overview of the option:

Point option:

The point creates a temporary variable whose numeric value determines which observation is read in a set statement. POINT= can cause the SET statement to use random (direct) access to read a SAS data set instead of sequential. Here is an example:

data test;
      do i=6,11,16;
            set sashelp.class point=i;
            output;
      end;
      stop;
run;

The point option is used to directly read observation number 6,11 and 16 from another dataset without reading any other observation. This is a dumb example but this option can be put to a lot of good uses and clever and efficient coding. We’ll talk about it later.

Cartesian product using a datastep:

Below is the code that will create the datasets I will use for Cartesian product and the third step will create a Cartesian product exactly the same as we created using proc sql above.

data cart1;
input id name $;
datalines;
1 test1
2 test2
3 test3
;
run;

data cart2;
input grp1 $ grp2 $;
datalines;
xx pp
yy qq
zz rr
aa ss
;
run;

/* Creates cartesian product */
data cartesian;
      set cart1;
      do i=1 to 4;
            set cart2 point=i;
            output;
      end;
run;

The third step creates a Cartesian product and if you read this post from start, you won’t need me to explain the functioning.

Going flexible:

As one of my friends on linkedin pointed out, that there is some hardcoding in this step owing to which the code is losing its flexibility. I have mentioned the number of observations(4) in the loop, which makes the above step not easy to use as you need to first get the observation in the second dataset and then change the number in the code.
We know that SAS can count the number of observation in a dataset, so why not let SAS do the counting while we are having a coffee..!!

The code to do this will use nobs option as follows:


data cartesian;
      set cart1;
      do i=1 to totobs;
            set cart2 point=i nobs=totobs;
            output;
      end;
run;

So we get the final code to do cartesian product and it uses just 7 lines..!!

Conclusion:Frankly speaking Creating a cartesian product using a datastep is not a very elegant technique, Just read this post for the sake of answering that final interview question which will seal your dream job. When it comes to creating a Cartesian product in code just use proc sql which is much cleaner and easier to understand.

Will be back with some more SAS magic. Goodbye till then.


Saurabh Singh  Chauhan
(er.chauhansaurabh@gmail.com)         
Note: Comments and suggestions are always welcome.



Disclaimer :
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc.in the USA and other countries. ® Indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies. 
The contents of this post are the works of the author(s)and do not necessarily represent the opinions,recommendations, or practices of any organization whatsoever.


No comments:

Post a Comment