How to create a Cartesian
product using a datastep ?
Objective : Hello readers. This post is unlike
all my previous posts which deal with one or the other concepts of SAS and
explain them in detail. The idea for this post came up from one of the
interview questions which I have been asked many times over (and I am asking it
now when I am on the other side of the table :-) ).
In one of my previous posts I have compared the
datastep merges with SQL joins, but no one ever tells how to do in datastep
what is called a Cartesian product in SQL, and which is pretty easy to do in
SQL. Interestingly Cartesian product forms tha base of SQl joins, so if we can
replicate SQL joins in datastep then we should also be able to replicate
Cartesian product..!!
The answer is Yes we can do it, and easily. But
you’ll know it only when you see it.
So here is the short and simple way to do it, but
before that let me explain for people who don’t know about Cartesian product,
what it is and what is a point option(Shhh..this is the key)
Cartesian product:
This is a mathematical term which is used in set
theory. A Cartesian product of set A and set B is a set of all ordered pairs of
the elements of A and B.
That was complicated..!! Thank GOD we don’t have to pass a mathematics
paper anymore.
Ok, forget maths, in SAS when we do a Cartesian
product of dataset A and dataset B, then we create a dataset containing all
rows of A horizontally combined with all rows of B.
So if A has 3 observations and B has 4, Then the
Cartesian product will have 12 observation and all the unique columns of A and
B.
In SQL we do a Cartesian product similar to a join
but just eliminating giving any join
names or or joining conditions. Here is the syntax:
Proc
sql;
create table cartesian1
as
select * from
cart1,cart2;
Quit;
Simple right..!!
Though it is not this simple in datastep and that is
why I never got correct answer from any of the interviewees till date. Actually
the reason is they have never seen it before.
Also using this technique requires a knowledge of
point option in SAS datastep. Here is a brief overview of the option:
Point option:
The point creates a temporary variable whose numeric
value determines which observation is read in a set statement. POINT= can cause
the SET statement to use random (direct) access to read a SAS data set instead
of sequential. Here is an example:
data
test;
do i=6,11,16;
set
sashelp.class point=i;
output;
end;
stop;
run;
The point option is used to directly read
observation number 6,11 and 16 from another dataset without reading any other
observation. This is a dumb example but this option can be put to a lot of good
uses and clever and efficient coding. We’ll talk about it later.
Cartesian product using a datastep:
Below is the code that will create the datasets I
will use for Cartesian product and the third step will create a Cartesian
product exactly the same as we created using proc sql above.
data
cart1;
input id name $;
datalines;
1 test1
2 test2
3 test3
;
run;
data
cart2;
input grp1 $ grp2 $;
datalines;
xx pp
yy qq
zz rr
aa ss
;
run;
/* Creates cartesian product */
data
cartesian;
set cart1;
do i=1 to 4;
set cart2
point=i;
output;
end;
run;
The third step creates a Cartesian product and if
you read this post from start, you won’t need me to explain the functioning.
Going flexible:
As one of my friends on linkedin pointed out, that there is some hardcoding in this step owing to which the code is losing its flexibility. I have mentioned the number of observations(4) in the loop, which makes the above step not easy to use as you need to first get the observation in the second dataset and then change the number in the code.
We know that SAS can count the number of observation in a dataset, so why not let SAS do the counting while we are having a coffee..!!
The code to do this will use nobs option as follows:
data cartesian;
set cart1;
do i=1 to totobs;
set cart2 point=i nobs=totobs;
output;
end;
run;
So we get the final code to do cartesian product and it uses just 7 lines..!!
Going flexible:
As one of my friends on linkedin pointed out, that there is some hardcoding in this step owing to which the code is losing its flexibility. I have mentioned the number of observations(4) in the loop, which makes the above step not easy to use as you need to first get the observation in the second dataset and then change the number in the code.
We know that SAS can count the number of observation in a dataset, so why not let SAS do the counting while we are having a coffee..!!
The code to do this will use nobs option as follows:
data cartesian;
set cart1;
do i=1 to totobs;
set cart2 point=i nobs=totobs;
output;
end;
run;
So we get the final code to do cartesian product and it uses just 7 lines..!!
Conclusion:Frankly speaking Creating a cartesian product using a datastep is not a very elegant technique, Just read this post for the sake of answering that final
interview question which will seal your dream job. When it comes to creating a
Cartesian product in code just use proc sql which is much cleaner and easier to
understand.
Will be back with some more SAS magic. Goodbye till then.
Saurabh Singh Chauhan
(er.chauhansaurabh@gmail.com)
Note: Comments and suggestions are always welcome.
Disclaimer :
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc.in the USA and other countries. ® Indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies.
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc.in the USA and other countries. ® Indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies.
The contents of this post are the works of the
author(s)and do not necessarily represent the opinions,recommendations, or
practices of any organization whatsoever.
No comments:
Post a Comment