Tuesday, April 3, 2012

SAS DATASTEP MERGE AND SQL JOINS - A comparison


SAS DATASTEP MERGE AND SQL JOINS
A COMPARISON

Objective: The objective of this post is to get a reader get a good knowledge of datastep merges and SQL joins and provide a comparison between them. Also towards the end I would specifically outline the differences between these two to let you decide which of the two to use depending on your output requirements.
There are other ways as well to combine SAS datasets like multiple SET statements and hashing but I will be discussing them in a separate post.

Datastep merge:
A datastep merge is one of the most heavily used programming constructs in SAS. It helps us to combine two or more SAS datasets and outputs a single combined dataset containing all the variables from both datasets if not specified otherwise. I will be using small datasets to demonstrate the merging process.
Let us first look at the syntax:



Data dset1;
Merge dset2(in=a) dset3(in=b);
By byvar1 byvar2... ;
If a and b;
Run;

The datasets I will be using are as follows:

data a;
input id name $ ;
datalines;
1 a1
2 a2
3 a3
4 a4
;
run;


data b;
input id new_name $ ;
datalines;
1 b1
2 b2
3 b3
4 b4
;
run;

I will be updating these datasets according to requirements.
So first we will see a small example of merging datasets A and B to make a new dataset C.

data c;
merge a(in=a) b(in=b);
by id;
run;

The ouput is as follows:
        A                 B                     C



+



=


This is the simplest form of merge where there is one to one mapping of all the key variables.
Now we will look at proc sql joins and also compare them with corresponding datastep merge and point out the differences if any.

Proc SQL joins:
A Proc SQL join works same as a datastep merge to combine two or more SAS datasets. There are many types of SQL joins which can be used based on your output requirements.

1) Inner Join

An inner join provides only the matching rows from both datasets. Its syntax is:

Proc sql;
      create table company as
            select a.empid, b.name, b.salary
            from employee as a inner join salary as b
            where a.empid=b.emp_id;
Quit;

Here the final output will be all the employees having their salary information in the salary dataset. If the info for a employee is found in any one of the datasets then that employee will not be present in the final table.

1a) Inner Join with datastep:

The same inner join can be performed with a datastep merge.

data company;
  merge employee(in=a) salary(rename=(emp_id=empid) in=b);
  by empid;
  if a and b;
run;

We required the rename as datastep cannot perform a merge on id variables with different names.

Actually there is a way to merge by differently named id variables but that method is at from the notion of elegant coding and is just developed as a workaround as you can always rename id variables easily.

OK OK If you require it so much I will discuss the method towards the end.

2) Outer Join (Left)

A Left Outer join outputs the matching rows from both datasets as well as it also outputs the non-matching from the left dataset or the dataset specified first in the query. Its syntax is:

Proc sql;
      create table company as
            select a.empid, b.name, b.salary
            from employee as a left outer join salary as b
            where a.empid=b.emp_id;
Quit;

Here the final output will be all the employees whether or not they have their salary information in the salary dataset. If you want to list all the employees irrespective of their salary being updated in the salary dataset then you will use left outer join.

2a) Left outer Join with datastep:

The same join can be performed with a datastep merge. We just need to change the IF condition.

data company;
  merge employee(in=a) salary(rename=(emp_id=empid) in=b);
  by empid;
  if a;
run;

3) Outer Join (Right)
A Right Outer join outputs the matching rows from both datasets as well as it also outputs the non-matching from the right dataset or the dataset specified second in the query. Its syntax is:

Proc sql;
      create table company as
            select a.empid, b.name, b.salary
            from employee as a right outer join salary as b
            where a.empid=b.emp_id;
Quit;

Here the final output will be all the employees who have their salary information in the salary dataset as well as the employees who are not currently updated in employee dataset but their salary info was updated. The situation looks silly but in this case you will be using a right outer join.

One point to note here is that we are taking empid from the employee dataset so for these non-matching employees of the salary dataset the empid will be missing which is definitely not desirable. So as a precaution we generally use the COALESCE function in right outer joins.

Proc sql;
      create table company as
            select coalesce(a.empid,b.emp_id), b.name, b.salary
            from employee as a left outer join salary as b
            where a.empid=b.emp_id;
Quit;

This will solve the problem of missing employee ids.

3a) Right Outer Join with datastep:

The right outer join can be performed with a datastep merge similarly as the left outer.

data company;
  merge employee(in=a) salary(rename=(emp_id=empid) in=b);
  by empid;
  if b;
run;

4) Full Join

A Full join as you have rightly guessed by now outputs the matching rows from both datasets as well as it also outputs the non-matching from both the datasets. Its syntax is:

Proc sql;
      create table company as
            select coalesce(a.empid,b.emp_id), b.name, b.salary
            from employee as a full join salary as b
            where a.empid=b.emp_id;
Quit;

Here the final output will be all the employees who have their salary information either in the salary dataset or the employee dataset.

Here also we use COALESCE function or the same reason as in Right outer join.

4a) Full Join with datastep:

The full join can also be performed with a datastep merge. We just need to eliminate the if condition or for better understanding we can keep the condition as IF A OR B;

data company;
  merge employee(in=a) salary(rename=(emp_id=empid) in=b);
  by empid;
  if a or b;
run;

5) Cartesian Product

A simplest join in proc sql where the final out is each row of first dataset combined with each row of the second. Its syntax is:

Proc sql;
      create table company as
            select a.empid, b.name, b.salary
            from employee as a , salary as b
            ;
Quit;

Here the final output will be product of all rows from the employee dataset with all rows in the salary dataset.
Practically this looks wrong as everyone’s salary will be everyone others’ salary, perfect Socialism J

But surprisingly this Cartesian product is the basis of all joins. Whenever you specify a join a Cartesian product is done and the output rows are restricted by certain conditions. So knowing this is necessary.

6) Cartesian Product through a datastep

This technique is not very intuitive but is asked in a lot of interviews so I am including it here:

data every_combination;
  /* Set one of your data sets, usually the larger data set */
  set one;
  do i=1 to n;
    /* For every observation in the first data set,    */
    /* read in each observation in the second data set */
    set two point=i nobs=n;
    output;
  end;
 run;

This technique creates a product of all rows.
So above is the hint for answering the question in merging SAS datasets with different id variables names. Let’s see whether you can find the answer.

7) Many to many joins

These are not a type of join but a property of the data in the tables you are joining. I am including this here as a type of warning. This occurs when both tables have multiple instances of the same key variable. This can lead to unexpected results and the output is different from PROC SQL join to a DATASTEP MERGE in this case as proc sql is Cartesian product based while merge is one to one matching.

You can create a few examples and study the differences this will help you understand the process better.

Below are the some differences between a merge and a join :

MERGE
JOIN
By default MERGE does a one to one join which will have matching rows and all the remaining rown from both datasets

By default a Cartesian Product is produced;It means joining each row from one table to every row of the other table

Default Match merging is the same as full ouetr join in SQL

Match join by default gives the ouptput of an inner-join

Multiple dataset can be used while performing outer join.

Outer join can only be done on two datasets at a time.

Sorting is required.

Sorting is not required.

All variable names in datasets should be identical.

Joining is possible on columns with differing names


Many-to-many merge cannot be done
due to one-to-one behavior of MERGE
Relatively easy to do a many-to-many join.

No advantages while working with
database tables  as databases can do the processing faster.


Very advantageous when working with tables
stored in Database Servers because
databases are designed for SQL processing.




Conclusion: This paper gives basic information about the SQL merges and joins and is intended to be used as a starter or a reference in this topic.

Will be back with some more SAS magic. Goodbye Till then.

Saurabh Singh  Chauhan
(er.chauhansaurabh@gmail.com)         
Note: Comments and suggestions are always welcome.

Disclaimer :
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc.in the USA and other countries. ® Indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies. 
The contents of this post are the works of the author(s)and do not necessarily represent the opinions,recommendations, or practices of any organization whatsoever.