Sunday, March 24, 2013

Differences in UPADTE and MODIFY statements in SAS


Difference in UPDATE and MODIFY statements in SAS

Intoduction
Update and Modify are two underused statements in SAS programming language, but the questions regarding the working of each and the differences between them is a heavily asked question in SAS interviews, the latter one being more frequent.

This post sheds some light on working of update and modify with extra stress on comparing them and listing out the differences.

This post has enough information to get you through most of the interview questions, but is not an exhaustive material on UPDATE and MODIFY statements. For complete working of these please visit SAS website.
So let us start with explanation of each and then we’ll see the differences:



What is Updating:

 Updating a SAS dataset means replacing the variable values in one data (typically called the master dataset) with values from another data set (the transaction data set).
If optionUPDATEMODE=MISSINGCHECK, then blankvalues in a transaction dataset will not replace current values in a master data set.
If option UPDATEMODE=NOMISSINGCHECK, then blank values in transaction dataset replace preexisting values in the master data set.
By default it is set to MISSINGCHECK.

You update a data set by using the UPDATE statement along with a BY statement. Both of the input data sets must be sorted by the variable that you use in the BY statement. The following figure shows the results of updating a SAS data set.




As you can see the names for accounts numbers 6 and 8 are updated according to the transaction dataset.
Also a new account (Acc no. 9) has been added to the master dataset.

data master;
input acc_no first_name $ last_name $;
datalines;
1 A B
2 A B
3 A B
4 A B
5 A B
6 A B
7 A B
8 A B
;
run;


data transaction;
input acc_no first_name $ last_name $;
datalines;
6 C D
8 C D
9 C D
;
run;

data master;
      update master transaction;
      by acc_no;
run;

So think of the places where you’ll require an UPDATE. One place is a company dataset which stores the salaries of the employees. Every month the salaries are updated for existing employees and new joinees and their salaries are appended to the existing dataset.

Definition of Modifying:

When we MODIFY a SAS data set we replace  observations or parts of them in an existing data set. Modifying a SAS data set is similar to updating a SAS data set, but the following differences exist:
Modifying cannot create a new data set, while updating can. Though it is actually not a limitation as it results in less disk space being used.

Unlike updating, modifying does not require that the master data set or the transaction data set be sorted.
We can change an existing dataset by using the MODIFY statement with a BY statement. The following figure shows the results.



The code used for this is :

data master;
input acc_no first_name $ last_name $;
datalines;
1 A B
2 A B
3 A B
4 A B
5 A B
6 A B
7 A B
8 A B
;
run;

data transaction;
input acc_no first_name $ last_name $;
datalines;
6 C D
8 C D
;
run;

data master;
modify master transaction;
by acc_no;
run;

If we try to add a new value using MODIFY statement with BY statement  we get an error as MODIFY searches a match for all observations and if a match for any observations in the transaction dataset is not found, it thoriws an error. 
See the example below:

data transaction;
input acc_no first_name $ last_name $;
datalines;
6 C D
8 C D
9 C D
;
run;

data master;
      modify master transaction;
      by acc_no;
run;
ERROR: The TRANSACTION data set observation does not exist on the MASTER data set.
ERROR: No matching observation was found in MASTER data set.

Differences:

MODIFY statement, has overcome the limitations of SET, MERGE, and UPDATE statements. 
For eg, this statement:
  • Can be used to update a dataset in place(Does not make a copy)
  • Can use sequential, matching, or direct access method. 
  • Sorting is not required unlike MERGE and UPDATE while match merging.
  • Unlike Merge and UPDATE does not give an error if duplicate BY values exist in  master or transaction dataset or both.
  • Requires less disk space than UPDATE, MERGE etc. as a temporary copy of master dataset is not made.
·         
      The MODIFY statement can only be used to update values of existing data set variables; no changes can be made with respect to the structure of the program data vector that is created at execution time for the data set being modified.

The syntax for MODIFY and update are similar when performing matching access.
In both the master dataset is mentioned followed by the transaction dataset. Both require a BY statement which lists the variables whose values from the trans. dataset are used to find observations in the master data set. However, the execution processing for both statements is not same. 
Written below is how  observations for update are retrieved in  matching access. During execution,
  1. First the MODIFY statement pick up a row from the trans.n dataset;
  2. Secondly, MODIFY generates a WHERE condition, which specifies the value of the BY variable from the row of transaction dataset , to find and get a row from the master data set.
The following rules always apply when duplicate values exist:
  • If there are multiple observations with the same value for the BY variable in the master data set, MODIFY changes only the values for the first occurrence. For UPDATE statement is that processing does not continue if duplicate values of the BY variable are detected in the master data set.
  • If there are multiple observations with the same value for the BY variable in the transaction data set, MODIFY performs operations consecutively, a manner simi1ar to that of the UPDATE statement.
I found the table below on SAS website, which summarizes the differences between MERGE, MODIFY and UPADTE :

Comparing Modifying, Merging, and Updating Data Sets



Comparison
MERGE
UPDATE
MODIFY
Is sorting of datasets required?
Merge with By statement: Required
One on one merge: Not required
Required
Not required
Requirement of Unique By values
Not required
Master data set: Required
Transaction data set: Not required
Not required
Can variables be created or deleted
Can be created
Can be created
Cannot be created
Number of dsets that can be combined
Unlimited
Two
Two
Missing value handling
Non-missing values are overwritten
If there are missing values in transaction dataset the values in the maser are not updated
 If UPDATEMODE option is MISSINGCHECK then does not overwrite non-missing else overwrite.
Default: MISSINGCHECK


Conclusion: Update and Modify have a lot of things in common but have very specific differences and that is why this question comes up again and again in interviews. Hope this post will help you get the hang of these useful statements and gain you confidence.

Will be back with some more SAS magic. Goodbye till then.

References :

Saurabh Singh  Chauhan
(er.chauhansaurabh@gmail.com)         
Note: Comments and suggestions are always welcome.



Disclaimer :
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc.in the USA and other countries. ® Indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies. 
The contents of this post are the works of the author(s)and do not necessarily represent the opinions,recommendations, or practices of any organization whatsoever.