Sunday, January 29, 2012

READING EXTERNAL DATA FILES IN SAS (Part 2)


READING EXTERNAL DATA FILES IN SAS
PART 2 : Reading RAW data

Objective : This article aims at discussing various ways of reading raw data in a SAS dataset in an objective manner which is easy to understand for the user and also help the user understand when which method is to be used.

Theory : We SAS programmers are so obsessed to working with datasets that we generally ignore the fact that most of the real world data is not in SAS datasets. The types of data storage include legacy mainframe files, client/server RDBMS, and simple text files on computers. While we want to use SAS to examine, analyze and display this data, SAS datasets are . So we have to convert the data to SAS data sets from its native form.

POINTING SAS TO THE EXTERNAL FILE :There are two ways to point SAS to external files:-
1) FILENAME statements
2) INFILE statements.

INFILE Statement:

INFILE statement is used to tell a DATA step from which file to read when an INPUT statement is executed. If the argument after the keyword ‘INFILE’ is quoted, the SAS Supervisor treats it as an actual external file name. Otherwise, the SAS Supervisor assumes that it is a fileref. An INFILE statement should be used in each data step reading an external file and must be executed before the input statement. You can have more than one INFILE statement, which allows you to read multiple files with a single DATA step.

If the INFILE statement is absent from a DATA step, the INPUT statement will default to INFILE CARDS.

When SAS begins each iteration of a DATA step, it "forgets" from which file(s) it had read. This means that you must be careful to execute an INFILE for each DATA step that must reference an external file.
The syntax of INFILE statement is:

INFILE file-spec <options> <host options>;

file-spec identifies the source of the data.file-spec may have 3 forms:

1) fileref :- fileref assigned to the required external file. The fileref must be assigned before the DATA step by using a FILENAME statement.

2) ‘external-file’ :- specifies the complete path and name of the required external file. This form is equivalent to specifying the external file with a FILENAME statement.

3) CARDS or DATALINES :- indicate that the data immediately follows the CARDS or DATALINES statement at the end of the data step.
options specify SAS options to control reading the file or to provide information about the file.

FILENAME Statement:

This statement creates a file reference that links a physical file with the SAS system. The FILENAME statement should be executed within a SAS session prior to the DATA step in which the fileref is used.

The syntax of the FILENAME statement is:

FILENAME fileref <device-type> ‘external-name’ <host-options>;

fileref is any valid SAS name.

With Device-type you can read and write data from devices rather than files

’external-name’ is the name of the file on the host system. The quotes are required. Some of the values are CATALOG, DDE, EMAIL. DISK is the default device type.

host-options specify operating system specific options.

INPUT STATEMENT BASICS: 
The INPUT statement as we have seen in my previous post is used to tell SAS how to read the file. When an INPUT statement is executed, it reads from the file pointed to by the most recently executed INFILE statement, if the data step has more than one INFILE statements.

Variable Lengths: In my last post I discussed about various input methods like list, column, formatted, named. When deciding which method of input to use the length of required variables is an important consideration. SAS determines the length of a variable based on how it is referenced in the program, subject to certain defaults. Numeric variables (including dates) and character variables read by list or named input default to 8 bytes. Character variables read by formatted input or format modified list input are set to the length as specified by the informat. Character variables read by column input are set to the number of columns being read.

We can set the desired lengths of variables using LENGTH and ATTRIB statements which must be given before the INPUT statement.

Input Pointer Controls:

Though I have already discussed these in my previous post but it is required for the completion of topic.
SAS has two pointer controls, the column pointer and the line pointer, We can change these pointers to reread data, change the order in which data fields are read, or handle logical records that are defined by multiple physical records. Use of these pointer controls provide a lot of flexibility to the input statement. Following are the pointer controls in SAS :

1) @expression

2) +expression

@ moves the pointer to the column number specified after it.
+ moves the pointer left or right the number of columns resulting from expression.

3) #expression

4) /

# moves the pointer to a specific line number based on expression.
/ moves the pointer to the beginning of the next line. You can use multiple / to move more than one line.

Line Hold Specifiers: When we have multiple Input statements in the data step or we want to hold the current pointer position within iterations of the data step we use these specifiers. These are placed at the end of the INPUT statement.

They withhold SAS from reading a new record when the next INPUT statement is executed. This technique is used to read more complex files.

@ (trailing at-sign) tells SAS to keep this record current until either an INPUT is executed without a trailing @ or trailing @@, or until this iteration of the DATA step is completed.

@@ (double trailing at-sign) tells SAS to keep this record current through successive iterations of the DATA step.

NOTE : The difference between @ and @@ is a very important interview question and is asked very frequently ( Personal experience J )

INFILE STATEMENT OPTIONS: The INFILE statement has many options, some of which are listed below:

END= This variable becomes 1 when the current record is the last record in the file.

EOF= label defines a label to which the program will automatically branch when an 

INPUT statement tries to read past the end of the file.

FIRSTOBS=number specifies the first record to be read. The default value is 1.

OBS=record-number is the number of the last record to read.

Processing the end of records: The options below define what action SAS will take when the program attempts to move the column pointer beyond the end of a record.

FLOWOVER tells SAS to continue reading succeeding records until all variables in the INPUT statement have been read.

MISSOVER tells SAS to set remaining variables in the INPUT statement to missing.
STOPOVER tells SAS to immediately execute a STOP statement, which will stop the DATA step with _ERROR_ equal to 1.

TRUNCOVER tells SAS to salvage whatever it can from short records without going to the next record.

LENGTH=variable defines a variable that is set to the length of the current record. Even though the LENGTH= variable is defined by the INFILE, its value is controlled by the execution of the INPUT statement.

DELIMITER= identifies one or more delimiters to be used with list input. Either a variable name or a quoted constant may be specified. The default delimiter is a blank.

DSD enables you to read delimited files correctly when consecutive delimiters are present due to missing values. It also enables you to read quoted text, a common occurrence when reading files created on desktop systems.

Conclusion : This post as a sequel of my first post explains a host of options used in reading raw data which can be used with those techniques for better control over your reading data. This post is not just for reading but to be practised. SO guyz keep practising.

Will be back with some SAS Magic again. Till then Goodbye..!!


Saurabh Singh Chauhan  
Note : Comments and suggestions are always welcome.

Disclaimer :
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc.in the USA and other countries. ® Indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies. 
The contents of this post are the works of the author(s)and do not necessarily represent the opinions,recommendations, or practices of any organization whatsoever.

Saturday, January 28, 2012

READING EXTERNAL DATA FILES IN SAS


READING EXTERNAL DATA FILES IN SAS

PART 1 : Input methods

Objective : This article aims at discussing various Input methods used read raw data in a SAS dataset in an objective manner which is easy to understand for the user and also help the user understand when and which method is to be used.

Theory : In SAS reading raw data can be accomplished in various ways and which method to use depends purely on the type of data being imported into SAS.
There are four major methods of reading raw data files. These include:

   1)  List Input
   2)  Column input
   3)  Formatted Input
   4)  Named Input

A)  LIST INPUT :

List input sequentially reads  your input data lines for data values. Data values do not have to be aligned in columns, but they have to be separated from one another by a delimiter such as a space or other delimiter such as a comma (,). Space or blank is the default delimiter, to specify some othe delimiter use DLM= option in infile statement.

List input requires only that you specify the variable names to be assigned to the data values in your input data line.If you are creating the variable for the first time in the input statement (variable not defined in a length or attrib statement), a dollar sign ($) must follow the name of a character variable.
List input is the most basic form of data reading (of course after cards and datalines..!! ) so it has certain restrictions like :

         -         Data values MUST be separated by a blank or a delimiter
         -         A placeholder must specifically show a missing value wherever present(e.g.   
      period for numeric missing value)
         -         For reading a character value greater than 8 characters long you have to 
      specifically define the length in a LENGTH or ATTRIB statement. Etc etc…

Below is the sample code where an external delimited raw data file is read with list input. Assume that an external space delimited  file results.dat exists in the given directory.

data results;
 infile 'c:\myfiles\results.dat' DLM=’,’;
 input rollno name $ age subject1 subject2 subject3 subject4 ;
run;

Simple right ?
This is the most basic way of reading data and as data can be arranged in different ways we have other forms of input.

B) COLUMN INPUT
Column input reads standard data values which are aligned in columns. In a column input a column range is specified after the variable name, if it is a character then a $ follows the variable name which in turn is followed by the column range. Column input solves certain limitation of the above list input method:

         -          In column input missing values need not be specifically shown with a period or 
      consecutive delimiters, it just have to be missing..!!
         -          You can specify column names in input statement in any order you like.
         -          Character data values can have embedded delimiters.

But it has some limitations as well:

         -          Data values must be in same column for all data lines (That’s why it is called 
      column input right!!)
         -          Leading blanks are removed from the field.

Below is a small example for reading data through column input:

data results;
infile 'c:\myfiles\results.dat';
  input roll_no 1-6 name $ 9-20
         age 7-8 subject1 21-25
         subject2 26-30;
run;

C) FORMATTED INPUT

Unlike the above two cases where we are reading only standard data, formatted input provides us a way to read non-standard data. The @ and + pointers which we use in formatted input provide a lot of flexibility in reading data.
Some advantages of formatted input include:

         -          Character values can contain embedded blanks.
         -          Placeholders for missing values are not required.
         -          We can move through data in a direct manner (@) or a manner relative to 
      current position from where the data is being read(+) .
         -          Data values can be specified in any order in the input statement.

@ Pointer tells the pointer to move to the column specified after the @ sign.

+ Pointer tells the pointer to move further columns from the current position which is specified after the + sign.


data patients;
 infile 'c:\readdata\results.dat';
 input @9 name $12. @1 roll no 6.
       @7 age 2. @21 (subject1-subject3) (8. +1) ;
run;

D) MODIFIED LIST INPUT

Modified list input is a combination of list input and formatted input. Like list input, this style is restricted to reading variables in order. The data values do not have to be aligned in columns, but they do have to be separated from one another by a space or other delimiter such as a comma. Additionally, you can include informats that allow you to read more complex data values than you can read with simple list input.
In modified list input we use format modifiers to read complex data. There are three types of format modifiers that are used:

1) AMPERSAND (&) – This modifier specified after the variable name helps in reading data with embedded delimiters. SAS will stop reading the value when 2 consecutive delimiters are encountered.

2) TILDE (~) This character at the top left of your keyboard just below the Esc key tells SAS to read single and double quotation marks and delimiters as part of the data.

3) COLON (:) The colon modifier is used to specify informats for data values in some specific format like dollar8. The informat is specified right after the colon. SAS will read data until the delimiter is encountered or the width specified in the informat is reached.

An example using column input can be like :

data hospital;
infile datalines delimiter=',';
input pat_name : $20.
      Med_hist ~ $50.;
datalines;
George Smith,Had a long history of bronchitis
Wiber Stuart,Sensory organs stopped functioning at the age of 14
Rosy Mary,Had acute heart pain and nausea
;;;;
run;

Conclusion : The aim of this report was not to have an exhaustive pile of all concepts to read raw data files but to give a concise and easy understanding of options available for reading raw data files and their merits and demerits over each other.

Reading data is a very interesting concept in SAS and there are many challenges that you can face. So keep practicing with whatever data you get. Other types of data reading will be covered in a different post.

Will be back with some SAS Magic again. Till then Goodbye..!!


Saurabh Singh Chauhan
Note : Comments and suggestions are always welcome. 

Disclaimer :
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc.in the USA and other countries. ® Indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies. 
The contents of this post are the works of the author(s)and do not necessarily represent the opinions,recommendations, or practices of any organization whatsoever.