Sunday, January 29, 2012

READING EXTERNAL DATA FILES IN SAS (Part 2)


READING EXTERNAL DATA FILES IN SAS
PART 2 : Reading RAW data

Objective : This article aims at discussing various ways of reading raw data in a SAS dataset in an objective manner which is easy to understand for the user and also help the user understand when which method is to be used.

Theory : We SAS programmers are so obsessed to working with datasets that we generally ignore the fact that most of the real world data is not in SAS datasets. The types of data storage include legacy mainframe files, client/server RDBMS, and simple text files on computers. While we want to use SAS to examine, analyze and display this data, SAS datasets are . So we have to convert the data to SAS data sets from its native form.

POINTING SAS TO THE EXTERNAL FILE :There are two ways to point SAS to external files:-
1) FILENAME statements
2) INFILE statements.

INFILE Statement:

INFILE statement is used to tell a DATA step from which file to read when an INPUT statement is executed. If the argument after the keyword ‘INFILE’ is quoted, the SAS Supervisor treats it as an actual external file name. Otherwise, the SAS Supervisor assumes that it is a fileref. An INFILE statement should be used in each data step reading an external file and must be executed before the input statement. You can have more than one INFILE statement, which allows you to read multiple files with a single DATA step.

If the INFILE statement is absent from a DATA step, the INPUT statement will default to INFILE CARDS.

When SAS begins each iteration of a DATA step, it "forgets" from which file(s) it had read. This means that you must be careful to execute an INFILE for each DATA step that must reference an external file.
The syntax of INFILE statement is:

INFILE file-spec <options> <host options>;

file-spec identifies the source of the data.file-spec may have 3 forms:

1) fileref :- fileref assigned to the required external file. The fileref must be assigned before the DATA step by using a FILENAME statement.

2) ‘external-file’ :- specifies the complete path and name of the required external file. This form is equivalent to specifying the external file with a FILENAME statement.

3) CARDS or DATALINES :- indicate that the data immediately follows the CARDS or DATALINES statement at the end of the data step.
options specify SAS options to control reading the file or to provide information about the file.

FILENAME Statement:

This statement creates a file reference that links a physical file with the SAS system. The FILENAME statement should be executed within a SAS session prior to the DATA step in which the fileref is used.

The syntax of the FILENAME statement is:

FILENAME fileref <device-type> ‘external-name’ <host-options>;

fileref is any valid SAS name.

With Device-type you can read and write data from devices rather than files

’external-name’ is the name of the file on the host system. The quotes are required. Some of the values are CATALOG, DDE, EMAIL. DISK is the default device type.

host-options specify operating system specific options.

INPUT STATEMENT BASICS: 
The INPUT statement as we have seen in my previous post is used to tell SAS how to read the file. When an INPUT statement is executed, it reads from the file pointed to by the most recently executed INFILE statement, if the data step has more than one INFILE statements.

Variable Lengths: In my last post I discussed about various input methods like list, column, formatted, named. When deciding which method of input to use the length of required variables is an important consideration. SAS determines the length of a variable based on how it is referenced in the program, subject to certain defaults. Numeric variables (including dates) and character variables read by list or named input default to 8 bytes. Character variables read by formatted input or format modified list input are set to the length as specified by the informat. Character variables read by column input are set to the number of columns being read.

We can set the desired lengths of variables using LENGTH and ATTRIB statements which must be given before the INPUT statement.

Input Pointer Controls:

Though I have already discussed these in my previous post but it is required for the completion of topic.
SAS has two pointer controls, the column pointer and the line pointer, We can change these pointers to reread data, change the order in which data fields are read, or handle logical records that are defined by multiple physical records. Use of these pointer controls provide a lot of flexibility to the input statement. Following are the pointer controls in SAS :

1) @expression

2) +expression

@ moves the pointer to the column number specified after it.
+ moves the pointer left or right the number of columns resulting from expression.

3) #expression

4) /

# moves the pointer to a specific line number based on expression.
/ moves the pointer to the beginning of the next line. You can use multiple / to move more than one line.

Line Hold Specifiers: When we have multiple Input statements in the data step or we want to hold the current pointer position within iterations of the data step we use these specifiers. These are placed at the end of the INPUT statement.

They withhold SAS from reading a new record when the next INPUT statement is executed. This technique is used to read more complex files.

@ (trailing at-sign) tells SAS to keep this record current until either an INPUT is executed without a trailing @ or trailing @@, or until this iteration of the DATA step is completed.

@@ (double trailing at-sign) tells SAS to keep this record current through successive iterations of the DATA step.

NOTE : The difference between @ and @@ is a very important interview question and is asked very frequently ( Personal experience J )

INFILE STATEMENT OPTIONS: The INFILE statement has many options, some of which are listed below:

END= This variable becomes 1 when the current record is the last record in the file.

EOF= label defines a label to which the program will automatically branch when an 

INPUT statement tries to read past the end of the file.

FIRSTOBS=number specifies the first record to be read. The default value is 1.

OBS=record-number is the number of the last record to read.

Processing the end of records: The options below define what action SAS will take when the program attempts to move the column pointer beyond the end of a record.

FLOWOVER tells SAS to continue reading succeeding records until all variables in the INPUT statement have been read.

MISSOVER tells SAS to set remaining variables in the INPUT statement to missing.
STOPOVER tells SAS to immediately execute a STOP statement, which will stop the DATA step with _ERROR_ equal to 1.

TRUNCOVER tells SAS to salvage whatever it can from short records without going to the next record.

LENGTH=variable defines a variable that is set to the length of the current record. Even though the LENGTH= variable is defined by the INFILE, its value is controlled by the execution of the INPUT statement.

DELIMITER= identifies one or more delimiters to be used with list input. Either a variable name or a quoted constant may be specified. The default delimiter is a blank.

DSD enables you to read delimited files correctly when consecutive delimiters are present due to missing values. It also enables you to read quoted text, a common occurrence when reading files created on desktop systems.

Conclusion : This post as a sequel of my first post explains a host of options used in reading raw data which can be used with those techniques for better control over your reading data. This post is not just for reading but to be practised. SO guyz keep practising.

Will be back with some SAS Magic again. Till then Goodbye..!!


Saurabh Singh Chauhan  
Note : Comments and suggestions are always welcome.

Disclaimer :
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc.in the USA and other countries. ® Indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies. 
The contents of this post are the works of the author(s)and do not necessarily represent the opinions,recommendations, or practices of any organization whatsoever.