Saturday, January 28, 2012

READING EXTERNAL DATA FILES IN SAS


READING EXTERNAL DATA FILES IN SAS

PART 1 : Input methods

Objective : This article aims at discussing various Input methods used read raw data in a SAS dataset in an objective manner which is easy to understand for the user and also help the user understand when and which method is to be used.

Theory : In SAS reading raw data can be accomplished in various ways and which method to use depends purely on the type of data being imported into SAS.
There are four major methods of reading raw data files. These include:

   1)  List Input
   2)  Column input
   3)  Formatted Input
   4)  Named Input

A)  LIST INPUT :

List input sequentially reads  your input data lines for data values. Data values do not have to be aligned in columns, but they have to be separated from one another by a delimiter such as a space or other delimiter such as a comma (,). Space or blank is the default delimiter, to specify some othe delimiter use DLM= option in infile statement.

List input requires only that you specify the variable names to be assigned to the data values in your input data line.If you are creating the variable for the first time in the input statement (variable not defined in a length or attrib statement), a dollar sign ($) must follow the name of a character variable.
List input is the most basic form of data reading (of course after cards and datalines..!! ) so it has certain restrictions like :

         -         Data values MUST be separated by a blank or a delimiter
         -         A placeholder must specifically show a missing value wherever present(e.g.   
      period for numeric missing value)
         -         For reading a character value greater than 8 characters long you have to 
      specifically define the length in a LENGTH or ATTRIB statement. Etc etc…

Below is the sample code where an external delimited raw data file is read with list input. Assume that an external space delimited  file results.dat exists in the given directory.

data results;
 infile 'c:\myfiles\results.dat' DLM=’,’;
 input rollno name $ age subject1 subject2 subject3 subject4 ;
run;

Simple right ?
This is the most basic way of reading data and as data can be arranged in different ways we have other forms of input.

B) COLUMN INPUT
Column input reads standard data values which are aligned in columns. In a column input a column range is specified after the variable name, if it is a character then a $ follows the variable name which in turn is followed by the column range. Column input solves certain limitation of the above list input method:

         -          In column input missing values need not be specifically shown with a period or 
      consecutive delimiters, it just have to be missing..!!
         -          You can specify column names in input statement in any order you like.
         -          Character data values can have embedded delimiters.

But it has some limitations as well:

         -          Data values must be in same column for all data lines (That’s why it is called 
      column input right!!)
         -          Leading blanks are removed from the field.

Below is a small example for reading data through column input:

data results;
infile 'c:\myfiles\results.dat';
  input roll_no 1-6 name $ 9-20
         age 7-8 subject1 21-25
         subject2 26-30;
run;

C) FORMATTED INPUT

Unlike the above two cases where we are reading only standard data, formatted input provides us a way to read non-standard data. The @ and + pointers which we use in formatted input provide a lot of flexibility in reading data.
Some advantages of formatted input include:

         -          Character values can contain embedded blanks.
         -          Placeholders for missing values are not required.
         -          We can move through data in a direct manner (@) or a manner relative to 
      current position from where the data is being read(+) .
         -          Data values can be specified in any order in the input statement.

@ Pointer tells the pointer to move to the column specified after the @ sign.

+ Pointer tells the pointer to move further columns from the current position which is specified after the + sign.


data patients;
 infile 'c:\readdata\results.dat';
 input @9 name $12. @1 roll no 6.
       @7 age 2. @21 (subject1-subject3) (8. +1) ;
run;

D) MODIFIED LIST INPUT

Modified list input is a combination of list input and formatted input. Like list input, this style is restricted to reading variables in order. The data values do not have to be aligned in columns, but they do have to be separated from one another by a space or other delimiter such as a comma. Additionally, you can include informats that allow you to read more complex data values than you can read with simple list input.
In modified list input we use format modifiers to read complex data. There are three types of format modifiers that are used:

1) AMPERSAND (&) – This modifier specified after the variable name helps in reading data with embedded delimiters. SAS will stop reading the value when 2 consecutive delimiters are encountered.

2) TILDE (~) This character at the top left of your keyboard just below the Esc key tells SAS to read single and double quotation marks and delimiters as part of the data.

3) COLON (:) The colon modifier is used to specify informats for data values in some specific format like dollar8. The informat is specified right after the colon. SAS will read data until the delimiter is encountered or the width specified in the informat is reached.

An example using column input can be like :

data hospital;
infile datalines delimiter=',';
input pat_name : $20.
      Med_hist ~ $50.;
datalines;
George Smith,Had a long history of bronchitis
Wiber Stuart,Sensory organs stopped functioning at the age of 14
Rosy Mary,Had acute heart pain and nausea
;;;;
run;

Conclusion : The aim of this report was not to have an exhaustive pile of all concepts to read raw data files but to give a concise and easy understanding of options available for reading raw data files and their merits and demerits over each other.

Reading data is a very interesting concept in SAS and there are many challenges that you can face. So keep practicing with whatever data you get. Other types of data reading will be covered in a different post.

Will be back with some SAS Magic again. Till then Goodbye..!!


Saurabh Singh Chauhan
Note : Comments and suggestions are always welcome. 

Disclaimer :
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc.in the USA and other countries. ® Indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies. 
The contents of this post are the works of the author(s)and do not necessarily represent the opinions,recommendations, or practices of any organization whatsoever.