READING EXTERNAL DATA FILES IN SAS
PART 2 : Reading RAW data
Objective : This article aims at discussing
various ways of reading raw data in a SAS dataset in an objective manner which
is easy to understand for the user and also help the user understand when which
method is to be used.
Theory : We SAS programmers are so obsessed
to working with datasets that we generally ignore the fact that most of the
real world data is not in SAS datasets. The types of data storage include
legacy mainframe files, client/server RDBMS, and simple text files on computers.
While we want to use SAS to examine, analyze and display this data, SAS datasets
are . So we have to convert the data to SAS data sets from its native form.
POINTING
SAS TO THE EXTERNAL FILE :There are two ways to point SAS to external files:-
1) FILENAME
statements
2) INFILE
statements.
INFILE
Statement:
INFILE
statement is used to tell a DATA step from which file to read when an INPUT
statement is executed. If the argument after the keyword ‘INFILE’ is quoted,
the SAS Supervisor treats it as an actual external file name. Otherwise, the SAS
Supervisor assumes that it is a fileref. An INFILE statement should be used in
each data step reading an external file and must be executed before the input
statement. You can have more than one INFILE statement, which allows you to
read multiple files with a single DATA step.
If the
INFILE statement is absent from a DATA step, the INPUT statement will default
to INFILE CARDS.
When SAS
begins each iteration of a DATA step, it "forgets" from which file(s)
it had read. This means that you must be careful to execute an INFILE for each
DATA step that must reference an external file.
The syntax
of INFILE statement is:
INFILE
file-spec <options> <host options>;
file-spec
identifies the source of the data.file-spec may have 3 forms:
1) fileref :-
fileref assigned to the required external file. The fileref must be assigned before
the DATA step by using a FILENAME statement.
2) ‘external-file’
:- specifies the complete path and name of the required external file. This
form is equivalent to specifying the external file with a FILENAME statement.
3) CARDS or DATALINES :- indicate that the data immediately follows the CARDS or DATALINES statement at the end of the data step.
options
specify SAS options to control reading the file or to provide information about
the file.
FILENAME
Statement:
This
statement creates a file reference that links a physical file with the SAS
system. The FILENAME statement should be executed within a SAS session prior to
the DATA step in which the fileref is used.
The syntax
of the FILENAME statement is:
FILENAME
fileref <device-type> ‘external-name’ <host-options>;
fileref is
any valid SAS name.
With Device-type
you can read and write data from
devices rather than files
’external-name’
is the name of the file on the host system. The quotes are required. Some of
the values are CATALOG, DDE, EMAIL. DISK is the default device type.
host-options
specify operating system specific options.
INPUT
STATEMENT BASICS:
The INPUT statement as we have seen in my previous post is used to tell SAS
how to read the file. When an INPUT statement is executed, it reads from the file
pointed to by the most recently executed INFILE statement, if the data step has
more than one INFILE statements.
Variable
Lengths:
In
my last post I discussed about various input methods like list, column,
formatted, named. When
deciding which method of input to use the length of required variables is an
important consideration. SAS determines the length of a variable based on how
it is referenced in the program, subject to certain defaults. Numeric variables
(including dates) and character variables read by list or named input default
to 8 bytes. Character variables read by formatted input or format modified list
input are set to the length as specified by the informat. Character variables
read by column input are set to the number of columns being read.
We can set
the desired lengths of variables using LENGTH and ATTRIB statements which must
be given before the INPUT statement.
Input
Pointer Controls:
Though I have
already discussed these in my previous post but it is required for the
completion of topic.
SAS has two
pointer controls, the column pointer and
the line pointer, We can change
these pointers to reread data, change the order in which data fields are read,
or handle logical records that are defined by multiple physical records. Use of
these pointer controls provide a lot of flexibility to the input statement.
Following are the pointer controls in SAS :
1)
@expression
2)
+expression
@
moves the pointer to the column
number specified after it.
+
moves the pointer left or
right the number of columns resulting from expression.
3)
#expression
4)
/
#
moves the pointer to a
specific line number based on expression.
/
moves the pointer to the
beginning of the next line. You can use multiple / to move more than one line.
Line
Hold Specifiers:
When we have multiple Input statements in the data step or we want to hold the
current pointer position within iterations of the data step we use these
specifiers. These are placed at the end of the INPUT statement.
They withhold
SAS from reading a new record when the next INPUT statement is executed. This technique
is used to read more complex files.
@
(trailing at-sign) tells
SAS to keep this record current until either an INPUT is executed without a
trailing @ or trailing @@, or until this iteration of the DATA step is
completed.
@@
(double trailing at-sign) tells SAS to keep this record current through successive iterations of
the DATA step.
NOTE : The
difference between @ and @@ is a very important interview question and is asked
very frequently ( Personal experience J
)
INFILE
STATEMENT OPTIONS: The INFILE
statement has many options, some of which are listed below:
END=
This variable becomes
1 when the current record is the last record in the file.
EOF=
label defines a label to
which the program will automatically branch when an
INPUT statement
tries to read past the end of the file.
FIRSTOBS=number specifies the first record to be
read. The default value is 1.
OBS=record-number is the number of the
last record to read.
Processing
the end of records: The
options below define what action SAS will take when the program attempts to
move the column pointer beyond the end of a record.
FLOWOVER
tells SAS to continue
reading succeeding records until all variables in the INPUT statement have been
read.
MISSOVER
tells SAS to set
remaining variables in the INPUT statement to missing.
STOPOVER
tells SAS to immediately
execute a STOP statement, which will stop the DATA step with _ERROR_ equal to
1.
TRUNCOVER
tells SAS to salvage
whatever it can from short records without going to the next record.
LENGTH=variable defines a variable that is
set to the length of the current record. Even though the LENGTH= variable is
defined by the INFILE, its value is controlled by the execution of the INPUT
statement.
DELIMITER=
identifies one or more
delimiters to be used with list input. Either a variable name or a quoted
constant may be specified. The default delimiter is a blank.
DSD
enables you to read
delimited files correctly when consecutive delimiters are present due to missing
values. It also enables you to read quoted text, a common occurrence when
reading files created on desktop systems.
Conclusion : This post as a sequel of my first
post explains a host of options used in reading raw data which can be used with
those techniques for better control over your reading data. This post is not
just for reading but to be practised. SO guyz keep practising.
Will be
back with some SAS Magic again. Till then Goodbye..!!
Saurabh Singh
Chauhan
Note : Comments and suggestions are always welcome.
Disclaimer :
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc.in the USA and other countries. ® Indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies. The contents of this post are the works of the author(s)and do not necessarily represent the opinions,recommendations, or practices of any organization whatsoever.
Disclaimer :
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc.in the USA and other countries. ® Indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies. The contents of this post are the works of the author(s)and do not necessarily represent the opinions,recommendations, or practices of any organization whatsoever.
No comments:
Post a Comment