Data Transformation: RECODE

Reading: SPSS Base 9.0 User's Guide: Chapter 5, Data Transformations, pp. 86-90, 94-95               
Homework: recode
Download: recode.sav

  1. Overview
  2. Select the Recode Transformation
  3. Recode into the Same Variable
  4. Recode into a Different Variable
  5. Recode a Range of Values
  6. The keywords MISSING and SYSMIS
  7. The keywords ELSE and COPY
  8. When to use MISSING VALUES after a RECODE
  9. Recoding String Variables

1. Overview

The recode transformation allows you to change, or recode, one or more values. You can either change the values within a variable or create a new variable with the recoded values.

Suppose you have used an "Attitudes Towards Capital Punishment" (ATCP) scale in your research (L. L. Thurstone, printed in Shaw, M. E., & Wright, J. M. Scales for the Measurement of Attitudes, New York: McGraw-Hill, 1967). Four of the 24 items in ATCP are included in the file recode.sav. The variables in recode.sav are shown in Table 1; the values are shown in Table 2.

Table 1. Variables in the recode.sav Data File
Variable   Variable Label / Value Labels / Missing Values
id
age        Age of respondent /
           Missing Value = 99
atcp01     Capital punishment is absolutely never justified /
             1 'strongly disagree'
             2 'disagree'
             3 'neutral'
             4 'agree'
             5 'strongly agree'
             9 'I don't understand the question' /
             Missing Value = 9
atcp02     I think capital punishment is necessary but I wish it were not /
             same value labels and missing value as atcp01
atcp03     Any person, man or woman, young or old, who commits murder should pay with his own life /
             same value labels and missing value as atcp01
atcp04     We can't call ourselves civilized as long as we have capital punishment /
             same value labels and missing value as atcp01

 

Table 2. The Values in recode.sav.
 id  age     ses atcp01 atcp02 atcp03 atcp04
  1   27  Middle      5      1      1      5
  2   21   Upper      1      5      5      1
  3   18   Lower      1      9      9      9
  4   99              5      5      4      3
  5   47  Middle      3      9      3      5
  6    .   Lower      4      1      .      9
  7   23   Upper      .      .      .      .
  8   65   Lower      9      4      2      9

The values for the first two cases represent participants who have opposite opinions about capital punishment. The first participant is opposed to capital punishment and strongly agrees with these items: ATCP01: Capital punishment is absolutely never justified, and ATCP04: We can't call ourselves civilized as long as we have capital punishment; and strongly disagrees with these items: ATCP02: I think capital punishment is necessary but I wish it were not, and ATCP03: Any person, man or woman, young or old, who commits murder should pay with his own life. The second participant is in favor of capital punishment and has the opposite opinion on each of those four items.

If you attempt to find each person's attitude towards capital punishment by summing their responses to the four items, you get a score of 12 for each person. The first two cases have opposite opinions about capital punishment, but their summed scores (or their mean scores) are identical. This happens because of the wording of the items on the scale. The scale was constructed so that agreement with some of the items indicates a favorable attitude towards capital punishment (e.g., ATCP02 and ATCP03) and agreement with other items indicates an unfavorable attitude towards capital punishment (e.g., ATCP01 and ATCP04). This technique of reversing the wording of some items is commonly used to control for response biases in the way that people respond to scales.

Suppose that you want to score the scale so that a high score indicates that the person is in favor of capital punishment. Items ATCP02 and ATCP03 are fine as they stand: agreement with those items indicates a favorable attitude towards capital punishment. But what about ATCP01 and ATCP04? A person who is strongly in favor of capital punishment would disagree with those items. The problem can be solved by reversing the scoring for ATCP01 and ATCP04 so that a high score on those items means disagreement with the content of the item rather than agreement. Here is a summary of the value labels, the old values (prior to recoding), and the new values (after recoding):

Table 3. Summary of recoded values
value label         old value         new value
                    (before recode)   (after recode)
strongly disagree   1                 5
disagree            2                 4
neutral             3                 3
agree               4                 2
strongly agree      5                 1

The recode command is used to reverse the scoring. After recoding the values for ATCP01 and ATCP04 you can again find the sum or mean for all the items in the scale. After recoding, a person who was favorable towards capital punishment would get a high score on the sum of the items and a person who was unfavorable towards capital punishment would get a low score on the sum of the items.



2. Select the Recode Transformation

The recode dialog box is opened by clicking

Transform
   Recode
          Into Same Variables ...
          Into Different Variables ...

The "Into Same Variables..." option will recode the values within the variables. The "Into Different Variables..." option will place the results of the recode into a different variable.



3. Recode into the Same Variable

If you select the "Recode Into Same Variables..." option, the dialog box displays the names of the variables in the left box. The variables to be recoded are moved to the Variables: box. Once a variable has been moved to the Variables: box, the Old and New Values button at the bottom of the dialog box becomes active.

The If... button allows for conditional transformations, a topic of a later set of notes.

In this example we wish to recode ATCP01 and ATCP04, so we move them to the Variables: box and then click on the Old and New Values... button. The left side of the new dialog box defines the old values, the values to be recoded. The right side of the new dialog box defines the new values and the rules used for recoding the old values.

You can recode values one at a time by entering a single value to be recoded in the Old-Value Value: window and the result of the recoding in the New-Value Value: window. We want to make the following recodes for ATCP01 and ATCP04: 1 --> 5, 2 --> 4, 3 --> 3, 4 --> 2, and 5 --> 1. The first recode is done by entering the value 1 as the old value, then entering the value 5 as the new value, and then pressing the Add button. Repeat that sequence for each of the other values to be recoded. When you are recoding into the same variable you do not need to enter the recode when the old value equals the new value, e.g., 3 --> 3. Go ahead and make those recodes and save the recode command in the syntax editor. The syntax for the recode command looks like this:

RECODE atcp01 atcp04 (1=5) (2=4) (4=2) (5=1) .

Values that are not explicitly mentioned in the RECODE command are not changed. In this example the "3"s and "9"s were not mentioned so they were not recoded. The 9s remain as user-missing values.

The original data and the recoded data are shown in Table 4. The variable atcp1_4 is the sum of the four attitude items. It was created by the following compute transformation

COMPUTE atcp1_4 = sum.4(atcp01 to atcp04) .

Note that the compute transformation needs to be rerun after running the recode transformation; otherwise atcp1_4 still holds the sum of the unrecoded values.

Table 4. Values Before and After Recoding.
      Values Before Recoding                 Values After Recoding
 id   atcp01 atcp02 atcp03 atcp04 atcp1_4    atcp01 atcp02 atcp03 atcp04 atcp1_4
  1        5      1      1      5      12         1      1      1      1       4
  2        1      5      5      1      12         5      5      5      5      20
  3        1      9      9      9       .         5      9      9      9       .
  4        5      5      4      3      17         1      5      4      3      13
  5        3      9      3      5       .         3      9      3      1       .
  6        4      1      .      9       .         2      1      .      9       .
  7        .      .      .      .       .         .      .      .      .       .
  8        9      4      2      9       .         9      4      2      9       .

 

CAUTION: You have run the recode transformation. What do you suppose would happen if you went to the syntax editor and reran the recode command again? And again? And again? When would you become confused about whether the values of atcp01 and atcp04 were the original values or the recoded values?

What if you run the recode, save the data, come back later, read in the saved data, and run the recode again?

You need to be very careful when you recode into the same variable. It is easy to lose track of what you are doing.
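
As a small illustration of the caution above, the sketch below runs the same reversal twice. It assumes the original recode.sav values are in the Data Editor. Because the mapping 1 <--> 5, 2 <--> 4 is its own inverse, the second run undoes the first, and nothing in the data tells you which state you are in.

```spss
* Illustration only, starting from the original recode.sav values.
RECODE atcp01 atcp04 (1=5) (2=4) (4=2) (5=1) .
RECODE atcp01 atcp04 (1=5) (2=4) (4=2) (5=1) .
EXECUTE .
* After the second run atcp01 and atcp04 hold their original, unreversed values again.
```

An odd number of runs leaves the items reversed and an even number leaves them as they started, which is why it is so easy to lose track.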



4. Recode into a Different Variable

The previous recode changed the values of atcp01 and atcp04 in the active file, i.e., the file in the data editor. Unless you save the contents of the data editor using the same file name, the original values remain in the .sav file. Read in the original values from recode.sav again.

This time we will recode the variables using the "Into Different Variables..." option. Select the variables to be recoded (atcp01 and atcp04) and move them to the Input Variable->Output Variable: box. Select the variable atcp01 -> ?. The Output Variable Name: and Label: boxes will open. Enter a name for the new variable, say atcp01r. You can use any label you wish. You could enter the recode values, e.g., 1=5, 2=4, 3=3, 4=2, 5=1. Or the label could state that the variable is the "recoded value of atcp01." There is no option to automatically add a label as there was in the compute transformation dialog box. Then enter a name and label for the other variable.

Next, go to the Old and New Values... dialog box and enter the values as we did in the previous section. This time you will need to enter the recode for 3, 3 --> 3.

The syntax for this recode uses the keyword INTO to define the new variable(s):

RECODE atcp01 atcp04
  (1=5) (2=4) (3=3) (4=2) (5=1) INTO atcp01r atcp04r .

The results of the recode of the first input variable (atcp01) will be placed in the first output variable (atcp01r). Similarly, the results of the recode of the second input variable (atcp04) will be placed in the second output variable (atcp04r), and so forth.

After running the recode command, the new variables, atcp01r and atcp04r, will appear in the data editor. You can now compute the sum using the recoded variables:

COMPUTE atcp1_4 = sum.4(atcp01r, atcp02, atcp03, atcp04r).
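
One way to check the result, offered here only as a sketch, is to list the original and recoded items side by side. It assumes the RECODE and COMPUTE commands above have already been run, so that atcp01r, atcp04r, and atcp1_4 exist.

```spss
* List old and new values so that each case can be checked by eye.
LIST VARIABLES = id atcp01 atcp01r atcp04 atcp04r atcp1_4 .
```

Each 1 in atcp01 should appear as a 5 in atcp01r, each 2 as a 4, and so on, while the original variables are left untouched.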

The data in the Data Editor should now look like the data in Table 5.

Table 5. The Values in recode.sav.
 id  age     ses atcp01 atcp02 atcp03 atcp04 atcp01r atcp04r atcp1_4
  1   27  Middle      5      1      1      5    1.00    1.00    4.00
  2   21   Upper      1      5      5      1    5.00    5.00   20.00
  3   18   Lower      1      9      9      9    5.00       .       .
  4   99              5      5      4      3    1.00    3.00   13.00
  5   47  Middle      3      9      3      5    3.00    1.00       .
  6    .   Lower      4      1      .      9    2.00       .       .
  7   23   Upper      .      .      .      .       .       .       .
  8   65   Lower      9      4      2      9       .       .       .

Because atcp01r and atcp04r are new variables, they have no value labels and no user missing values (see the SPSS Variables utility to check out the variable names, variable types, labels, and missing values). If you wish you can add value labels and missing values using the Define Variables... dialog box.
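
The same labels can also be added with syntax. The following is a minimal sketch; the wording of the variable labels is an assumption, and the value labels are reversed to match the reversed scoring in Table 3.

```spss
* Sketch: document the new variables. The label wording is an assumption.
VARIABLE LABELS atcp01r 'Recoded value of atcp01'
               /atcp04r 'Recoded value of atcp04' .
* The scoring is reversed, so the value labels are reversed as well.
VALUE LABELS atcp01r atcp04r
    1 'strongly agree'
    2 'agree'
    3 'neutral'
    4 'disagree'
    5 'strongly disagree' .
```

No MISSING VALUES command is needed for this particular recode, because the old 9s were not mentioned in the recode and so are system missing in atcp01r and atcp04r.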

The default variable type for the newly created numeric variables, atcp01r, atcp04r, and atcp1_4, is numeric 8.2. Furthermore, whenever you create a new numeric variable, all of its values are initially set to system missing.

Recall that values that are not explicitly mentioned in the RECODE command are not changed. That rule still applies when you are recoding into a different variable. Any old value that is not explicitly mentioned is not recoded, therefore the corresponding value in the new output variable remains as system missing. In this example the 9's (the user-defined missing values) were not recoded so all 9's are system missing in the newly created variables atcp01r and atcp04r. Similarly, the old system missing values were not explicitly recoded so the result remains at system missing in the newly created output variables.

(Technically, the recode outcome variable does not have to be a new variable, although in most instances you will want to create a new variable. If the outcome variable already exists in the data editor, then what happens if an old value is not mentioned in the recode? The corresponding value in the outcome variable is not changed.)



5. Recode a Range of Values

The previous example recoded values on a one-to-one basis. The keywords lowest, highest, and through can be used to recode a range of values.

You should pay attention to what happens to user- and system-missing values in these examples.

Suppose you wanted to know if people at or below the median age responded differently than people above the median age. You could recode the age variable into two levels, those at or below the median age and those above the median age, and use the recoded variable in a t-test or an analysis of variance. Suppose the median age was 28.

Go to the "Into Different Variables..." recode option. Move age into the Numeric Variable-> Output Variable: box. Name the output variable "mdnage" and label it as "Median age." Then go to the Old and New Values... dialog box. There are three range options in the Old Value section: range; range lowest through...; and range ...through highest. Click on the Range Lowest through... radio button and enter the value 28, the median age. Enter "1" as the New Value and Add it to the Old --> New list. Next click on the Range ... through highest radio button and enter the median age again, 28. Enter "2" as the New Value and Add it to the Old --> New list. The syntax for this recode would be:

RECODE  age (Lowest thru 28=1) (28 thru Highest=2) INTO mdnage .

Why did we use 28 as the value for ... through highest rather than 29? The first recode, "lowest through 28," recoded all the values from the lowest value up to and including 28 as 1s. A recode of "29 through highest" would recode all the values from 29 up to the highest value as 2s. Any fractional value between 28 and 29 would then not be recoded, e.g., an age of 28.1. In this particular set of data there are no fractional values, so either "28 through highest" or "29 through highest" would have worked. What happens to the value 28 when you apply the recodes "lowest through 28" and "28 through highest"? SPSS will only recode a value once, so the second time the value 28 is referenced (as in 28 through highest) it is ignored. It is best to get used to overlapping the range values in this manner so that no values slip through the cracks and go unrecoded.
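
The same overlapping pattern extends to more than two groups. The sketch below is hypothetical: the cut points 20 and 40 and the variable name agegrp are made up for illustration and are not part of recode.sav.

```spss
* Hypothetical three-group split. The cut points and the name agegrp are made up.
* Each boundary value appears in two ranges, so fractional ages cannot fall between groups.
RECODE age (Lowest thru 20=1) (20 thru 40=2) (40 thru Highest=3) INTO agegrp .
EXECUTE .
```

Because a value is recoded only once, an age of exactly 20 lands in group 1 and an age of exactly 40 lands in group 2. Like the two-group recode, this form also sweeps the user-missing value 99 into the top group, a problem taken up under "Missing values" below.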

Just to make sure you understand the concept... Suppose you added the recodes in the following order-

28 through highest --> 2
lowest through 28 --> 1

RECODE  age (28 thru Highest=2) (Lowest thru 28=1) INTO mdnage .

Would 28 be recoded as a "1" or as a "2?"

The default format of the new numeric variable mdnage is numeric 8.2.

Missing values. Look what happens to the user-defined missing value in this recode. It is recoded as a 2! When using a range of values, all values within that range (including user-missing values) will be recoded. In this case we do not want the "99" to be recoded as a "2" because we do not know the age of that person. The way to get around this problem is to define the range so that the user-missing value is not included in that range. Go back to the recode dialog box. Check the radio button "range ... through ... ." Enter "28" before the "through" and "98" after the "through." Enter "2" as the New Value and Add it to the Old --> New: box. If you still have the recode "28 through highest --> 2" showing, then highlight it and Remove it from the recode list. Press Continue, save the syntax if you wish, and then click OK to run this recode transformation. The syntax command would look like this:

RECODE  age (Lowest thru 28=1) (28 thru 98=2) INTO mdnage .

If you set the variable type for mdnage to F1.0, then the values for age and mdnage should be the same as those shown in Table 6.

Table 6. The age and mdnage Values.
 id  age  mdnage
  1   27       1
  2   21       1
  3   18       1
  4   99       .
  5   47       2
  6    .       .
  7   23       1
  8   65       2

The data in your Data Editor may not be exactly the same as the data in Table 6. The mdnage value for case #4 may still be a 2. Why is that? Because we recoded into an existing variable, mdnage. Any values that were not mentioned in the new recode were not changed. We explicitly excluded 99 from consideration, so the old value of 2 for case #4 was not changed. If you make a mistake when recoding into a new variable, it is best to delete that new variable and rerun the recode. Try it.
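
An alternative to deleting the variable, sketched here under the assumption that mdnage already exists from the earlier run, is to end the recode with an ELSE specification (the ELSE keyword is covered in section 7). The stale 2 for case #4 is then overwritten instead of being left alone.

```spss
* Sketch: values not matched by either range, such as the user-missing 99, become system missing.
RECODE age (Lowest thru 28=1) (28 thru 98=2) (ELSE=SYSMIS) INTO mdnage .
* Display mdnage without decimals, as in Table 6.
FORMATS mdnage (F1.0) .
EXECUTE .
```

With the ELSE specification in place, the result no longer depends on whatever mdnage held before the recode was run.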

Here is a summary of what happens to values that are not explicitly recoded and to user- and system-missing values when a recode involves a range of values.

Table 7. What Happens When a Range of Values is Used in a Recode?

What happens to a value that is not explicitly recoded?
  When using "Into Same Variables...": The value remains unchanged.
  When using "Into Different Variables...": The result is system missing if a new variable is created by the recode (new numeric variables are initialized to system missing). If you are recoding into an existing variable, the old value in that variable is not changed.

What happens to a system-missing value?
  When using "Into Same Variables...": A system-missing value cannot be included in a range of values. It remains system missing.
  When using "Into Different Variables...": A system-missing value cannot be included in a range of values. The result is system missing if a new variable is created by the recode.

What happens to a user-missing value?
  When using "Into Same Variables...": It is recoded if it is included in a range or if the value is explicitly recoded. Caution: the missing value flag is not carried over to the recoded value. The recoded value will be valid unless you explicitly define it as missing.
  When using "Into Different Variables...": It is recoded if it is included in a range or if the value is explicitly recoded. Caution: missing value flags are not carried over into the different variable. If you are creating a new variable, you will need to define any user-missing values in that new variable.



6. The keywords MISSING and SYSMIS

The keywords MISSING and SYSMIS were previously encountered in the notes on the compute transformation. MISSING refers to both user- and system-missing values. SYSMIS refers to system missing values only.

Consider the following recode syntax

RECODE atcp01 (MISSING = -99) INTO item1.
RECODE atcp02 (SYSMIS = -99) INTO item2.
RECODE atcp03 (9 = SYSMIS) INTO item3.
MISSING VALUES item1, item2 (-99).
EXECUTE.

Hint: an easy way to run these recodes is to copy and paste them into the syntax editor and then run them all at once.

The first example, RECODE atcp01 (MISSING = -99) INTO item1, recodes all missing values into -99. In the dialog box select the Old Value radio button for System- or user-missing.

The second example, RECODE atcp02 (SYSMIS = -99) INTO item2, codes system missing values into -99. In the dialog box select the Old Value radio button for System-missing.

The third example, RECODE atcp03 (9 = SYSMIS) INTO item3, changes the user-missing value 9 into system missing values. In the dialog box select New Value radio button for System-missing.

Important: the -99s in the new variables item1 and item2 will be considered to be valid values unless you explicitly define those values as user missing. You can define user-missing values in the Define Variables... dialog box, or you can define them by running the syntax command MISSING VALUES.

The results of those recode transformations are shown in Table 8.

Table 8. The Keywords Missing and Sysmis.
 id atcp01 atcp02 atcp03 atcp04 item1 item2 item3
  1      5      1      1      5     .     .     .
  2      1      5      5      1     .     .     .
  3      1      9      9      9     .     .     .
  4      5      5      4      3     .     .     .
  5      3      9      3      5     .     .     .
  6      4      1      .      9     .     .     .
  7      .      .      .      .   -99   -99     .
  8      9      4      2      9   -99     .     .



7. The keywords ELSE and COPY

The ELSE keyword refers to every value that has not been previously specified in the recode for a variable. The ELSE keyword should be the last element of a recode. The COPY keyword says that the input values should be copied into the output variable.

Consider the following recode commands -

RECODE   atcp01 (MISSING=-99) (ELSE=COPY)  INTO item4  
       / atcp02 (SYSMIS=-99)  (ELSE=COPY)  INTO item5
       / atcp03 (9=SYSMIS) (1 THRU 5=COPY) INTO item6.
MISSING VALUES item4 (-99)/ item5 (9,-99). 

The syntax for recode allows you to recode several different variables if you separate each of the recodes from each other by a slash (/). The slash prior to atcp02 indicates that this is the start of a new set of recode rules, as does the slash prior to atcp03.

The syntax for missing values also allows you to assign different missing values for different variables as long as each variable set is separated from the other sets by a slash. In the MISSING VALUES command the slash prior to item5 indicates a new set of missing values.

The first recode, atcp01 (MISSING=-99) (ELSE=COPY) INTO item4, assigns the value -99 to both user- and system-missing values and then copies all the other values into the output variable, item4. The MISSING VALUES command defines -99 as a user-missing value for item4.

The second recode, atcp02 (SYSMIS=-99) (ELSE=COPY) INTO item5, changes system-missing values to -99 and copies every other value, including the 9s, into the output variable, item5. The MISSING VALUES command defines both 9 and -99 as user-missing values for item5.

The third recode, atcp03 (9=SYSMIS) (1 THRU 5=COPY) INTO item6, changes the old user-missing values of 9 to system-missing values and then copies the values 1 through 5 into the output variable, item6. There are no user-defined missing values in item6; they have all been changed to system-missing values.

Again it is very important to note that the -99s in item4 and the 9s and -99s in item5 will all be valid values if you do not define them as user-missing. You can define user-missing values by running the syntax command MISSING VALUES or you can define them in the Define Variables... dialog box.

The results of the recodes in this section are shown in Table 9.

Table 9. The Keywords Else and Copy.
 id atcp01 atcp02 atcp03 atcp04 item4 item5 item6
  1      5      1      1      5     5     1     1
  2      1      5      5      1     1     5     5
  3      1      9      9      9     1     9     .
  4      5      5      4      3     5     5     4
  5      3      9      3      5     3     9     3
  6      4      1      .      9     4     1     .
  7      .      .      .      .   -99   -99     .
  8      9      4      2      9   -99     4     2



8. When to use MISSING VALUES after a RECODE

    Q -  When should you redefine MISSING VALUES for recoded variables?
    A -  When RECODE has changed the user-defined missing values, or
         when RECODE has generated a new variable which has user-
         defined missing values.
CAUTION: Remember that user missing values are defined by the MISSING VALUES command. If you define a new variable then you must use MISSING VALUES to specify any user missing values for that new variable. Similarly, if you change the user missing value for a variable, then you must use MISSING VALUES to identify that new value as missing.
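
As a hedged illustration of the second case, suppose you decided to change the user-missing code for atcp02 from 9 to -9. The choice of -9 is hypothetical and is used only to show the order of the commands.

```spss
* Hypothetical example: move the user-missing code of atcp02 from 9 to -9.
RECODE atcp02 (9=-9) .
* The existing declaration still names 9, so declare -9 as the user-missing value.
MISSING VALUES atcp02 (-9) .
EXECUTE .
```

Without the MISSING VALUES command the -9s would be treated as valid values and would enter into sums and means.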

 



9. Recoding String Variables

Some things to pay attention to when recoding string variables:
(a) the width of the initial string variable; and
(b) the width of the string values in your RECODE command.

Remember the rule for strings? If the length of the given string (where length is indicated by the number of characters between the quotes) is shorter than the defined width of the string variable, then SPSS pads the string with blanks on the right.

Enter the following transformation using the recode dialog box. Run the transformation and then check the results. Are they what you expected?

RECODE SES ('Lower'=1) ('Middle'=2) ('Upper'=3) INTO NSES .

The SES values Lower and Upper were right justified when they were entered. So they were not picked up by the recode command. Go back to the recode dialog box, fix the problem and rerun it.  The NSES values 1, 2, and 3 have no labels.  Why is that?   You have created a new variable, NSES, with the recode.  The new variable is set to numeric 8.2 and the values are initialized to system missing.  There are no value labels for this new variable.  Can you fix that problem?

The fourth case has no value for SES. The result of the above recode command is to set the NSES value of the fourth case to system missing. How would you set the blank value of the fourth case to 9 and then define 9s as user-missing?
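
One possible approach to these questions is sketched below. It assumes that ses is a 6-character string, so that the right-justified values are ' Lower' and ' Upper' with one leading blank; check the actual width in your file before relying on it. The label 'No answer' is made up for illustration.

```spss
* Sketch, assuming ses is a 6-character string with Lower and Upper right justified.
* ELSE sends every other value, here only the blank ses of case 4, to 9.
RECODE ses (' Lower'=1) ('Middle'=2) (' Upper'=3) (ELSE=9) INTO nses .
VALUE LABELS nses 1 'Lower' 2 'Middle' 3 'Upper' 9 'No answer' .
MISSING VALUES nses (9) .
EXECUTE .
```

Note that ELSE would also turn any misspelled SES value into a 9, so it is worth listing ses and nses together after running the recode.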



© Lee A. Becker, 1997-1999. Revised 10/12/99.