Reading: SPSS Base 9.0 User's Guide: Chapter 6, File
Transformations, pp.105-112
Homework: Merge Files - Add Cases
Download: skills99.sav (use your own datafile)
skills98.sav (Download Tips)
skills97.sav
skills96.sav
skills95.sav
skills94.sav
skills93.sav
skills92.sav
skills91.sav
skills90.sav
skills89.sav
skills88.sav
![]()
The file transformation merge files allows you to combine two or more data files. The add cases subcommand allows you to combine systems files that have the same variables but different cases. For example, suppose that three different studies collected demographic data plus data on the Coolidge Axis Two Inventory (CATI) and you now want to collect all the CATI data into one large database. The merge files - add cases transformation will solve that problem. The add variables subcommand will allow you to combine files that have the same cases but different variables. For example, suppose that you are doing a large study and one student is entering the CATI data and another student is entering IQ and PTSD scale data for the same participants. The merge files - add variables transformation will allow you to combine those files. The focus of this set of notes is the merge files - add cases file transformation.
Lets take an example from the skills survey that we did at the beginning of this course. Suppose we want to know if there is a difference between graduates and undergraduates in terms of how much they know about factor analysis. The expectation is that the graduate students would be more knowledgeable than undergraduates about factor analysis. Let's read in the skills98.sav data and run the chi-square analysis for the variables "Are you taking the course for undergraduate/graduate/no credit" and "How much do you know about factor analysis?". The crosstabs table is displayed in Table 1 and the chi-square statistics are displayed in Table 2.
factor analysis |
Total | |||||
|---|---|---|---|---|---|---|
know nothing |
limited understanding |
understand reasonably well |
||||
Are you taking the course for ... |
undergraduate credit |
Count |
1 |
3 |
1 |
5 |
Expected Count |
1.4 |
2.8 |
.8 |
5.0 |
||
% within Are you taking the course for ... |
20.0% |
60.0% |
20.0% |
100.0% |
||
graduate credit |
Count |
3 |
7 |
1 |
11 |
|
Expected Count |
3.1 |
6.1 |
1.8 |
11.0 |
||
% within Are you taking the course for ... |
27.3% |
63.6% |
9.1% |
100.0% |
||
no credit |
Count |
1 |
0 |
1 |
2 |
|
Expected Count |
.6 |
1.1 |
.3 |
2.0 |
||
% within Are you taking the course for ... |
50.0% |
.0% |
50.0% |
100.0% |
||
Total |
Count |
5 |
10 |
3 |
18 |
|
Expected Count |
5.0 |
10.0 |
3.0 |
18.0 |
||
% within Are you taking the course for ... |
27.8% |
55.6% |
16.7% |
100.0% |
||
a. 8 cells (88.9%) have expected count less than 5. The minimum expected count is .33. |
The chi square is not significant, c ˛(4, N = 18) = 3.469, p = .483, but the percentage of cells with low expected counts is very high (89%) and the minimum expected count is very low (.33) so this chi square value is not valid. The expected counts are low for nearly all the cells. We need to increase the expected counts in order to have a valid chi square analysis.
One option would be to use the recode command to attempt to combine cells. For example combining the "no nothing" and "have a limited understanding" cells. In this case it doesn't look like that would help.
Another option would be to collect more data. We can't collect more data this semester, but we do have available the data from previous years. We could combine the data from previous years with this years data. The next section describes how to combine the data from previous years with our current data.
Our goal is to combine the data from several years of skills surveys. The general process will be to begin by opening the most current file, skills98.sav, and then merge the data from the next most recent file, skills97.sav, and so forth until all the files have been merged into the current active file. You could, of course, begin with any year.
Open the most current file, skills98.sav in the SPSS Data Editor.
To merge the next most recent file, skills97.sav, click on
Data
Merge Files
Two options will appear
Add Cases... and
Add Variables...
We are going to be adding new cases from the previous year so select the Add Cases... option. The Add Variables... option would be used if you want to merge additional variables to the current cases.
Select the file that you wish to add, skills97.sav, and Open the file.
A dialog box will open. The window on the right contains all the variables in the New Working Data File. This list contains only the variables that are common to both the working data file and the new data file. Any variables that are unpaired, that is variable names that occur in only one of the files, are not listed in the box on the right. Variable names that are follwed by an ">" are string varaibles (e.g., wordproc>).
The window on the left is labeled Unpaired Variables:. In that window are listed all the variables in Working Data File that are unpaired with variables in the file to be merged. An unpaired variable might result from several different causes:
1. One file has a variable that is not contained in the other file.
2. A variable does not have the same SPSS name in the two files. For example, if the simple anova question were named anova in one file and simanova in the other file.
3. A variable with the same name has a different data type in the two files. For example, a variable called date might be a string variable in one file and an actual date variable in the other file.
4. String variables of unequal length. For example date as A10 in one file but as A9 in the other file.
In this case there are two variables listed in the Unpaired Variables window, maj_oth> and maj_othr>. An unpaired variable followed by the symbol "[*]" is a from the Working Data File, the file that is currently in the Data Editor. At this point it is the data from skills98.sav. An unpaired variable followed by the "[+]" symbol is from the file to be merged, in this case the skills97.sav file. In this example the two variables refer to the same question, "other major," a major that was not listed in the "your major" item. The variable name is spelled as maj_oth in skills98.sav and as maj_othr in skills97.sav. To fix this problem:
1. Highlight one of the variables and rename it so that both names are the same. For example, rename maj_oth to man_othr.
2. Highlight both variables and press the Pair button. The variable will move to the Working Data File window.
You can remove a variable from the Working Data File if you do not wish to include that variable in the final, combined data file.
When you have made all the desired changes in the Working Data File and Unpaired Variables Windows then press the OK button to add the cases from skills97.sav to the Working File. Look at the data in the Data Editor to confirm that you have data from 1998 and 1997. Note that your name of the file in the Data Editor has been changed to "Untitled."
Next repeat the steps to add the cases from skills96.sav to the Data Editor.
Data
Merge Files
Add Cases...
Open skills96.sav
Thre are 15 unpaired variables from the Working Data File (those followed by the asterisk [*], and 12 unpaired variables from the to be merged file, skills96.sav.
The variables date[*], maj_othr[*], win97[*], and spsswin[*] were first used in skills97.sav so they do not exist in the earlier files. We want to keep those variables, so highlight each of them and move them to the Working Data File window.
The unpaired variable spss_pc(+) was used only in earlier years. This pc version of SPSS was not widely adopted. We had a coule of versions in the department but it was never taught here. Lets not keep this variable, that it, no not move spss_pc to the Working Dat File window.
The wordproc[*] variable is a string variable in the skills97.sav and skills98.sav files (it contains the name of your favorite word processor). That same data was coded prior to entering it in the earlier files. For example, in skills96, Word was coded as 1, Wordstar as 3, and AMI-Pro as 7. The coded data is stored in a numeric variable called favwp(+) in the skills96.sav file. I recommend saving both variables, wordproc[*] and favwp[+], by moving them to the Working Data File window. At some point it might be interesting to recode the string values in wordproc into the numeric variable favwp so that you could look at the changes in preferences for word processors over the years..
Another reason you may have unpaired variables is that the same variables may have different names in the two files. This can easily happen if you have different people working on the two files. For example the data from two years ago was coded and entered by a student who has now graduated. The current data is being entered and coded by a different person. If the same variables have different names you can a) rename one of the variables so it matches the name of its pair; and then b) use the pair button to move the two variables to the new working data file. The names of the common variables in the skills98.sav, skills97.sav and skills96.sav are shown in the next table.
| Name in skills98.sav and skills97.sav |
Name in skills96.sav |
|---|---|
| emacs(*) | emacscmd(+) |
| factor(*) | facanaly(+) |
| homecomp(*) | home_pc(+) |
| language(*) | prolang(+) |
| msdos(*) | ms_dos(+) |
| multregr(*) | regress(+) |
| spreadsh(*) | sprdshts(+) |
| ttest(*) | t_test(+) |
| unix(*) | unixcmds(+) |
The unpaired names in the skills97.sav and skills98.sav file are generally consistent with the names in the earlier files. The unpaired names in the skills96.sav file are unique. So the best strategy in this case is to rename the variables in the skills96.sav file to be consistent with those in the skills97.sav file. Lets use the rename command to rename each of the unpaired variables in the skills96.sav file. Highlight the variable name, click the rename button, and rename the variable and press continue.
Next highlight both variables of a pair, e.g., emacs(*) and emacscmd->emacs(+). (Press and hold down the ctrl key while clicking on a variable to highlight more than one variable at a time.) Then press the Pair key to move the variable pair to the new working data window. Repeat that process for each variable pair.
That leaves one variable remaining in the unpaired variables window, doswndow(+). That variable was only used for three years so it is probably not worth keeping. Let it remain in the unpaired window box and press the OK key to perform the merge and close the dialog box.
At this point lets save the information in the Data Editor, call it skills98_96.sav.
We now have 83 cases across 3 years of data. We began this set of notes with the crosstabs on credit and factor analysis. Lets see if we have enough data to run the chi square analysis.
The crosstab table for the data in the skills98_96.sav file is shown in Table 4. Cells with expected counts of less than 5 are highlighted. The chi square statistics are shown in Table 5.
Table 4. Are you taking the course for ... * factor analysis Crosstabulation
factor analysis |
Total |
||||||
know nothing |
limited understanding |
understand reasonably well |
expert |
||||
Are you taking the course for ... |
undergraduate credit |
Count |
5 |
19 |
5 |
1 |
30 |
Expected Count |
5.6 |
16.7 |
7.4 |
.4 |
30.0 |
||
% within Are you taking the course for ... |
16.7% |
63.3% |
16.7% |
3.3% |
100.0% |
||
graduate credit |
Count |
9 |
24 |
14 |
0 |
47 |
|
Expected Count |
8.7 |
26.1 |
11.6 |
.6 |
47.0 |
||
% within Are you taking the course for ... |
19.1% |
51.1% |
29.8% |
.0% |
100.0% |
||
no credit |
Count |
1 |
2 |
1 |
0 |
4 |
|
Expected Count |
.7 |
2.2 |
1.0 |
.0 |
4.0 |
||
% within Are you taking the course for ... |
25.0% |
50.0% |
25.0% |
.0% |
100.0% |
||
Total |
Count |
15 |
45 |
20 |
1 |
81 |
|
Expected Count |
15.0 |
45.0 |
20.0 |
1.0 |
81.0 |
||
% within Are you taking the course for ... |
18.5% |
55.6% |
24.7% |
1.2% |
100.0% |
a 6 cells (50.0%) have expected count less than 5. The minimum expected count is .05. |
There is still a problem with the expected counts. Half of the cells have expected counts less than 5 and the minimum expected count is .06. There were only 5 people who took the course for no credit. The expected counts are all less than 5 for the no credit group. In addition, over the past 3 years there is only 1 person who considered him- or herself to be experts in factor analysis. We need another approach to resolving the low expected count problem.
As noted in the previous section the low expected frequencies occur in the "no credit" row and the "expert" column. We can use recode to solve those two problems. There are only 4 people who took the course for no credit. We don't know whether they are undergraduate or graduate students so it wouldn't make sense to recode them as either undergraduates or graduates so lets recode them as system missing. That is we could just delete those four cases from consideration.
There is a different option for the case in the "expert" column, add that case to the "understand reasonably well" column by recoding 4s as 3s. We don't loose any data that way. You should also relabel the value 3 as "understand reasonably well and expert."
The chi square tests are shown in Tables 6 and 7.
factor analysis |
Total |
|||||
know nothing |
limited understanding |
understand reasonably well and expert |
||||
Are you taking the course for ... |
undergraduate credit |
Count |
5 |
19 |
6 |
30 |
Expected Count |
5.5 |
16.8 |
7.8 |
30.0 |
||
% within Are you taking the course for ... |
16.7% |
63.3% |
20.0% |
100.0% |
||
graduate credit |
Count |
9 |
24 |
14 |
47 |
|
Expected Count |
8.5 |
26.2 |
12.2 |
47.0 |
||
% within Are you taking the course for ... |
19.1% |
51.1% |
29.8% |
100.0% |
||
Total |
Count |
14 |
43 |
20 |
77 |
|
Expected Count |
14.0 |
43.0 |
20.0 |
77.0 |
||
% within Are you taking the course for ... |
18.2% |
55.8% |
26.0% |
100.0% |
a 0 cells (.0%) have expected count less than 5. The minimum expected count is 5.45. |
The low expected count problem has been resolved. The data indicate that knowledge of factor analysis is unrelated to taking the course for undergraduate or graduate credit, c ˛(2, N = 77) = 1.231, p = .54. Does anyone have an explanation for these findings?
ŠLee A. Becker, 1997-1999 -revised 07/13/00