Explore: Statistics and Charts

Reading: SPSS Base 9.0 User's Guide: Chapter 13, Explore
Download: explore.sav

  1. Overview
  2. Select the Explore Procedure
  3. Select One or More Variables, Run the Procedure, and Interpret the Output
  4. Case Processing Summary & Descriptive Statistics
  5. Histograms
  6. Stem-and-Leaf Plots
  7. Boxplots
  8. Normality Plots
  9. Outliers
  10. M-estimators

1. Overview

The explore notes are in two parts. Part 1, this set, describes the statistics and graphical displays that are available in explore. Part 2 describes how to use the explore procedure to test the assumptions of a one-way analysis of variance.

The graphical displays from explore provide visual information about central tendency, variability and distribution shapes. The purpose of this set of notes is to focus on the visual information presented by those displays rather than on the text associated with them. To that end, the graphics have been reduced in size so that the dominant information is visual. Don't worry if you can't read the text for those displays.

The data

The data for this set of notes can be found in the file explore.sav. The variables used in this set of notes are normal, skewed, and uniform. They are representative examples of data that are normally distributed, positively skewed, and platykurtic, respectively. (The other variables in the file will be used in the next set of explore notes.) Our goal is to try to distinguish between those shapes using the various statistics and graphical displays provided by explore.

Table 1. The variables in explore.sav.

Variable Name   Variable Label / Value Label
id              Participant ID number
condit          Treatment condition /
                  1 'BOTH SONG & GRAPH'
                  2 'GRAPH ONLY'
                  3 'SONG ONLY'
gender          Gender of participant /
                  M = "Male"   F = "Female"
numright        The number correct over 20 trials
normal          A normally distributed variable.
skewed          A positively skewed variable.
uniform         A platykurtic variable.


2. Select the Explore Procedure

The explore dialog box is opened by clicking

Analyze
      Descriptive Statistics
            Explore

The dialog box shows the variable names in the data set in the left box. The variables to be analyzed will be moved to the empty box to the right. Display options (statistics, plots, or both), statistics, plots, and missing value options are accessed by the buttons at the bottom of the dialog window. The buttons to the far right will run the procedure (OK), paste the commands into the syntax window, reset the variables to be analyzed, cancel the procedure, and provide help.

For the first time, you can also select a Factor List; factors are the independent variables.

You can also select the variable that will be used to identify individual cases such as outliers and extreme values (see Label Cases by:). By default SPSS will identify the cases by their sequential case number in the data file. You could override the default and use, say, the ID values rather than the case number.

Statistics... 

Descriptives (by default, a 95% confidence interval for the mean is selected; you can enter any other confidence level),
M-estimators,
Outliers, and
Percentiles.

Plots...

Boxplots (factor levels together as the default);
Descriptive plots -- Stem-and-Leaf (default) and Histograms;
Normality Plots with tests (Kolmogorov-Smirnov statistic with a Lilliefors significance level, and Shapiro-Wilk for samples with 50 or fewer observations; both are tests of normality); and
Spread vs. Level with Levene Test (a test of homogeneity of cell variances).

Spread vs. Level plots can be displayed only if there is at least one variable selected in the Factor List.

Missing Values Options...

Clicking the Options... button at the lower left of the explore dialog box will open the missing values dialog box. The missing values options tell SPSS how to handle values that are user-missing or system-missing.

The exclude cases listwise option (the default) will delete the entire case from the analysis if any value in either the dependent list or the factor list is missing. This option results in equal n's for the reported statistics.

The exclude cases pairwise option will report statistics based on all the available valid data. If there are missing values then the n's may be different for the reported statistics.

The report values option will include missing values. They will be reported as a separate category.
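The difference between the listwise and pairwise options can be sketched in a few lines of Python (this is an illustration, not SPSS syntax). The data values and the function names are made up, None stands for a missing value, and the factor list is ignored to keep the example short.

```python
# Hypothetical mini data set; None marks a missing value.
rows = [
    {"normal": 1.2, "skewed": 20.0},
    {"normal": None, "skewed": 31.0},
    {"normal": 0.4, "skewed": None},
    {"normal": -2.1, "skewed": 18.0},
    {"normal": None, "skewed": 27.0},
]
variables = ["normal", "skewed"]


def listwise_n(rows, variables):
    """Count cases after dropping any case with a missing value on any variable."""
    complete = [r for r in rows if all(r[v] is not None for v in variables)]
    return {v: len(complete) for v in variables}


def pairwise_n(rows, variables):
    """Count, separately for each variable, the cases with a valid value."""
    return {v: sum(1 for r in rows if r[v] is not None) for v in variables}


print(listwise_n(rows, variables))  # the same n for every variable
print(pairwise_n(rows, variables))  # n can differ from variable to variable
```

With listwise exclusion every variable ends up with the same n; with pairwise exclusion each variable keeps all of its own valid cases.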


3. Select One or More Variables, Run the Procedure, and Interpret the Output

Let's select the normal, skewed, and uniform variables for the Dependent List:

The default output includes the Case Processing Summary, Descriptive Statistics, Stem-and-Leaf Plots, and Boxplots.

Let's also look at the following statistics: M-estimators, outliers, and percentiles; and the following plots: histograms and normality plots with tests.


4. Case Processing Summary & Descriptive Statistics

Table 2. Case Processing Summary

                             Cases
           Valid            Missing          Total
           N    Percent     N    Percent     N    Percent
SKEWED     75   100.0%      0    .0%         75   100.0%
UNIFORM    75   100.0%      0    .0%         75   100.0%
NORMAL     75   100.0%      0    .0%         75   100.0%

The Case Processing Summary (see Table 2) gives the number and percent of valid cases, missing cases, and total cases. Each of the three variables has 75 valid cases (100%) and no missing cases (0%).

Table 3. Descriptive Statistics.

                                                         Statistic   Std. Error
NORMAL    Mean                                           .5433       .5391
          95% Confidence Interval for Mean, Lower Bound  -.5308
          95% Confidence Interval for Mean, Upper Bound  1.6174
          5% Trimmed Mean                                .5099
          Median                                         .6733
          Variance                                       21.795
          Std. Deviation                                 4.6685
          Minimum                                        -8.82
          Maximum                                        10.60
          Range                                          19.42
          Interquartile Range                            6.6331
          Skewness                                       .100        .277
          Kurtosis                                       -.556       .548
SKEWED    Mean                                           28.1867     1.0008
          95% Confidence Interval for Mean, Lower Bound  26.1924
          95% Confidence Interval for Mean, Upper Bound  30.1809
          5% Trimmed Mean                                27.6074
          Median                                         27.0000
          Variance                                       75.127
          Std. Deviation                                 8.6676
          Minimum                                        17.00
          Maximum                                        58.00
          Range                                          41.00
          Interquartile Range                            14.0000
          Skewness                                       .891        .277
          Kurtosis                                       .612        .548
UNIFORM   Mean                                           2.7623      .1757
          95% Confidence Interval for Mean, Lower Bound  2.4122
          95% Confidence Interval for Mean, Upper Bound  3.1124
          5% Trimmed Mean                                2.7847
          Median                                         2.7537
          Variance                                       2.315
          Std. Deviation                                 1.5216
          Minimum                                        .12
          Maximum                                        4.97
          Range                                          4.86
          Interquartile Range                            2.6004
          Skewness                                       -.201       .277
          Kurtosis                                       -1.268      .548

The descriptive statistics displayed by the explore procedure are shown in Table 3. Two new statistics provided by explore are the 5% trimmed mean and the interquartile range.

The 5% trimmed mean excludes 5% of the cases from each end of the set of scores. To remove the 5% from each end of the set of scores you would first sort the cases according to the values of a variable. Then you would delete 5% of the cases at both ends of the distribution and compute the mean based on the remaining cases. In this example there are 75 valid cases, so 5% of 75 or 3.75 cases would be trimmed at each end of the distribution (the SPSS formula allows for fractional trimming).  If there are outliers (values that are very extreme) or if the distribution is skewed, then the 5% trimmed mean will be different from the untrimmed mean.
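The fractional trimming can be sketched in Python. This is one plausible reading of a fractional trim (drop the whole cases at each end, then give the next case in a partial weight); it is an illustration under that assumption and is not guaranteed to reproduce the SPSS value to the last decimal place. The function name is hypothetical.

```python
import math


def trimmed_mean(values, proportion=0.05):
    """Mean after trimming `proportion` of the cases from each end.

    Assumes 0 <= proportion < 0.5 and at least one value.
    """
    xs = sorted(values)
    n = len(xs)
    g = proportion * n        # cases to trim at each end, e.g. 3.75 when n = 75
    whole = math.floor(g)     # cases dropped completely at each end
    frac = g - whole          # fraction removed from the next case in
    weights = [1.0] * n
    for i in range(whole):
        weights[i] = 0.0
        weights[n - 1 - i] = 0.0
    weights[whole] -= frac
    weights[n - 1 - whole] -= frac
    # The weights sum to n - 2g, the number of cases left after trimming.
    return sum(w * x for w, x in zip(weights, xs)) / (n - 2 * g)


# Made-up scores with one wild value at the high end.
scores = list(range(1, 20)) + [1000]
print(sum(scores) / len(scores))  # untrimmed mean, pulled up by the wild value
print(trimmed_mean(scores))       # trimmed mean, much closer to the bulk of the scores
```

With 75 cases the sketch drops three cases completely at each end and counts the fourth case at each end with a weight of 0.25.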

The interquartile range is the distance between the 25th percentile and the 75th percentile.

You should compute the confidence intervals around the skewness and kurtosis scores for each variable to confirm for yourself that the variables are normal, positively skewed, and platykurtic.
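One common way to do this, assumed here, is to take the statistic plus or minus 1.96 standard errors as an approximate 95% confidence interval and then check whether the interval contains zero. The statistics and standard errors below are copied from Table 3; the function names are just for illustration.

```python
# (statistic, standard error) pairs copied from Table 3.
table3 = {
    "NORMAL": {"skewness": (0.100, 0.277), "kurtosis": (-0.556, 0.548)},
    "SKEWED": {"skewness": (0.891, 0.277), "kurtosis": (0.612, 0.548)},
    "UNIFORM": {"skewness": (-0.201, 0.277), "kurtosis": (-1.268, 0.548)},
}


def confidence_interval(statistic, std_error, z=1.96):
    """Approximate 95% confidence interval: statistic +/- z * standard error."""
    return statistic - z * std_error, statistic + z * std_error


def excludes_zero(statistic, std_error):
    """True when the interval lies entirely above or entirely below zero."""
    lower, upper = confidence_interval(statistic, std_error)
    return lower > 0 or upper < 0


for variable, stats in table3.items():
    for name, (statistic, std_error) in stats.items():
        lower, upper = confidence_interval(statistic, std_error)
        flag = "excludes 0" if excludes_zero(statistic, std_error) else "includes 0"
        print(f"{variable:8s}{name:9s}{lower:8.3f} to {upper:6.3f}  ({flag})")
```

An interval that excludes zero on the positive side for skewness suggests positive skew; an interval that excludes zero on the negative side for kurtosis suggests a platykurtic distribution.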


5. Histograms

Figure 1. Histograms for the variables normal, skewed, and uniform.

Can you look at these three histograms and determine which is normal, which is skewed, and which is platykurtic?

Note that, unlike the frequencies procedure, you cannot superimpose a normal curve on the histograms created by the explore procedure.

A general rule of thumb is that you should use the graphical tools to get a general idea of the shapes of the distributions. You should also use the statistical tests (e.g., the skewness and kurtosis statistics) to determine whether or not the scores meet the objective tests for normality.


6. Stem-and-Leaf Plots

Figure 2. Stem-and-leaf plots for the variables normal, skewed, and uniform.

NORMAL Stem-and-Leaf Plot

 Frequency    Stem &  Leaf

    10.00       -0 .  5566667788
    24.00       -0 .  000011111111122233333344
    28.00        0 .  0000011111111122222233344444
    12.00        0 .  566666688999
     1.00        1 .  0

 Stem width:     10.00
 Each leaf:       1 case(s)

SKEWED Stem-and-Leaf Plot

 Frequency    Stem &  Leaf

    11.00        1 .  78888899999
    21.00        2 .  000000001112222222333
    12.00        2 .  666777888999
    15.00        3 .  000111111223444
     8.00        3 .  57788899
     5.00        4 .  11124
     2.00        4 .  58
     1.00 Extremes    (>=58)

 Stem width:     10.00
 Each leaf:       1 case(s)

UNIFORM Stem-and-Leaf Plot

 Frequency    Stem &  Leaf

     7.00        0 .  1122234
     8.00        0 .  66788899
     3.00        1 .  014
     8.00        1 .  56778889
     7.00        2 .  1111234
     7.00        2 .  5577789
     6.00        3 .  133444
     6.00        3 .  667788
    11.00        4 .  00111112344
    12.00        4 .  555666788999

 Stem width:      1.00
 Each leaf:       1 case(s)

Information about the shape of the distribution is also provided by these stem-and-leaf plots. The visual information is similar to that provided by histograms. You should be able to look at the shape of the plots and distinguish between the three distributions.

Stem-and-leaf plots actually provide a very detailed summary of the scores. To understand how a stem-and-leaf plot is constructed you need to first determine the range of scores for a variable. Let's look at the skewed variable. The values ranged from 17 to 58 (see Table 3). So we are looking for scores in that range in the stem-and-leaf plot. The information at the bottom of the plot states that the stem width is 10. This indicates that the stem values, 1 through 4, represent the 10's part of the scores. That is, the stem 1 represents the number 10, the stem 2 represents the number 20, the stem 3 represents the number 30 and the stem 4 represents the number 40. To understand how this works let's look at the data itself.

The top row of the stem-and-leaf data for the skewed variable is -

  11.00        1 .  78888899999

The first number, 11.00, is the frequency for that row. There are 11 cases represented in that row of data. The next number, 1, is the stem of the stem-and-leaf, indicating that the values are in the 10's range. The remaining numbers, 78888899999, are the leaves. For the skewed variable each leaf represents 1 case. If you combine the stem and leaf you get the scores for that variable. There is one score of 17, five scores of 18, and five scores of 19.

The second row for the skewed variable is -

    21.00        2 .  000000001112222222333

There are 21 scores in the second row of the stem-and-leaf plot for the skewed variable. They are distributed as follows: eight 20's, three 21's, seven 22's, and three 23's.

The last row indicates that there is one extreme value; that value is greater than or equal to 58.

You should be sure that you understand how the stem-and-leaf plots for the normal and uniform variables represent the values of those variables.
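If it helps, the decoding of a row can be written out as a short Python sketch. It assumes whole-number scores and a stem width of 10, so that each leaf is the 1's digit, as for the skewed variable; the function name is made up for this illustration.

```python
from collections import Counter


def decode_row(stem, leaves, stem_width=10):
    """Turn one stem-and-leaf row back into scores.

    Assumes each leaf is a single digit worth one tenth of the stem width.
    """
    leaf_unit = stem_width // 10
    return [stem * stem_width + int(leaf) * leaf_unit for leaf in leaves]


# The first two rows of the plot for the skewed variable.
first_row = decode_row(1, "78888899999")
second_row = decode_row(2, "000000001112222222333")

print(Counter(first_row))   # how many times each score in the 10's row occurs
print(Counter(second_row))  # how many times each score in the first 20's row occurs
```

The number of decoded scores in each row should match the frequency printed at the start of that row.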


7. Boxplots

Figure 3. The elements of a boxplot.

The basic elements of a boxplot are shown in Figure 3. The top of the box is called the upper fourth. It is at the 75th percentile of the scores. The bottom of the box is called the lower fourth. It is at the 25th percentile of the scores. Therefore 50% of the scores fall within the box. The interquartile range is the distance between the upper fourth and the lower fourth. The horizontal line through the box represents the median.

The ends of the whiskers represent the largest and smallest values that are not outliers.

An outlier, O, is defined as a value that falls more than 1.5 box-lengths below the lower fourth or above the upper fourth. The box-length is defined as the interquartile range.

An extreme value, E, is defined as a value that falls more than 3 box-lengths below the lower fourth or above the upper fourth.

You should be wary of "outliers" or "extreme" values. Outliers and extreme values will tend to bias statistics that are based on "interval" level data. A single extreme value can drastically change the correlation between two variables. Go back to check whether the outliers or extreme values were scored correctly and entered into the data file correctly. Ask yourself whether there was anything unusual about the experimental conditions for that respondent.

Figure 4. Boxplots for the normal, skewed, and uniform variables.

In Figure 4 the normal variable is represented by the boxplot at the left, the skewed variable by the middle boxplot, and the uniform variable by the boxplot at the right.

There are no extreme cases in any of these boxplots. There is one outlier in the skewed variable. It is represented by the letter O above the top of the whisker. The case number of the outlier is given next to the letter O, although it is too small to see in these boxplots.  

Let's compute the ranges of the scores that would be considered outliers and extremes at the high end of the skewed scores. The score at the 25th percentile (the lower fourth) is 20.0; the score at the 75th percentile (the upper fourth) is 34.0. The interquartile range (the box-length) for the skewed scores is therefore 14.0. Scores that are beyond 1.5 box-lengths from the upper (or lower) fourth are considered to be outliers. Therefore scores from this skewed distribution that are greater than 55 (34 + 1.5*14) are outliers. Scores that are beyond 3 box-lengths from the upper (or lower) fourth are considered to be extreme scores. Therefore scores from this skewed distribution that are greater than 76 (34 + 3*14) are extreme scores. The case number of the identified outlier is #38. The score on the skewed variable for case #38 is 58.0. That score is greater than 55 so it is an outlier. It is not greater than 76 so it is not an extreme score.
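The same arithmetic can be checked with a small Python sketch. The lower and upper fourths (20 and 34) are the values used in the paragraph above; the function names and the labels "outlier", "extreme", and "inside" are just for illustration.

```python
def fences(lower_fourth, upper_fourth):
    """Return the outlier and extreme cutoffs on both sides of the box."""
    box_length = upper_fourth - lower_fourth  # the interquartile range
    return {
        "outlier_low": lower_fourth - 1.5 * box_length,
        "outlier_high": upper_fourth + 1.5 * box_length,
        "extreme_low": lower_fourth - 3 * box_length,
        "extreme_high": upper_fourth + 3 * box_length,
    }


def classify(value, lower_fourth, upper_fourth):
    """Label a score as 'extreme', 'outlier', or 'inside' the whiskers."""
    f = fences(lower_fourth, upper_fourth)
    if value < f["extreme_low"] or value > f["extreme_high"]:
        return "extreme"
    if value < f["outlier_low"] or value > f["outlier_high"]:
        return "outlier"
    return "inside"


skewed_fences = fences(20, 34)
print(skewed_fences["outlier_high"])  # 34 + 1.5 * 14
print(skewed_fences["extreme_high"])  # 34 + 3 * 14
print(classify(58, 20, 34))           # the score for case #38
```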

Normally distributed scores typically have whiskers that are about the same length. The box is typically smaller than the whiskers.

Skewed scores typically have one whisker that is markedly longer than the other whisker, and usually have outliers and extreme scores that cluster on one side of the box.

Uniform scores typically have a large box size relative to the size of their whiskers.


8. Normality Plots

Statistics such as the t-test and analysis of variance assume that the distributions of the scores in each of the cells of the design are normally distributed. Explore provides several ways of testing whether the normality assumption has been met. We have already discussed using the skewness and kurtosis statistics to examine specific departures from normality. Normality plots provide another tool for assessing whether the shape of a distribution is normal.

Figure 5. Normal Q-Q plots for the normal, skewed, and uniform variables.

Figure 6. Detrended normal Q-Q plots for the normal, skewed, and uniform variables.

Normal plots compare the obtained scores (on the x-axis) with expected z scores (on the y-axis). That is, a normal plot compares the actual scores with the z scores that would have been obtained if the scores were normally distributed. The expected z scores are found by finding the percentile rank of each score, and then converting the percentile rank to a z score using a normal probability table. You could use the frequencies procedure to find the percentile rank for each score and use a table of areas under the standard normal distribution (found in most statistics books) to find the z score for that percentile. For example, the median score has a percentile rank of 50. The percentile rank of 50 converts to a z score of 0.00. A score at the lower fourth (the 25th percentile) converts to a z score of -0.67 and a score at the upper fourth (the 75th percentile) converts to a z score of +0.67.
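Instead of a printed table, the conversion from percentile rank to expected z score can be done with the inverse normal function in Python's standard library (Python 3.8 or later). The rank-to-percentile rule used in the second function is a simple mid-rank convention chosen for this sketch; it is an assumption and may not be the rule SPSS uses. Both function names are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def expected_z(percentile_rank):
    """Convert a percentile rank (strictly between 0 and 100) to a z score."""
    return STANDARD_NORMAL.inv_cdf(percentile_rank / 100.0)


def expected_z_for_sample(scores):
    """Pair each sorted score with its expected z score.

    Assumption: the percentile rank of the i-th sorted score (i = 1..n)
    is taken to be 100 * (i - 0.5) / n.
    """
    xs = sorted(scores)
    n = len(xs)
    return [(x, expected_z(100.0 * (i - 0.5) / n)) for i, x in enumerate(xs, start=1)]


print(round(expected_z(50), 2))  # the median
print(round(expected_z(25), 2))  # the lower fourth, about -0.674 before rounding
print(round(expected_z(75), 2))  # the upper fourth, about +0.674 before rounding
```

Plotting the pairs returned by the second function, with the score on the x-axis and the expected z score on the y-axis, gives the points of a normal plot.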

If the scores are normally distributed then the plot of the raw scores vs. the expected z scores will be a straight line. The plot of the scores will fall on top of the straight green line that runs from the bottom left to the top right of the plot. The plots of scores that are not normally distributed will deviate from that straight line. Compare the normal plots for the skewed and uniform variables with that of the normal variable. The plot of the normal variable is closely aligned with the straight line (the left plot in Figure 5). The plot for the skewed variable (the middle plot in Figure 5) shows marked departures from the straight line for the low scores. The uniform variable (the right plot in Figure 5) shows departures from the straight line at both the low and the high scores.

The detrended normal plots emphasize the deviations from the theoretical straight line. Think of taking the normal plot and rotating it so that the diagonal green line is horizontal. Then rescale the plot to emphasize the distance that each point falls above or below the now horizontal line. If the scores are normally distributed then some of the data points will fall above the horizontal line, others will fall below it, and the pattern of scores above and below the horizontal line will be random (see the detrended normal plot for the normal variable at the far left of Figure 6). Distributions that are not normal show distinct patterns in the detrended normal plot. A skewed distribution typically shows a "J" or an inverted "U" shape (see the middle plot of Figure 6). A uniform distribution typically shows an "S"-like shape (see the right plot of Figure 6).

I personally find the exact shapes hard to determine. But usually the appearance of a clear shape rather than a random pattern is a warning that the data are not normally distributed. You will still want to look at the statistical tests of normality. When you ask for normality plots SPSS will also print out an additional statistical test of normality. The Kolmogorov-Smirnov statistic will be printed. If there are 50 or fewer cases then the Shapiro-Wilk statistic is also printed. These statistics take into account departures from normality with respect to both skewness and kurtosis. In a sense they combine all departures from normality. When we looked at the confidence intervals for skewness and kurtosis we were looking at only a single type of departure from normality.

The null hypothesis for these tests is that the scores are normally distributed. If the statistic is not significant then we cannot reject the null hypothesis that the distribution is normal. If the statistic is significant, then we do reject the null hypothesis that the distribution is normal. To put it another way, if the significance level is p < .05, then we conclude that the scores are not normally distributed.

Table 5. Tests of Normality

           Kolmogorov-Smirnov(a)
           Statistic   df   Sig.
NORMAL     .040        75   .200(*)
SKEWED     .152        75   .000
UNIFORM    .106        75   .037

* This is a lower bound of the true significance.
a Lilliefors Significance Correction

As shown in Table 5, the Kolmogorov-Smirnov(a) statistic is not significant for the normal variable, K-S(75) = .040, p > .200, but it is significant for both the skewed variable, K-S(75) = .152, p < .0005, and the uniform variable, K-S(75) = .106, p = .037.

Note that the printed significance level for the skewed variable was .000. It should be read as p < .0005. SPSS has rounded the significance level to three decimal places. If the significance level at four decimal places were .0005 or greater, then the rounded significance level at three decimal places would have been .001. In general, if the significance level is printed as all zeros, then add a 5 after the last zero and report the significance level as p < that amount.

A caution. The K-S and the Shapiro-Wilk statistics are sensitive to both skewness and kurtosis. In general, statistics based on interval scales of measurement tend to be "robust" if the distributions are symmetric. That is, kurtosis tends not to be a serious problem for the t test and the analysis of variance. On the other hand, differences in skewness are more serious. If the K-S statistic is not significant, then there is no problem. If it is significant, then you should examine whether the departure from normality is due to skewness, kurtosis, or both.
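The reporting rule for all-zero significance levels can be written as a tiny Python helper. The function name is made up, and it assumes the significance level is passed in exactly as printed, as text.

```python
def report_p(printed):
    """Turn a printed significance level such as '.000' into a reportable string.

    Assumes `printed` is text with a decimal point, e.g. '.000' or '.037'.
    """
    digits = printed.split(".")[1]
    if set(digits) == {"0"}:
        # All zeros: add a 5 after the last zero and report "less than".
        return "p < ." + digits + "5"
    return "p = ." + digits


print(report_p(".000"))  # the skewed variable
print(report_p(".037"))  # the uniform variable
```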


9. Outliers

The case numbers and values for the five highest and five lowest values of each variable are displayed when the Outliers statistics option is chosen. Contrary to the implied meaning of this option, the values are not necessarily either outliers or extreme values (as defined in the boxplots section above); they are just the five values at each end of the distribution.

Table 6. Extreme Values

                        Case Number   Value
NORMAL    Highest  1    19            10.60
                   2    9             9.73
                   3    44            9.27
                   4    47            9.04
                   5    28            8.98
          Lowest   1    30            -8.82
                   2    43            -8.43
                   3    26            -7.46
                   4    7             -7.12
                   5    4             -6.72
SKEWED    Highest  1    38            58.00
                   2    20            48.00
                   3    39            45.00
                   4    54            44.00
                   5    44            42.00
          Lowest   1    33            17.00
                   2    11            18.00
                   3    71            18.00
                   4    22            18.00
                   5    74            .(a)
UNIFORM   Highest  1    28            4.97
                   2    63            4.95
                   3    54            4.94
                   4    51            4.90
                   5    42            4.85
          Lowest   1    20            .12
                   2    25            .17
                   3    29            .22
                   4    75            .22
                   5    31            .23

a Only a partial list of cases with the value 18 are shown in the table of lower extremes.


10. M-estimators

M-estimators are estimates of central tendency that are alternatives to the mean and median. Unless you are familiar with these estimators, how they are computed, and when it is appropriate to use them, you shouldn't report them. The M-estimators reported in the explore procedure are not included in our introductory statistics book, nor in the advanced statistics books that I am familiar with. The advantage of powerful statistics packages like SPSS and SAS is that they provide easy access to a wide range of statistics. That is also their most serious drawback. You should be very cautious in reporting statistics that are unfamiliar to you. If you have any doubts, seek expert advice or make yourself an expert on the statistic.

M-estimators are alternative estimates of central tendency that are robust to deviations from normality. The estimators provided by SPSS were designed for symmetric distributions. They could be used to describe demographic data such as income. I have not seen these estimators used in the psychology literature; they may be more appropriate for sociological, business, and economic data.

Table 7. M-Estimators

           Huber's          Tukey's       Hampel's         Andrews'
           M-Estimator(a)   Biweight(b)   M-Estimator(c)   Wave(d)
NORMAL     .4665            .4491         .4914            .4502
SKEWED     27.1531          27.1264       27.5648          27.1232
UNIFORM    2.8197           2.8000        2.7757           2.8000

a The weighting constant is 1.339.
b The weighting constant is 4.685.
c The weighting constants are 1.700, 3.400, and 8.500.
d The weighting constant is 1.340*pi.


M-estimators are computed by assigning different weights to values depending on each value's distance from the center of the distribution. As a value becomes more distant from the center of the distribution, its weight becomes smaller. Different M-estimators have different weighting systems.
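To make the weighting idea concrete, here is a rough Python sketch of a Huber-type estimator computed as a repeatedly reweighted mean. It uses the weighting constant 1.339 from Table 7, but the choice of scale (the median absolute deviation from the median), the starting value, and the stopping rule are assumptions made for this illustration, so it should not be expected to reproduce the SPSS values exactly.

```python
import statistics


def huber_m_estimate(values, c=1.339, tol=1e-8, max_iter=200):
    """Huber-type estimate of the center, found by iterative reweighting.

    Assumptions of this sketch: distances are scaled by the median absolute
    deviation from the median, and the median is the starting value.
    """
    xs = list(values)
    center = statistics.median(xs)
    scale = statistics.median([abs(x - center) for x in xs])
    if scale == 0:
        return center  # more than half the scores are identical
    for _ in range(max_iter):
        weights = []
        for x in xs:
            u = abs(x - center) / scale
            # Full weight near the center, shrinking weight farther out.
            weights.append(1.0 if u <= c else c / u)
        new_center = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
        if abs(new_center - center) < tol:
            return new_center
        center = new_center
    return center


# Made-up scores with one wild value.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(statistics.mean(scores))   # pulled toward the wild value
print(huber_m_estimate(scores))  # stays near the bulk of the scores
```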

If an M-estimator is reported, then the spread of the distribution should be reported as percentiles rather than as standard deviations.

Table 8. Percentiles

                                Percentiles
                                5         10        25        50        75        90        95
Weighted Average    NORMAL      -7.1881   -6.0947   -2.8295   .6733     3.8035    6.6844    9.0832
(Definition 1)      SKEWED      18.0000   19.0000   20.0000   27.0000   34.0000   41.0000   44.2000
                    UNIFORM     .2230     .5730     1.5637    2.7537    4.1642    4.6962    4.9035
Tukey's Hinges      NORMAL                          -2.8029   .6733     3.5399
                    SKEWED                          20.5000   27.0000   33.5000
                    UNIFORM                         1.6090    2.7537    4.1634


© Lee A. Becker, 1997-1999 - revised 09/24/99