Statistics II: Examining Relationships between Two Variables


In this chapter you will learn:
How to define association.
What chi-square represents.
Which measure of association is most appropriate when comparing variables with different levels of measurement.
In political science research, we are generally less concerned with describing distributions on single variables than we are with determining whether, how, and to what extent two or more variables may be related to one another. It is these bivariate (two-variable) and multivariate (more-than-two-variable) relationships that usually cast light on the more interesting research questions.
When examining the relationship between two variables, we typically ask three important questions. The first is whether and to what extent changes or differences in the values of one variable—generally the independent variable—are associated with changes or differences in the values of the second, or dependent, variable. The second question examines the direction and form of any association that might exist. The third considers the likelihood that any association observed among cases sampled from a larger population is in fact a characteristic of that population and not merely an artifact of the smaller and potentially unrepresentative sample. In this chapter we introduce some of the statistics that are most commonly used to answer these questions, and we explain when it is appropriate to use them and what they tell us about relationships.
MEASURES OF ASSOCIATION AND STATISTICAL SIGNIFICANCE
An association is said to exist between two variables when knowing the value of one for a given case improves the odds of our guessing correctly the corresponding value of the second. If, for example, we examine the relationship between the size of a country’s population and the proportion of its adults who are college educated, we may variously find (1) that larger countries generally have a greater proportion of college-educated adults than smaller ones, (2) that smaller countries generally have a greater proportion of college-educated adults than larger ones, or (3) that there is no systematic difference between the two—that some countries from each group have relatively high proportions of such people but that some from each group have low proportions as well. If our research shows that either case 1 or case 2 holds, we can use our knowledge of values on the independent variable, size of population, to guess or predict values on the dependent variable, proportion of adults who are college educated, for any given country. In the first instance, for any heavily populated country, we predict a relatively high proportion of college-educated adults, and for a less populous nation, we predict a lower proportion. In the second, our prediction is precisely reversed. In either event, although we may not guess every case correctly, we will be right fairly often because of the underlying association between the two variables. Indeed, the stronger the association between the two variables (the more the individual countries’ educational level values tend to align on each in precisely the same order), the more likely we are to guess correctly in any particular instance. If there is total correspondence in the alignments on the two variables, high scores with high scores or, alternatively, high scores on one with low on the other, we can predict one from the other with perfect accuracy. This contrasts sharply with the third possibility, which permits no improved prediction of values on the education variable based on our knowledge of populations. In such instances, when cases are, in effect, randomly distributed on the two variables, there is said to be no association.
To get a mental picture of what a strong association might look like, consider the two maps presented in Figure 17.1, which relate to the murder rate in Washington, DC, during the “crack wars” of the 1980s. Figure 17.1(a) shows the location of known drug markets in the nation’s capital; Figure 17.1(b) shows the location of homicides. Both are based on information provided by the DC Metropolitan Police Department. The apparent similarity in the locations of clusters of drug dealing and murders suggests an association between the two phenomena.
Measuring Association
Clearly there can be more or less association between any two variables. The question in each instance then becomes, Just how much association is there? The answer is provided by a set of statistics known as coefficients of association. A coefficient of association is a number that summarizes the amount of improvement in guessing values on one variable for any case based on knowledge of the values of a second. In the example, for instance, such a measure would tell us how much our knowledge of a country’s population size helps us in guessing its proportion of college-educated adults. The higher the coefficient, the stronger the association and, by extension, the better our predictive or explanatory ability. In general, coefficients of association range from 0 to 1 or from –1 to +1, with values closest to 1 in absolute value indicating a relatively strong association and those closest to 0 a relatively weak one.
In addition to the magnitude of association, it is also useful to know the direction or form of the relationship between two variables. Take another look at the earlier example about level of education of a nation’s adults, and most particularly at options 1 and 2. We have already suggested that the closer we get to either case, the higher will be our coefficient of association and the better our chances of guessing a particular country’s proportion of college-educated adults based on our knowledge of its population size. It should be obvious, however, that our predictions in the cases are precisely opposite. In the first instance, higher values of one variable tend to be associated with higher values of the other, and in the latter instance, higher values of one tend to be associated with lower values of the other. Such relationships are said to display differences in direction. Those like the first, in which both variables rise and fall together, are termed direct, or positive, associations. Those like the second, in which scores move systematically in opposing directions, are termed inverse, or negative, associations. This additional bit of information, which is represented by a plus or a minus sign accompanying the coefficient of association, makes our guessing even more effective. Thus, a coefficient of –.87 (negative and relatively close to negative 1) might describe a relatively strong relationship in which the values on the two variables in question are inversely related (moving in opposite directions), whereas a coefficient of .20 (positive—the plus sign is usually omitted—and rather close to zero) might describe a weak direct association.
FIGURE 17.1 Drug markets and homicide locations, Washington, DC, 1988.

Source: Reprinted from the Washington Post, January 13, 1989, p. E1, with permission of the publisher.
Defining Statistical Significance
Finally, we should say a word about tests of statistical significance, though our discussion of the topic will be purposely limited.1 You will recall from our discussion of levels of confidence and sampling error in Chapter 7 that when we draw a presumably representative sample and use that sample to develop conclusions about the larger population from which it is drawn, we run some risk of coming to incorrect conclusions. This is true because there is a chance that the sample is not in fact representative and that the actual error in our measurement exceeds that specified for a given sample size (Tables A.2 and A.3 in Appendix A). The chance of such improper generalizing is known, but we cannot tell whether or not it has occurred in any particular instance. For a level of confidence of .95, that chance is .05, or 1 – .95. For a level of confidence of .99, it is .01. These values represent the likelihood that any generalization from our sample to the larger population, even allowing for the estimated range of sampling error, is simply wrong.
Tests of statistical significance perform the same function in evaluating measures of association. They tell us just how likely it is that the association we have measured between two variables in a sample might or might not exist in the whole population. Let us see if we can clarify this point.
1 A full explanation of statistical significance is beyond the scope of this text; to pursue a deeper understanding of significance testing, you are encouraged to consult one of the statistics texts listed at the end of Chapter 18.
An Example
Suppose, to continue our example, we have a population of 200 nations for which we know for a fact that the coefficient of association between population size and the proportion of adults with a college education is 0. There is, in reality, no relationship between the two variables. But suppose further that we take a sample of only 30 of these countries and calculate the association between these two variables. It might come out as 0, but this is actually unlikely, because the strength of association is now based not on all the countries but on only 30 and will probably reflect their particular idiosyncrasies. In other words, the coefficient itself is determined by which 30 countries we pick. If, by chance, we pick 30 countries that are truly representative of all 200, we will in fact find no association. But chance might also lead us to pick 30 countries for which the association between population size and education level is unusually high, say, .60. In that case, our coefficient of association measures a characteristic of the particular sample in question, but if we generalize to the larger population, our conclusions will be incorrect. If we knew this, of course, we would reject our measure of association based on this particular sample.
The problem is that in the real world we seldom know the underlying population parameter, which is the true degree of association in the whole population (as defined in Chapter 7). Indeed, the reason to draw samples in the first place is exactly because we often simply cannot study whole populations. It follows, then, that more often than not the only measures of association we will have will be those based on our samples. Moreover, these calculations will usually be based on only one sample. Thus, the question becomes one of how confident we can be that a measure of association based on a single subgroup of a population accurately reflects an underlying population characteristic. The job of the test of statistical significance is to pin a number on that confidence, that is, to measure the probability or likelihood that we are making an appropriate, or, conversely, an inappropriate, generalization.
To see how this works, let us continue our example. Suppose that we draw not one sample of 30 nations from our population of 200, but 1,000 separate and independent samples of the same size and that for each we calculate the coefficient of association. Because the true coefficient for the entire population is in fact 0, most of the coefficients for our 1,000 samples will also be at or relatively near 0. Some particular combinations of 30 countries may yield relatively higher values (that is, we might by chance happen to pick only countries scoring either high-high or low-low on the two variables), but the majority will be nearer to the population parameter. Indeed, the closer one gets to the true value, the more samples one finds represented. These distributions, in fact, often resemble the normal curve mentioned earlier. This is illustrated in Figure 17.2, where the height of the curve at any given point represents the number of samples for which the coefficient of association noted along the baseline has been calculated. As you can see, most of the sample coefficients cluster around the true population parameter.
What, then, is the likelihood that any particular coefficient is simply a chance variation around a true parameter of 0? Or, in other words, if we take a sample from some population and find in that sample a strong association, what are the chances that we will be wrong in generalizing so strong a relationship from the sample to the population? The normal curve has certain properties which enable us to answer this question with considerable precision.
FIGURE 17.2 Normal distribution of coefficients of association for samples of 30 cases.

Suppose, for example, we draw from our 200 nations a sample of 30 for which the coefficient of association is –.75. How likely is it that the corresponding coefficient for the population as a whole is 0? From Figure 17.2, the answer must be a resounding Not very! The area under the curve represents all 1,000 (actually any number of) sample coefficients when the true parameter is 0. The much smaller shaded area at and to the left of –.75 represents the proportion of such coefficients that are negative in direction and .75 or stronger in magnitude. Such cases constitute a very small proportion of the many sample coefficients. For this reason, the odds of drawing such a sample in any given try are quite slim. If 5 percent of all samples lie in this area, for instance, then only one time in twenty will we be likely to encounter a sample from a population with a true coefficient of 0 for which we find a coefficient in our sample of –.75. Yet that is precisely what we have found in this instance.
In other words, we have just drawn a sample with a characteristic that has a 5 percent likelihood of being an erroneous representation of a population in which the two variables in question are not associated with each other. Thus, if we claim on the basis of our sample that the two variables are in fact associated in the larger population (that is, if we generalize our results from the sample to the population), we can expect to be wrong 5 percent of the time. That means, of course, that we will be right 95 percent of the time, and those are not bad odds. Indeed, levels of statistical significance of .05 (a 5 percent chance of erroneous generalization), .01 (a 1 percent chance of such error), and .001 (a 1/10 of 1 percent chance of such error) are commonly accepted standards in social science research.
If we look again at Figure 17.2, it should be apparent that more extreme values such as –.75 are less likely to give rise to this kind of error in generalization than are those closer to the center (for example, a greater proportion of samples from such a population will, by chance, show coefficients of –.50 or stronger, and so forth). It seems, then, that we can never be very confident of the trustworthiness of weaker associations, since we can never eliminate the heavy odds that they are simply chance occurrences in a population with a true coefficient of 0.
We can increase our confidence in our sample simply by increasing our sample size. If instead of 30 cases per sample we draw 100 or 150, the resulting sample coefficients will be more likely to cluster around 0. In effect, the normal curve will be progressively squeezed toward the middle, as illustrated in Figure 17.3, until ultimately there is only one possible outcome: the true parameter. In the process, with a set of sufficiently large samples, even a coefficient of association of .10 or .01 can be shown to have acceptable levels of statistical significance. We can conclude, then, that some combination of sufficiently extreme scores and sufficiently large samples allows us to reduce to tolerable levels the likelihood of incorrectly generalizing from our data.
FIGURE 17.3 Sampling distribution for differing numbers of cases in a population of 200.

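To see this squeezing effect directly, the short simulation sketch below (ours, not part of the text) draws repeated samples from a made-up population of 200 cases in which the two variables are unrelated, computes a coefficient for each sample (Pearson's r, taken up later in this chapter, serves here as a stand-in for a generic coefficient of association, and the numpy library is assumed), and reports how widely the sample coefficients scatter around the true value of 0 at different sample sizes.

```python
# A simulation sketch (not from the chapter) illustrating Figures 17.2 and 17.3:
# when two variables are unrelated in the population, coefficients computed from
# small samples scatter widely around 0, and the scatter shrinks as samples grow.
import numpy as np

rng = np.random.default_rng(0)

# A made-up population of 200 cases in which the two variables are unrelated.
population = rng.normal(size=(200, 2))

def sample_coefficients(n_cases, n_samples=1000):
    """Draw n_samples random samples of n_cases each and return their coefficients."""
    coefs = []
    for _ in range(n_samples):
        rows = rng.choice(200, size=n_cases, replace=False)
        x, y = population[rows, 0], population[rows, 1]
        coefs.append(np.corrcoef(x, y)[0, 1])
    return np.array(coefs)

for n in (30, 100, 150):
    c = sample_coefficients(n)
    print(f"n = {n:3d}: mean = {c.mean():+.3f}, "
          f"spread (std) = {c.std():.3f}, "
          f"share at or beyond +/-.30 = {np.mean(np.abs(c) >= .30):.3f}")
```

On a typical run, a noticeable share of the 30-case samples produce coefficients of ±.30 or stronger purely by chance, while such values all but disappear at 100 or 150 cases.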
In the balance of this chapter we present a brief discussion of the most common measures of association and significance for each of the three levels of measurement. Although the procedures employed in calculating each of these measures differ, the purpose in each case, as well as the interpretation of the result, will remain relatively consistent, for each coefficient of association is designed to tell us to what extent our guessing of values on one variable is improved by knowledge of the corresponding values on another. Each test of significance tells us the probability that any relationship observed in a sample reflects the chance peculiarities of that sample rather than an underlying relationship in the base population.
The examples we use to illustrate these statistics involve comparisons of variables that are operationalized at the same level of measurement. However, researchers often want to look for relationships between variables that are at different levels of measurement (as in the case of an ordinal-level independent variable such as socioeconomic status and a nominal-level dependent variable such as party identification). To select the correct statistic in these situations, you need to be aware of a simple rule: You can use a statistic designed for a lower level of measurement with data at a higher level of measurement, but you may not do the reverse—doing so would produce statistically meaningless results. It would, for example, be legitimate to use a statistic designed for the nominal level with ordinal-level data, but illegitimate to use an ordinal-level statistic with nominal-level data. This means that when comparing variables that are measured at different levels of measurement, you must choose a statistic suitable to the lower of the two levels.
MEASURES OF ASSOCIATION AND SIGNIFICANCE FOR NOMINAL VARIABLES: LAMBDA
A widely used coefficient of association for two nominal variables where one is treated as independent and the other dependent is λ (lambda).2 Lambda measures the percentage of improvement in guessing values on the dependent variable on the basis of knowledge of values on the independent variable when both variables consist of categories having no rank, distance, or direction.
An Example
Suppose we measure the party identification of 100 respondents and uncover the following frequency distribution:
Democrats        50
Republicans      30
Independents     20
Suppose further that we want to guess the party identification of each individual respondent, that we must make the same guess for all individuals, and that we want to make as few mistakes as possible. The obvious strategy is simply to guess the mode (the most populous category), or Democratic, every time. We will be correct 50 times (for the 50 Democrats) and incorrect 50 times (for the 30 Republicans and 20 Independents), not an especially noteworthy record but still the best we can do. For if we guess Republican each time, we will be wrong 70 times, and a guess of Independent will lead to 80 incorrect predictions. The mode, then, provides the best guess based on the available information.
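In code, this guessing strategy amounts to counting the errors each constant guess would produce and keeping the guess with the fewest. A minimal sketch (ours, not the text's), using the frequencies above:

```python
# Counting, for each possible constant guess, how many of the 100 respondents
# would be guessed incorrectly; the mode (Democrat) minimizes the errors.
party_counts = {"Democrat": 50, "Republican": 30, "Independent": 20}
total = sum(party_counts.values())

errors = {guess: total - count for guess, count in party_counts.items()}
print(errors)                          # {'Democrat': 50, 'Republican': 70, 'Independent': 80}

best_guess = min(errors, key=errors.get)
print(best_guess, errors[best_guess])  # Democrat 50
```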
2 Actually, the statistic we shall describe here is lambda asymmetric (λa), a measure that tests association in only one direction (from the independent to the dependent variable). A test of mutual association, the true λ, is also available.
TABLE 17.1 Paternal Basis for Party Identification

                                   Respondent’s Party Identification
Father’s Party Identification      Dem.     Rep.     Ind.     Totals
Democratic                          45        5       10        60
Republican                           2       23        5        30
Independent                          3        2        5        10
Total                               50       30       20       100
But suppose we have a second piece of information—the party identification of each respondent’s father—with the following frequency distribution:
Democrats        60
Republicans      30
Independents     10
If these two variables are related to each other—that is, if one is likely to have the same party identification as one’s father—then knowing the party preference of each respondent’s father should help us to improve our guessing of that respondent’s own preference. This will be the case if, by guessing for each respondent not the mode of the overall distribution, as we did before, but simply that person’s father’s party preference, we can reduce our incorrect predictions to fewer than the 50 cases we originally guessed wrongly.
To examine a possible association between these variables, we construct a crosstab summarizing the distribution of cases on these two variables. In Table 17.1, the independent, or predictor, variable (father’s party identification) is the row variable, and its overall distribution is summarized to the right of the table. The dependent variable (respondent’s party identification) is the column variable, and its overall distribution is summarized below the table. The numbers in the cells have been assigned arbitrarily, although in the real world they would, of course, be determined by the research itself.
With this table we can use parental preference to predict respondent’s preference. To do this, we use the mode just as before, but apply it within each category on the independent variable rather than to the whole set of cases. Thus, for those respondents whose father is identified as a Democrat, we guess a preference for the same party. We are correct 45 times and incorrect 15 (for the 5 Republicans and 10 Independents). For those whose father is identified as a Republican, we guess Republican. We are correct 23 times and incorrect 7. And for those whose father is identified as an Independent, we guess a similar preference and are correct 5 out of 10 times. Combining these results, we find that we are now able to guess correctly 73 times and are still wrong 27 times. Thus, knowledge of the second variable has clearly improved our guessing. To ascertain the precise percentage of that improvement, we use the general formula for a coefficient of association:
coefficient of association = (errors made without knowledge of the independent variable – errors made with knowledge of the independent variable) / errors made without knowledge of the independent variable
In the present instance, this is
(50 – 27) / 50 = 23/50 = .46
By using father’s party identification as a predictor of respondent’s party identification, we are able to improve (reduce the error in) our guessing by some 46 percent.
The formula for calculating λ, which will bring us to the same result though by a slightly different route, is

λ = (sum of the largest cell frequency within each category of the independent variable – largest marginal total of the dependent variable) / (N – largest marginal total of the dependent variable)

In the present example, λ = [(45 + 23 + 5) – 50] / (100 – 50) = 23/50 = .46.
Lambda ranges from 0 to 1, with higher values (those closer to 1) indicating a stronger association. Because nominal variables have no direction, λ will always be positive.
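The arithmetic is easy to check by machine. The sketch below (ours, not the text's, and assuming the numpy library) computes λ directly from the cell counts of Table 17.1: the errors made guessing the overall mode, the errors made guessing the mode within each category of father's party, and the proportional reduction between the two.

```python
import numpy as np

# Cell counts from Table 17.1: rows are father's party (the independent
# variable), columns are respondent's party (the dependent variable).
table = np.array([[45,  5, 10],   # father Democratic
                  [ 2, 23,  5],   # father Republican
                  [ 3,  2,  5]])  # father Independent

n = table.sum()                                  # 100 respondents
errors_without = n - table.sum(axis=0).max()     # guessing the overall mode: 100 - 50 = 50
errors_with = n - table.max(axis=1).sum()        # guessing the mode within each row: 100 - 73 = 27

lam = (errors_without - errors_with) / errors_without
print(lam)                                       # 0.46
```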
Our next step is to decide whether the relationship summarized by ? arises from a true population parameter or from mere chance. That is, we must decide whether the relationship is statistically significant.
Chi-Square
The test of statistical significance for nominal variables is χ² (chi-square). This statistic tells us whether an apparent nominal-level association between two variables, such as the one we have just observed, is likely to result from chance. It does so by comparing the results actually observed with those that would be expected if no real relationship existed. The calculation of χ², too, begins with a crosstab. Consider Table 17.2, which has the same marginals for each variable as Table 17.1 but does not include any distribution of cases within the cells.
To begin the determination of χ², we ask ourselves what value would be expected in each cell, given these overall totals, if there were no association between the two variables. Of the 60 cases whose father was a Democrat, for instance, we would expect half (50/100) to be Democrats, three in ten (30/100) to be Republicans, and one in five (20/100) to be Independents, or, in other words, 30 Democrats, 18 Republicans, and 12 Independents. Similarly, we can arrive at expected values for those with a Republican or Independent father. These expected values are summarized in Table 17.3.
TABLE 17.2 Paternal Basis for Party Identification: Marginal Values

                                   Respondent’s Party Identification
Father’s Party Identification      Dem.     Rep.     Ind.     Totals
Democratic                                                       60
Republican                                                       30
Independent                                                      10
Total                               50       30       20        100
TABLE 17.3 Paternal Basis for Party Identification: Expected Values

                                   Respondent’s Party Identification
Father’s Party Identification      Dem.     Rep.     Ind.     Totals
Democratic                          30       18       12        60
Republican                          15        9        6        30
Independent                          5        3        2        10
Total                               50       30       20       100
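Each expected value in Table 17.3 is simply the corresponding row total multiplied by the corresponding column total and divided by the total number of cases. A brief sketch (ours, assuming numpy) reproduces the table:

```python
import numpy as np

observed = np.array([[45,  5, 10],
                     [ 2, 23,  5],
                     [ 3,  2,  5]])

row_totals = observed.sum(axis=1)   # [60, 30, 10] -- father's party
col_totals = observed.sum(axis=0)   # [50, 30, 20] -- respondent's party
n = observed.sum()                  # 100

# Expected count in each cell = (row total * column total) / n
expected = np.outer(row_totals, col_totals) / n
print(expected)
# [[30. 18. 12.]
#  [15.  9.  6.]
#  [ 5.  3.  2.]]
```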
The question then becomes, are the values we have actually observed in Table 17.1 so different (so extreme) from those that Table 17.3 would lead us to expect if there were, in reality, no relationship between the two variables, that we can be reasonably confident of the validity of our result? Chi-square is a device for comparing the two tables to find an answer to this question. The equation for χ² is

χ² = Σ [(f_o – f_e)² / f_e]

where f_o is the frequency actually observed in a given cell and f_e is the frequency expected in that cell if the two variables were unrelated.
We calculate χ² by filling in the values in Table 17.4 for each cell in a given table. The ordering of the cells in the table is of no importance, but f_o (taken from Table 17.1) and f_e (taken from Table 17.3) for any particular line must refer to the same cell. The rationale for first squaring the differences between f_o and f_e and then dividing by f_e is essentially the same as that for the treatment of variations around the mean in determining the standard deviation. Chi-square is determined by adding together all the numbers in the last column. In the example, this yields a value of 56.07.
TABLE 17.4 Values Used in Deriving χ²

f_o     f_e     f_o – f_e     (f_o – f_e)²     (f_o – f_e)²/f_e
 45      30         15             225                7.50
  5      18        –13             169                9.39
 10      12         –2               4                 .33
  2      15        –13             169               11.27
 23       9         14             196               21.78
  5       6         –1               1                 .17
  3       5         –2               4                 .80
  2       3         –1               1                 .33
  5       2          3               9                4.50
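The whole of Table 17.4, the χ² total, and the degrees of freedom taken up in the next subsection can be generated in a few lines. A sketch (ours, not the text's, assuming numpy):

```python
import numpy as np

observed = np.array([[45,  5, 10],
                     [ 2, 23,  5],
                     [ 3,  2,  5]], dtype=float)

# Expected counts under no association (Table 17.3).
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# One (f_o - f_e)^2 / f_e term per cell (the last column of Table 17.4).
contributions = (observed - expected) ** 2 / expected
chi_square = contributions.sum()

rows, cols = observed.shape
df = (rows - 1) * (cols - 1)

print(round(chi_square, 2), df)   # 56.07 4
```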
Degrees of Freedom
Before interpreting this number, we must make one further calculation, that of the so-called degrees of freedom. The degrees of freedom (df) in a table simply consist of the number of cells of that table that can be filled with numbers before the entries in all remaining cells are fixed and unchangeable. The formula for determining the degrees of freedom in any particular table is
df = (r – 1) (c – 1)

In the example, df = (3 – 1)(3 – 1) = 4
We are now ready to evaluate the statistical significance of our data. Table A.4 in Appendix A summarizes the significant values of χ² for different degrees of freedom at the .001, .01, and .05 levels. If the value of χ² we have calculated (56.07) exceeds the value listed in the table at any of these levels for a table with the specified degrees of freedom (4), the relationship we have observed is statistically significant at that level. In the present instance, for example, in order to be significant at the .001 level (that is, for us to run a risk of being wrong only one time in 1,000 when we accept the observed association as representative of the larger population), our observed χ² must exceed 18.467. Since it does, we can be quite confident in our result.
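The Table A.4 entry and the exact probability of a χ² this large can also be recovered computationally. A sketch (ours, assuming the scipy library is available):

```python
from scipy.stats import chi2

df = 4
critical_001 = chi2.ppf(1 - 0.001, df)   # critical value at the .001 level
p_value = chi2.sf(56.07, df)             # probability of a chi-square this large by chance

print(round(critical_001, 3))            # 18.467
print(p_value < 0.001)                   # True
```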
MEASURES OF ASSOCIATION AND SIGNIFICANCE FOR ORDINAL VARIABLES: GAMMA
A widely used coefficient of association for ordinal variables is G, or gamma, which works according to the same principle of error reduction as λ but focuses on predicting the ranking or relative position of cases rather than simply their membership in a particular class or category. The question treated by G is that of the degree to which the ranking of a case on one ordinal variable may be predicted if we know its ranking on a second ordinal variable.
When examining two such variables, there are two possible conditions of perfect predictability. The first, in which individual cases are ranked in exactly the same order on both variables (high scores with high scores, low scores with low), is termed perfect agreement. The second, in which cases are ranked in precisely the opposite order (highest scores on one variable with lowest on the other and the reverse), is termed perfect inversion. Therefore, predictability is a function of how close the rankings on these variables come to either perfect agreement (in which case G is positive and approaches 1) or perfect inversion (where G is negative and approaches –1). A value of G equal to 0 indicates the absence of association. The formula for calculating G is
G = (f_a – f_i) / (f_a + f_i)

where f_a is the number of agreements and f_i the number of inversions in the paired rankings.
G is based on the relative positions of a set of cases on two variables. The cases are first arranged in ascending order on the independent variable. Their rankings on the dependent variable are then compared. Those for which the original ordering is preserved are said to be in agreement, and those for which the original order is altered are said to be in inversion. Limitations of space do not permit us to consider this procedure in detail or to discuss the calculations of G when the number of cases is relatively small and/or no ties are present in the rankings. Rather, we shall focus on the procedures for calculating G under the more common circumstances, when ties (more than one case with the same rank) are present and the number of cases is large.3 Here, as before, we work from a crosstab, as shown in Table 17.5.
TABLE 17.5 Centralized Crosstabulation

                            Dependent Variable
Independent Variable        Low     Medium     High
Low                          a         b         c
Medium                       d         e         f
High                         g         h         i
To measure the association between these two variables, we determine the number of agreements and inversions relative to each cell in the table. An agreement occurs in any cell below (higher in its score on the independent variable) and to the right (higher in its score on the dependent variable) of the particular cell in question. Thus, agreements with those cases in cell a include all cases in cells e, f, h, and i, since these cases rank higher than those in cell a on both variables. An inversion occurs in any cell below (higher in its score on the independent variable) and to the left (lower in its score on the dependent variable) of the particular cell in question. Thus, inversions with those cases in cell c include all cases in cells d, e, g, and h, since these cases rank higher on one variable than those in cell c, but lower on the other. The frequency of agreements (f_a in the equation), then, is the sum for each cell of the number of cases in that cell multiplied by the number of cases in all cells below and to the right (a[e + f + h + i] + b[f + i] + d[h + i] + e[i]). The frequency of inversions (f_i in the equation) is the sum for each cell of the number of cases in that cell multiplied by the number of cases in all cells below and to the left (b[d + g] + c[d + e + g + h] + e[g] + f[g + h]). The resulting totals are simply substituted into the equation.
If, for example, the variables in Table 17.1 were ordinal, we could calculate G as follows:
f_a = 45(23 + 5 + 2 + 5) + 5(5 + 5) + 2(2 + 5) + 23(5) = 1,575 + 50 + 14 + 115 = 1,754
f_i = 5(2 + 3) + 10(2 + 23 + 3 + 2) + 23(3) + 5(3 + 2) = 25 + 300 + 69 + 25 = 419
G = (f_a – f_i) / (f_a + f_i) = (1,754 – 419) / (1,754 + 419) = 1,335 / 2,173 = .61
3 In such applications, G may be unreliable, but it is included here to facilitate the discussion of association as a concept. A related statistic, Kendall’s tau, may be more reliable, but its determination may be less intuitive to the beginning political scientist.
This tells us that there is 61 percent more agreement than disagreement in the rankings of the cases on the two variables. If f_i exceeded f_a, the sign of G would be negative, in order to indicate the existence of an inverse relationship.
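Counting agreements and inversions by hand becomes tedious for larger tables, but the bookkeeping is mechanical. The sketch below (ours, not the text's, assuming numpy) applies the cell layout of Table 17.5 to the counts of Table 17.1 and reproduces the figures above.

```python
import numpy as np

# Counts of Table 17.1 arranged as in Table 17.5: rows ordered low to high on
# the independent variable, columns ordered low to high on the dependent one.
table = np.array([[45,  5, 10],
                  [ 2, 23,  5],
                  [ 3,  2,  5]])

rows, cols = table.shape
f_a = f_i = 0
for r in range(rows):
    for c in range(cols):
        below_right = table[r + 1:, c + 1:].sum()   # cells agreeing with this one
        below_left = table[r + 1:, :c].sum()        # cells inverted with this one
        f_a += table[r, c] * below_right
        f_i += table[r, c] * below_left

gamma = (f_a - f_i) / (f_a + f_i)
print(f_a, f_i, round(gamma, 2))   # 1754 419 0.61
```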
The test of the statistical significance of G is based on the fact that the sampling distribution of G is approximately normal for a population with no true association, as was the sampling distribution of the hypothetical coefficient of association discussed earlier. Since this is so, we can determine the probability that any particular value of G has occurred by chance by calculating its standard score (z), locating its position under the normal curve, and assessing the probabilities. The actual calculation of z_G (the standard score of gamma) will not be presented here, because the formula is complex and its understanding requires a more detailed knowledge of statistics than we have provided. Suffice it to say that when z_G exceeds ±1.645 (when G lies at least 1.645 standard deviation units above or below the mean), G is sufficiently extreme to merit a significance level of .05, and that when z_G exceeds ±2.326 (when G lies at least 2.326 standard deviation units above or below the mean), G achieves significance at the .01 level. The interpretation of these results is precisely the same as that in the earlier and more general example.
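Those thresholds are simply points on the standard normal curve. The small sketch below (ours, assuming scipy is available and reading the thresholds as upper-tail critical values, as the text's parentheticals suggest) recovers them:

```python
from scipy.stats import norm

print(round(norm.ppf(1 - 0.05), 3))   # 1.645 -> the .05 threshold cited in the text
print(round(norm.ppf(1 - 0.01), 3))   # 2.326 -> the .01 threshold cited in the text
print(round(norm.sf(2.5), 4))         # upper-tail probability for an observed z of 2.5
```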
MEASURES OF ASSOCIATION AND SIGNIFICANCE FOR INTERVAL/RATIO VARIABLES: CORRELATION
The measure of association