Choosing the Number of Categories in Agree–Disagree Scales

Although agree–disagree (AD) rating scales suffer from acquiescence response bias, impose a greater cognitive burden, and yield data of lower quality, these scales remain popular with researchers due to practical considerations (e.g., ease of item preparation, speed of administration, and reduced administration costs). This article shows that if researchers want to use AD scales, they should offer 5 answer categories rather than 7 or 11, because the latter yield data of lower quality. This is shown using data from four multitrait-multimethod experiments implemented in the third round of the European Social Survey. The quality of items with different rating scale lengths was computed and compared.


Introduction
Although agree-disagree (AD) rating scales have been extremely popular in social science research questionnaires, they are susceptible to a host of biases and limitations. First, they are susceptible to acquiescence response bias (Krosnick 1991): Some respondents agree with the statement offered regardless of its content. For instance, if the statement is ''Immigration is bad for the economy,'' acquiescence bias will lead to more negative opinions being expressed than if the statement is ''Immigration is good for the economy.'' Some authors explain this tendency by people's natural disposition to be polite (e.g., Goldberg 1990); others believe that some respondents perceive the researchers to be experts and assume that if they make an assertion, it must be true (Lenski and Leggett 1960); still others attribute acquiescence to survey satisficing, a means of avoiding expending the effort needed to answer a question optimally by shortcutting the response process (Krosnick 1991). A recent study (Billiet and Davidov 2008) shows that acquiescence is quite stable over time, supporting the idea that acquiescence is a personality trait and not a circumstantial behavior.
Another drawback of AD scales is the imprecise mapping of the response dimension onto the underlying construct of interest, which leads to a more complex cognitive response process. This can be illustrated by breaking down the response process for AD scales into several steps. The classic decomposition comes from Tourangeau, Rips, and Rasinski (2000), who divide the question-answering process into four components: ''comprehension of the item, retrieval of relevant information, use of that information to make required judgments, and selection and reporting of an answer.'' Other authors, however, propose a slightly different decomposition focused on AD scales specifically (Carpenter and Just 1975; Clark and Clark 1977; Trabasso, Rollins, and Shaughnessy 1971): comprehension of the item, identification of the underlying dimension, positioning oneself on that dimension, and selecting one of the AD response options to express that position. This last step is potentially the problematic one (Fowler 1995; Saris et al. 2010), since the translation of a respondent's opinion into one of the proposed response categories is not obvious. For example, if the statement is ''Immigration is bad for the economy,'' and the respondent thinks that it is extremely bad, he or she may disagree with the statement, since the statement does not express his or her view. However, people may also disagree if they believe that immigration is good or very good for the economy or if they believe it is neither good nor bad (Saris and Gallhofer 2007). The AD scale may therefore mix people who hold very different underlying opinions into the same response category. As a result, the relationship of the response scale to the underlying construct is not monotonic in terms of expressing beliefs about the impact of immigration on the economy. 1 More generally, with AD scales, people can do the mapping in their own way, and this may create method effects (see, e.g., Saris et al. 2010, for more details).
Despite this issue, AD scales are still used quite often, probably for practical reasons. The same scale can be used to measure a wide array of constructs, and visual display of the scale is easy on paper questionnaires or in web surveys. Administration of the questionnaire is also easier and quicker, since the scale needs only to be explained once to the respondent, whereas with Item-Specific (IS) scales, a new rating scale must be presented for each item. For these reasons, AD scales may entail lower costs (e.g., less paper needed, less work for the interviewers, less preparation cost), which is always tempting. Furthermore, the long tradition of using AD scales in the social sciences may inspire researchers to reuse established batteries of items using this response format, even if they yield lower quality data.
Given the popularity of this measurement approach, researchers must decide the number of points to offer on an AD rating scale. Likert (1932) proposed that these scales should offer five points, but Dawes (2008) recently argued that comparable results are obtained from 7- to 10-point scales, which may yield more information than a shorter scale would. Indeed, the theory of information states that if more response categories are proposed, more information about the variable of interest can be obtained: For instance, a 2-point scale only allows assessment of the direction of the attitude, whereas a 3-point scale with a middle category allows assessment of both the direction and the neutrality; even more categories can also allow assessment of the intensity, and so on (Garner 1960).
Some empirical results seem to support this theory. For instance, Alwin (1992) considers a set of hypotheses related to this theory of information. Testing them with panel data, he finds that, except for the 2-point scales, ''the reliability is generally higher for measures involving more response categories'' (p. 107). Many articles have discussed the consequences of increasing the number of categories. However, only a limited number of studies compare the quality of scales of different lengths, where quality refers to the strength of the relationship between the observed variable and the underlying construct of interest (e.g., Andrews 1984; Scherpenzeel 1995; Költringer 1993; Alwin 1997; Alwin 2007).
In this article, we discuss the effect of the number of response categories on the quality of AD scales. These scales may behave in a specific way because of the cognitive response process involved (which includes an extra step to map the underlying opinion onto one of the offered response categories). In another study of this issue, Alwin and Krosnick (1991) compared 2-point and 5-point AD scales with respect to quality and found that the 2-point scales had better quality than the 5-point scales.
In our study, we compared 5-point AD scales with longer scales in terms of measurement quality. The study does not test the impact, for instance, of having only the end points labeled versus having all points labeled, nor does it test the impact of asking questions in battery style versus asking them separately. Another distinctive feature of this study is that it involves data collected during the third round (2006-2007) of the European Social Survey (ESS) on large and representative samples in more than 20 countries.
We begin below by describing the analytical method used to assess quality. Then, we describe the ESS data analyzed using this method, the results obtained, and their implications.

Analytical Method
Our analysis involves two steps. The first step is to compute the reliability, validity, and quality coefficients of each item, using a Split-Ballot Multitrait-Multimethod design (SB-MTMM) as developed by Saris, Satorra, and Coenders (2004). The item-by-item results are then analyzed by a meta-analytic procedure to test the hypotheses of interest.
The idea of measuring several traits with different methods (i.e., the MTMM approach) was first proposed by Campbell and Fiske (1959). They suggested summarizing the correlations between all the traits measured with all the methods into an MTMM matrix, which could be directly examined for convergent and discriminant validation. About a decade later, Werts and Linn (1970) and Jöreskog (1970, 1971) proposed to treat the MTMM matrix as a confirmatory factor analysis model, whereas Althauser, Heberlein, and Scott (1971) proposed a path analysis approach. Alwin (1974) presented different approaches to analyze the MTMM matrix. Andrews (1984) suggested applying this model to evaluate the reliability and validity of single-survey questions. Alternative models have been suggested (Browne 1984; Cudeck 1988; Marsh 1989; Saris and Andrews 1991). Corten et al. (2002) and Saris and Aalbers (2003) compared different models and concluded that the model discussed by Alwin (1974) and the equivalent model of Saris and Andrews (1991) fit best to several data sets.
In the classic approach, for identification reasons, each item is usually measured using at least three different methods (e.g., question wordings). However, this may lead to problems if respondents remember their answer to an earlier question when they answer a later question that measures the same construct. This problem has been studied by Van Meurs and Saris (1990).
In the study by Van Meurs and Saris (1990), several questions were repeated after different time intervals in the same questionnaire and after two weeks. The authors first determined how much agreement one can expect if there is no memory effect. This is defined as the level of agreement between the repeated observations that remains stable even if the time lag between the repeated questions is increased. Once this is determined, one can evaluate the minimal time interval between the repetitions necessary to reach the amount of agreement typical of the situation of no memory effect. Van Meurs and Saris found that:
1. People who expressed extreme opinions in the first interview always gave the same answer no matter the time interval between the repeated questions. So enlarging the time interval would not alter the apparent overtime consistency of these people's answers. This is not surprising: These people presumably do not give the same answer because they remember their previous answer and repeat it. It is more likely that they do so because they have highly stable opinions and report them accurately.
2. If a person did not express an extreme opinion, and the questions intervening between the repeated questions were similar to the repeated question, then the observed relation was approximately C = 61 - 1.0T, where C is the percentage of matching answers and T is the time in minutes between the two repetitions. In this case, every extra minute in the time interval reduced the percentage of matching answers by approximately 1 percentage point. This means that after 25 minutes, the percentage of matching answers should be about 36 percent, which Van Meurs and Saris (1990) said is the percentage to be expected if people do not remember their previous answer.
3. If a person did not express an extreme opinion, and the questions intervening between the repeated questions were not similar to the repeated question, then the relationship was as follows: C = 75.4 - 0.50T. In this case, each extra minute of delay before the repeated question reduced the percentage of matching answers by only half a percentage point. Therefore, the level of 36 percent of matching answers would be reached only after about 80 minutes.
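These two decay rules are easy to check numerically. In the sketch below, the coefficients of the first rule are back-calculated from the reported slope (about 1 point per minute) and the 36 percent level reached after 25 minutes, so they are approximate; the second rule is the one given explicitly in the text.

```python
# Sketch of the two memory-decay rules reported by Van Meurs and Saris (1990).
# The first rule's coefficients are reconstructed from the surrounding text
# (slope of about -1 point per minute, ~36 percent matching at 25 minutes);
# the second rule (C = 75.4 - 0.50T) is quoted directly.

def matching_similar(t_minutes):
    """Similar intervening questions: ~1 point lost per minute (reconstructed)."""
    return 61.0 - 1.0 * t_minutes

def matching_dissimilar(t_minutes):
    """Dissimilar intervening questions: half a point lost per minute."""
    return 75.4 - 0.50 * t_minutes

def minutes_to_reach(level, intercept, slope):
    """Solve level = intercept - slope * t for t."""
    return (intercept - level) / slope

print(matching_similar(25))              # ~36 percent after 25 minutes
print(matching_dissimilar(80))           # ~35.4 percent after 80 minutes
print(minutes_to_reach(36, 75.4, 0.50))  # ~79 minutes under the slower decay
```

Under the slower decay, the no-memory level of roughly 36 percent is only reached after about 79 minutes, matching the ''after 80 minutes'' statement above.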
This result has been questioned by Alwin (2011), who studied memory effects by doing a word memory experiment wherein people were exposed to 10 words, and memory was tested immediately after exposure and again after 10 minutes. He concludes (Alwin 2011:282-84) that ''if one looks at the delayed task and focuses solely on those words produced in response to the immediate recall task, the impression one gets is that within the context of the survey, people remember what they said earlier.'' This raises the need to do further research on the topic, to see whether MTMM results are distorted by memory.
Another way to limit the memory problem is to reduce the number of repetitions of the same measures in different forms. This approach, called the split-ballot multitrait-multimethod (SB-MTMM) approach, was developed by Saris, Satorra, and Coenders (2004). In such a design, respondents are randomly assigned to different groups, with each group receiving a different version of the same question. For example, the versions can vary in terms of the number of answer categories offered (e.g., one group receives a 5-point and a 7-point scale; another receives a 7-point and an 11-point scale; and still another receives an 11-point and a 5-point scale). This reduces the number of repetitions: Each respondent answers only two versions of the question instead of three (Saris, Satorra, and Coenders 2004). A memory effect is still possible, but with only two repetitions it is less probable, especially since the time between the first and the second form can be maximized.
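The assignment logic of such a split-ballot design can be sketched as follows. The group labels and scale pairings mirror the example above; the code is a hypothetical illustration, not the ESS implementation.

```python
import random

# Illustrative SB-MTMM design: each split-ballot group answers only two of the
# three scale versions, yet every version still appears in exactly two groups,
# so all pairwise comparisons between methods remain possible.
DESIGN = {
    "A": ("5-point", "7-point"),
    "B": ("7-point", "11-point"),
    "C": ("11-point", "5-point"),
}

def assign_groups(respondent_ids, seed=42):
    """Randomly assign each respondent to one split-ballot group."""
    rng = random.Random(seed)
    return {rid: rng.choice(sorted(DESIGN)) for rid in respondent_ids}

groups = assign_groups(range(9))
for rid, grp in sorted(groups.items()):
    print(f"respondent {rid}: group {grp}, answers {DESIGN[grp]}")
```

Note that each respondent sees only two forms, but pooling the three groups recovers correlations between every pair of methods.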
Using this design and structural equation modeling techniques, the reliability, validity, and quality coefficients can be obtained for each question, as long as at least three different traits are measured and two methods are used to measure each trait in each group. Various models have been proposed; we use the true score model for MTMM experiments developed by Saris and Andrews (1991):
Y_ij = r_ij T_ij + e_ij
T_ij = v_ij F_i + m_ij M_j
where: Y_ij is the observed variable for the ith trait and the jth method; T_ij is the systematic component of the response Y_ij; e_ij is the random error component associated with the measurement of Y_ij for the ith trait and the jth method; r_ij is the reliability coefficient; F_i is the ith trait; v_ij is the validity coefficient; M_j represents the variation in scores due to the jth method; and m_ij is the method effect for the ith trait and the jth method.
The model needs to be completed by some assumptions: The trait factors are correlated with each other. The random errors are not correlated with each other nor with the independent variables in the different equations. The method factors are not correlated with each other nor with the trait factors. The method effects of one specific method M_j are equal for the different traits T_ij. The method effects of one specific method M_j are equal across the split-ballot groups, as are the correlations between the traits and the random errors. Figure 1 illustrates the logic of this model in the case of two traits measured with a single method.
Working with standardized variables, we have r_ij^2 + var(e_ij) = 1 and v_ij^2 + m_ij^2 = 1. It follows that the total quality of a measure is q_ij^2 = r_ij^2 v_ij^2. It corresponds to the variance of the observed variable Y_ij explained by the variable of interest F_i.
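Under these standardization assumptions, quality and the method effect follow directly from the reliability and validity coefficients. A minimal numerical sketch (the coefficient values are illustrative, not estimates from the ESS):

```python
import math

def total_quality(r, v):
    """q_ij^2 = r_ij^2 * v_ij^2: share of the variance of the observed
    variable Y_ij explained by the trait F_i."""
    return (r * v) ** 2

def method_effect(v):
    """With standardized variables, v_ij^2 + m_ij^2 = 1, so m = sqrt(1 - v^2)."""
    return math.sqrt(1 - v ** 2)

r, v = 0.9, 0.92  # illustrative reliability and validity coefficients
print(round(total_quality(r, v), 3))  # -> 0.686
print(round(method_effect(v), 3))     # -> 0.392
```

The example shows how even fairly high reliability and validity coefficients combine into a noticeably lower total quality, since the two are multiplied and then squared.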
As the model in Figure 1 is not identified, it is necessary to estimate the parameters of a slightly more complicated model (a model with more traits and more methods). Figure 2 presents a simplified version of the model, omitting, for the sake of clarity, the observed variables and the random errors associated with each true score.
Figure 1. Illustration of the true score model.
We used the LISREL multigroup approach to estimate the model's parameters (Jöreskog and Sörbom 1991). The input instructions are shown in the Appendix (which can be found at http://smr.sagepub.com/supplemental/). The initial model was estimated for all countries and all experiments, but some adaptations were made for particular countries when misspecifications were present in the models. The main adaptations were freeing some of the method effects (i.e., allowing a method factor to have different impacts on different traits) and fixing a method variance at zero when its unconstrained estimate was negative and not significant. All the adaptations of the initial model in the different countries and for the four different experiments (each column corresponds to an experiment) are available on the Internet. 2 In order to determine what modifications were necessary for each model, we tested for misspecifications using the JRule software (Van der Veld, Saris, and Satorra 2008). This testing procedure, developed by Saris, Satorra, and Van der Veld (2009), is based on an evaluation of the expected parameter changes (EPC), the modification indices (MI), and the power of the MI test. The procedure thus takes into account both type I and type II errors, as shown in Table 1, unlike the chi-square test, which only considers type I errors. Another advantage is that the test is done at the parameter level and not at the level of the complete model, which is helpful for making corrections (for more details about the statistical justification of our approach, see Saris, Satorra, and Van der Veld 2009).
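The decision logic behind this testing procedure can be summarized roughly as follows. This is a simplified sketch of the Saris, Satorra, and Van der Veld (2009) rules, not the actual JRule implementation, and the wording of the verdicts is ours.

```python
def judge_misspecification(mi_significant, high_power):
    """Rough sketch of the decision rules combining the modification index
    (MI) test with the power of that test (Saris, Satorra, and
    Van der Veld 2009). Verdict strings are illustrative."""
    if mi_significant and high_power:
        # The test may flag even trivial misspecifications: judge the EPC size.
        return "inspect EPC to judge whether the misspecification is substantial"
    if mi_significant and not high_power:
        return "misspecification present"
    if not mi_significant and high_power:
        return "no substantial misspecification"
    return "inconclusive: the test lacks power"

for mi in (True, False):
    for power in (True, False):
        print(f"MI significant={mi}, high power={power} -> "
              f"{judge_misspecification(mi, power)}")
```

The point of the table is that a significant MI alone is not enough: with high power, the expected parameter change must be inspected, and with low power a non-significant MI proves nothing.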
We tried, as much as possible, to find a model that fits in the different countries (i.e., to make the same changes for one experiment in the different countries, for instance, to fix the same method effect to zero each time). Nevertheless, this was not always possible, resulting in several models specific to certain countries or groups of countries. However, the differences between the models are often limited.

The ESS Round 3 MTMM Experiments
The ESS is a biennial cross-national project designed to measure social attitudes and values throughout Europe. 3 Third-round interviewing, with probability samples in 25 European countries, 4 was completed between September 2006 and April 2007. The one-hour questionnaire was administered by an interviewer in the respondent's home using show cards for most of the questions. The response rates varied from 46 percent to 73 percent between countries (cf. Round 3 Final Activity Report 5 ). Around 50,000 individuals were interviewed.
The survey administration involved a main questionnaire and a supplementary questionnaire, in which items from the main questionnaire were repeated using different methods. Four MTMM experiments, each involving four methods and three traits, were included in the third round of the ESS. Because of the Split-Ballot design, the respondents were randomly assigned to three groups (gp A, gp B, and gp C). All groups received the same main questionnaire, but each group received a different supplementary questionnaire, which included 4 experiments with a total of 12 questions (4 experiments × 3 traits = 12 repetitions). The four experiments were: dngval, which deals with respondents' feelings about life and relationships; imbgeco, which deals with respondents' position toward immigration and its impact on the country; imsmetn, which deals with respondents' opinions about immigration policies (should the government allow more immigrants to come and live in the country?); and lrnnew, which deals with respondents' openness to the future. Table 2 gives a summary of the variables and methods used in the different Split-Ballot groups. The column ''meaning'' gives the statement for each variable proposed to the respondents in the AD questions. The statement may vary slightly in IS questions. The complete questionnaires are available on the ESS website. 6 The four last columns provide information about the methods used in each experiment. The column ''main'' refers to the method used in the main questionnaire of the ESS (M1): It is therefore a method that all respondents receive. The next three columns indicate the second method that each Split-Ballot group received. Respondents were randomly assigned to one of these Split-Ballot groups (A, B, or C), and therefore each person answered only one of these methods (M2, M3, or M4).
It is important to notice, however, that the methods vary from one experiment to another: That is why in each of the four experiments (which correspond to different rows in Table 2) we can see four distinct methods (each method corresponding to a specific scale: a 5-point AD scale, an 11-point AD scale, etc.).
In all experiments, the 5-point AD scales propose the same categories: ''Agree strongly,'' ''Agree,'' ''Neither agree nor disagree,'' ''Disagree,'' ''Disagree strongly.'' All 5-point AD scales are fully labeled scales with the categories presented vertically, except in one case. By contrast, all 7- and 11-point AD scales are presented as horizontal rating scales and have only the end points labeled: ''Agree strongly'' and ''Disagree strongly.'' The ESS questionnaire never offers the option ''Don't Know'' as a response; the interviewer will only code an answer as ''Don't Know'' if a respondent spontaneously gives this response. Therefore, there are very few such answers: usually less than 2 percent, few enough to be ignored in the analysis.
This design allows comparisons to be made both between repetitions of the questions for the same respondents (e.g., using M1 and one of the three other methods) and between Split-Ballot observations (M2 and M3, or M2 and M4, or M3 and M4). Since the supplementary questions are asked at the end of the interview, some time effect could play a role (a positive impact on quality if respondents learn, or a negative one if they become less attentive and lose motivation) and explain differences in quality between the different measures. Nevertheless, Table 2 shows that for two of the experiments (imbgeco and imsmetn), the variations in the lengths of the scales are present only in the supplementary experiments; therefore, timing is not an issue. In the two others (dngval and lrnnew), the 5-point AD scale in the main questionnaire is repeated in one of the groups in the supplementary questionnaires, so once again we can, and will, focus the analysis only on Split-Ballot comparisons, so that no order or time effect can explain the quality variations.
The first form of the question is presented at the beginning of the main questionnaire and its repetition is presented in the supplementary questionnaire. The main questionnaire contained approximately 240 questions. The repeated question is separated from the original by at least 200 questions. If we assume that people answer three to four questions per minute, the time between the questions is between 50 and 70 minutes. Given that many of the intervening questions are rather similar and the repeated question is in general not the same in form as the first question, a memory effect seems unlikely.
Besides that, memory effects cannot explain the differences found in the measures in the supplementary questionnaires, since all groups receive the same form in the main questionnaire. Therefore, if a memory effect is present, it should be the same for all groups. The only possible difference that can be anticipated is between the groups with an exact repetition and groups getting a different method the second time. In the case of the exact repetitions of the same questions in the main and the supplementary questionnaire, the quality may be higher the second time than with nonexact repetitions. This possibility would need to be kept in mind when interpreting our results.
Finally, it is noticeable that in the experiment called ''dngval,'' a 5-point AD scale is used both in groups A and B. However, these two scales correspond to two distinct methods, because they differ at some other levels: In group A, a battery is used, whereas in group B, each question is separated from the others; in group A, the response categories are presented horizontally, whereas in group B, they are presented vertically. These differences may lead to different quality estimates.

Adaptation of the Data for Our Study
First, we had to select only the observations that could be used for our study. Hungary did not complete the supplementary questionnaire, so we could not include it. Moreover, in some countries, the supplementary questionnaire was self-completed instead of being administered by an interviewer. In those cases, some people answered it on the same day as the main questionnaire, but others waited one, two, or many more days. A time effect may intervene in these circumstances, because the opinion of the respondent can change, so we did not take into consideration the individuals who answered on different days (Oberski, Saris, and Hagenaars 2007). This led us to exclude Sweden from the data, because no one there completed both parts of the questionnaire on the same day. In the other countries, the number of ignored observations (due to completion of the supplementary questionnaire on another day) was not very high, and we still had more than 45,000 observations for our study.
We then converted these data into the correlation or covariance matrices and means needed for each group and experiment. Because we had four methods and three traits, the matrices contain 12 rows and 12 columns. However, these matrices are incomplete, due to the Split-Ballot design: Only the blocks (i.e., correlations or covariances) for the specific methods that each group received are nonzero. These matrices were obtained using ordinary Pearson correlations and the pairwise deletion option of R for missing and ''Don't Know'' values. Results would be different if we had corrected for the categorical character of the questions in the correlation calculation, as indicated in Saris, van Wijk, and Scherpenzeel (1998). However, as demonstrated by Coenders and Saris (1995), the measurement quality estimates would then have meant something different. Indeed, when polychoric correlations are used, 7 it is the measurement of the continuous underlying variable y* that is assessed, whereas when covariances or Pearson correlations are used, it is the measurement quality of the observed ordinal variable y which is assessed. Therefore, ''if the researcher is interested in measurement-quality altogether (including the effects of categorization), or in assessing the effects of categorization on measurement quality, the Pearson correlations should be used'' (Coenders and Saris 1995:141). This is exactly what we want to do, so following the authors' advice, Pearson correlations have been used.
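The pairwise-deletion computation of a Pearson correlation can be sketched as follows. This is a plain-Python illustration with hypothetical 5-point answers; missing and ''Don't Know'' values are coded as None, mirroring the pairwise deletion option used in R.

```python
import math

def pearson_pairwise(x, y):
    """Pearson correlation with pairwise deletion: drop only the cases
    where either of the two items is missing."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical answers on two 5-point items; None marks a missing answer
item1 = [1, 2, 3, 4, 5, None, 2]
item2 = [2, 2, 4, 4, 5, 3, None]
print(round(pearson_pairwise(item1, item2), 3))  # -> 0.943
```

Pairwise deletion keeps a different subset of respondents for each cell of the 12 × 12 matrix, which is why the resulting matrices need not be positive definite and must be handled with care in estimation.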
The matrices for the different experiments and countries were analyzed in LISREL in order to obtain estimates for the coefficients of interest. For details on this approach, we refer to Saris, Satorra, and Coenders (2004). The number of 12 × 12 matrices was 276 (for 23 countries, four experimental conditions, and three split-ballot groups).

Results
We computed the reliabilities, validities, and qualities for each method (four methods each time: M 1 to M 4 ), for each experiment (four experiments: ''dngval,'' ''imbgeco,'' ''imsmetn,'' and ''lrnnew''), each trait (three traits), and in each country (23 countries). This provided 1,104 reliability coefficients, 1,104 validity coefficients, and 1,104 quality coefficients. In order to obtain an overview, it was therefore necessary to reduce and summarize this huge amount of data.
First, we focused on the quality and not on the validity and reliability separately. Second, since we were interested in the AD scales, we kept only the observations for the AD scales when an experiment mixed methods with AD scales and methods with IS scales (cf. note 1 for a definition). Third, because of the possible time effect mentioned previously, and in order to isolate the effect of the length of the scale, we decided to focus only on comparison of the qualities of the Split-Ballot groups. Finally, we did not consider each trait separately, but computed the mean quality of the three traits. Table 3 presents the results obtained from this process. It shows that in only a minority of cases (17 of the 92, or 18 percent) does the mean quality not decrease when the number of points on the scale increases. In other words, the main trend (in 82 percent of the cases) is as follows: the more categories an AD scale contains, the worse its mean quality is.
In order to have a more general view of the effect of the number of points on quality, we also considered the mean quality by number of categories across countries. The last row of Table 3 reflects this information. The decline across countries is quite clear. To come back to the question of potential memory effects, one can notice in this table that the highest quality is found for the 5-point AD scales in the two experiments (''lrnnew'' and ''dngval'') with exact repetitions, which is what one would expect if memory effects lead to reduced errors. However, the general trend is similar in the experiments using a 5-point AD scale in the main questionnaire and in those using IS scales. The same order of quality is found for all four topics, regardless of whether there is an exact repetition or not.
In order to aggregate our findings further, we considered the mean quality across countries, experiments, and methods. This allowed us to make a distinction between reliability and validity while maintaining a clear overview. Table 4 confirms the trend noted above and also shows that when a 7-point AD scale is chosen instead of a 5-point AD scale, the mean quality declines by 0.139. This is quite a substantial reduction in quality, significant at 5 percent (a t test for differences in means gives a p value of .000). Moving from 7 to 11 categories also leads to a decrease in mean quality, but here it is very small (.011) and not significant at 5 percent (p value = .500). Interestingly, the difference between the 5- and 7-point scales is much larger than the difference between the 7- and 11-point scales (not significant), although the difference in the number of categories is smaller (two vs. four). It seems that seven response categories are already too many, and adding more does not produce any noticeable changes.
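The reported differences can be reproduced directly from the mean coefficients quoted in this article. In the sketch below, the 5- and 7-point values are those given in the text; the 11-point values are inferred from the reported differences, so they are approximate.

```python
# Mean total quality (q^2) and validity (v^2) by AD scale length, as quoted
# in the text (Table 4 means). The 11-point entries are inferred from the
# reported differences (0.011 for quality, 0.024 for validity).
quality = {"5AD": 0.533, "7AD": 0.394, "11AD": 0.394 - 0.011}
validity = {"5AD": 0.753, "7AD": 0.555, "11AD": 0.555 - 0.024}

print(round(quality["5AD"] - quality["7AD"], 3))    # -> 0.139 (large, significant)
print(round(quality["7AD"] - quality["11AD"], 3))   # -> 0.011 (negligible)
print(round(validity["5AD"] - validity["7AD"], 3))  # -> 0.198 (drives the quality loss)
```

The pattern is the same in both rows: the big drop happens between 5 and 7 categories, while going from 7 to 11 changes almost nothing.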
Looking at reliability and validity separately, one can see the robustness of reliability to variations in the number of categories (t tests show that there are no significant differences between the three means, with p values of .93 and .66, respectively, for the tests between the 5- and 7-point and the 7- and 11-point scales). Validity, however, is, like quality, quite sensitive to the number of categories: The difference in means between a 5- and a 7-point scale is quite large (0.198) and significant at 5 percent, whereas the difference between a 7- and an 11-point scale is very small (0.024) and not significant. The reduction in total quality is clearly due to the decrease in validity. Since v_ij^2 = 1 - m_ij^2, this means that the method effects increase as the number of categories increases, causing the observed quality loss.

Discussion and Further Research
The quality coefficients computed above show that the same trend clearly appears at different levels of aggregation: On an AD scale, the quality decreases as the number of categories increases, so that the best AD scale is a 5-point one. This contradicts the main statement of the theory of information, which, as mentioned previously, argues that more categories mean more information about the variable of interest. In terms of measurement quality, 5-point scales yield better data. Our suggestion is, therefore, to use 5- and not 7-point scales.
This result is noteworthy because the choice of the number of response categories consequently affects the correlations between variables. For example, if we focus on two factors (e.g., the first two traits of the ''imbgeco'' experiment), as shown in Figure 1, the correlation between the observed variables is given by: ρ(Y_1j, Y_2j) = q_1j q_2j ρ(F_1, F_2) + r_1j r_2j m_1j m_2j. If we assume that r_1j = r_2j, v_1j = v_2j, and m_1j = m_2j, and that the true correlation is ρ(F_1, F_2) = 0.4, then: ρ(Y_1j, Y_2j) = 0.4 q^2 + r^2 (1 - v^2). If a survey uses a 5-point AD scale, using that scale's mean quality given in Table 4, the expected correlation between the observed variables is: ρ(Y_1,5AD, Y_2,5AD) = 0.4 × 0.533 + 0.717 × (1 - 0.753) = 0.213 + 0.177 = 0.39. The first term of the sum illustrates the decrease in the observed correlation due to the relatively low quality. The second term shows the increase in the observed correlation due to the high method effects. However, if another survey asks the same questions but uses a 7-point AD scale, the observed correlation becomes: ρ(Y_1,7AD, Y_2,7AD) = 0.4 × 0.394 + 0.716 × (1 - 0.555) = 0.157 + 0.318 = 0.48. Now the first term is even lower, since the quality is lower, whereas the second term is higher, since the method effects are higher; overall, this leads to a higher observed correlation. For the 5-point scale, 0.177 of the observed correlation is due to the method and has no substantive relevance; for the 7-point scale, as much as 0.318 is due to the method. This example is simplistic because only the mean quality is used. Of course, depending on the specific traits of interest and on the country studied, the effects might be smaller or larger than those computed. However, it gives an idea of the importance of the chosen scale and its possible consequences for the analysis: Depending on the method, even if the true correlation is the same, the observed correlations may differ; they may also differ from the true correlation.
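The arithmetic of this example can be verified with a short script; the coefficients are the mean values for the 5- and 7-point AD scales quoted above, and the true correlation of 0.4 is the same illustrative assumption as in the text.

```python
def observed_correlation(true_corr, q2, r2, v2):
    """Split an observed correlation into a trait part (the true correlation
    attenuated by quality q^2) and a method part (r^2 * m^2, with
    m^2 = 1 - v^2 under the standardization assumptions)."""
    trait_part = true_corr * q2
    method_part = r2 * (1 - v2)
    return trait_part, method_part, trait_part + method_part

# Mean q^2, r^2, v^2 for the two scale lengths, as quoted in the text
coefficients = {"5-point AD": (0.533, 0.717, 0.753),
                "7-point AD": (0.394, 0.716, 0.555)}

for scale, (q2, r2, v2) in coefficients.items():
    trait, method, total = observed_correlation(0.4, q2, r2, v2)
    print(f"{scale}: {trait:.3f} (trait) + {method:.3f} (method) = {total:.2f}")
```

Running this reproduces the paradox described above: the 7-point scale yields the higher observed correlation (about 0.48 vs. 0.39), but a larger share of it is pure method variance.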
The decomposition of the observed correlation also demonstrates that this correlation is inherently unstable, because it depends on a combination of quality and method effects.
Because the decrease in total quality is mainly due to a decrease in validity, method effects are greater when the number of response categories is higher. This can be explained by a systematic but individual interpretation and use of AD scales: each person uses the scale differently from other people, but a given person uses it in the same way when answering different items. Because more categories allow more variation in personal interpretations of the scale, providing a scale with more categories leads to stronger method effects, and hence to lower validity and lower quality.
The results are quite robust across countries, experiments, and traits. It is therefore possible to give some general advice: regardless of the country and the topic, and contrary to what information theory suggests, there is no gain in information when an AD scale with more than five categories is used. There is, instead, a loss of quality. That is why, if AD scales must be used, we recommend that they contain no more than five response categories.
However, this study has some limitations. Even though the amount of data used is large, the specific design of the available experiments still limits the possible analyses. There are two specific points (impossible to test in our study because the necessary data were unavailable) that we think should be examined. The first is the value of testing other numbers of categories. In the third round of the ESS, only 5-, 7-, and 11-point scales were included in the MTMM experiments. This is too limited. An 8- or 9-point scale may confirm the tendency that adding response categories does not improve quality, but this should nonetheless be tested. A test of scales containing fewer categories would also be particularly interesting. Perhaps the tendency is not the same when there are very few categories. For instance, is a 2-point scale (''Disagree'' vs. ''Agree'') better than the 5-point scale used in ESS round 3? As mentioned previously, such a comparison was done by Alwin and Krosnick (1991), who found that the 2-point scale had better quality than the 5-point scale. However, one should also consider that such a dichotomous scale, lacking a middle category, may lead to a higher nonresponse rate. We do not know what happens if 3- or 4-point scales are used, so further research is required to discern the optimal number of categories for AD scales. Since we had no data to test this, we must state our conclusion more precisely: an AD 5-point scale appears to be better than an AD 7- or 11-point scale, so employing more than five categories in an AD scale is not recommended, although scales with even fewer categories might perhaps yield better quality and validity.
The second point concerns labeling. In round 3 of the ESS, the 5-point scale was always completely labeled, whereas only the end points of the 7- and 11-point scales were labeled. The comparison of 7- and 11-point scales can therefore be made ceteris paribus and, as mentioned previously, shows no significant difference in the measurement's total quality. However, in the comparison between the 5-point scale, on one hand, and the 7- and 11-point scales, on the other, we cannot distinguish the effect of the number of categories from the effect of labeling.
Previous research nevertheless gives some information about the potential effect of labeling on quality. Andrews (1984), using an MTMM approach and model, finds a negative impact of labeling: reliability is lower for fully labeled scales than for partially labeled ones. Alwin's (2007:87-88) MTMM studies comparing fully and partially labeled scales likewise show a negative effect of full labeling on quality (b_t). But Alwin (2007:200-2) also reports analyses of panel study data using a quasi-simplex model for estimation; there, the effect of labeling is positive. These analyses, however, do not control for other elements of question design. Saris and Gallhofer (2007), who control for many other characteristics in their meta-analysis, find a positive impact of labels: when a completely labeled scale is used instead of a partially labeled one, the reliability coefficient increases in general by 0.033, whereas the validity coefficient decreases by 0.0045. This result is in line with findings reported by Krosnick and Berent (1993).
We used Saris and Gallhofer's MTMM results together with the reliability and validity found in our study for a partially labeled 7-point AD scale (cf. Table 4) to compute the anticipated quality of a completely labeled 7-point AD scale. The expected reliability coefficient is the mean reliability coefficient found in our study for a 7-point scale with only the end points labeled, plus the increase expected if all points were labeled, based on Saris and Gallhofer's estimate. A similar formula applies to the validity coefficient. Finally, we have:

q²_7pts, all labels = (√0.716 + 0.033)² × (√0.555 − 0.0045)² = 0.424
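This arithmetic can be reproduced in a few lines. The sketch below takes square roots of the Table 4 estimates (r² = 0.716, v² = 0.555) to recover the coefficients before applying Saris and Gallhofer's labeling adjustments:

```python
import math

# Projected total quality (q^2 = r^2 * v^2) for a fully labeled 7-point
# AD scale: adjust the coefficients estimated for the partially labeled
# scale by Saris and Gallhofer's meta-analysis estimates.

r_coef = math.sqrt(0.716) + 0.033    # reliability coefficient, adjusted up
v_coef = math.sqrt(0.555) - 0.0045   # validity coefficient, adjusted down
quality = r_coef**2 * v_coef**2      # expected total quality

print(round(quality, 3))  # 0.424
```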