What is the K scale really about?


Alex B. Caldwell, Ph.D.


Almost immediately in the development of the basic MMPI scales - in the early 1940's - it became apparent that answering such an array of personal items is inevitably subject to biases from the varied attitudes and approaches that subjects take. One alternative was for the clinician to make approximate judgmental adjustments, avoiding unfortunate over-interpretations of the scores of self-critical respondents as well as serious under-interpretations of the scores of guarded and defensive respondents, but the unreliability of such judgments introduced an unacceptable amount of error. The test’s developers felt, therefore, that the scales they had developed had to have measured adjustments to "correct" for these biases. In a highly regarded and remarkably thought-provoking talk, Meehl and Hathaway (1946; republished in Dahlstrom & Dahlstrom, 1980) detailed their efforts to quantify the potentially distorting effects of such biases and attitudes. In their article, "The K factor as a suppressor variable," they published the development of all three basic "validity" scales, L, F, and K. Here I will discuss the origins of L and F briefly and then K in more detail.


Where did the L scale come from?


The L scale was an explicitly a priori scale influenced in part by prior studies of honesty in grade school students. For this scale the challenge was to create a set of statements that were: (1) usually too good to be true and (2) rarely responded to by normal subjects. In the aggregate, then, it would be extremely rare for anyone to sincerely answer a large proportion of them in the scored direction. (As a marker of their success, there are only 15 items. I can recall having seen a raw score of all 15 items three or four times in my life from tens of thousands of profiles.) This scale immediately became useful in detecting efforts to "look good" in less sophisticated subjects, but it was quite uneven to ineffective with college-educated subjects. In my experience, it is also confounded by having two contrasting sources: (1) deliberate faking good and (2) a naive properness in less educated persons who may have high, rigid, and literal-minded religious beliefs or other strict personal values. (Because the second of those two contingencies is sincere responding, I never use "Lie scale" as the automatic designation of the L scale.)


What is the origin of the F scale?


The F scale was also a non-criterion group scale. It was the selection of 64 items (60 in the MMPI-2) answered in the scored direction (true or false) by fewer than 10% of their normal sample, and usually by fewer than 5%. It is an "infrequent response" scale; the letter F could be understood as standing for "frequency/infrequency," but that seems clumsy. Better, I believe, just to think "the F scale," as with "the L scale." The premise was basically that a large number of rarely made responses alerts us to the fact that something may be going wrong with either the person’s approach to the items or with our tabulation of the person’s responses. This scale can also be elevated by a variety of biases and sources of distortion. For present purposes the following list is what I believe to be the three main operating elements: (1) deliberate attempts to "look bad," (2) marginal or overtly psychotic ideation, and (3) socioeconomic status/education (discussed below). Secondarily there are such factors as limited literacy, intoxication when taking the test, non-psychotic idiosyncrasies of thinking, perhaps mistaken understandings of the instructions, etc.
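The infrequency logic behind F can be illustrated with a minimal sketch. Everything in it is hypothetical: the toy response data, the function names, and the way ties are handled are assumptions made for illustration, not the original item-selection procedure. It only shows the general idea of keying an item in whichever direction fewer than 10% of a normal sample chose and then counting a respondent's rare answers.

```python
def select_infrequent_items(responses, threshold=0.10):
    """Return {item_index: rare_direction} for items with a rarely given answer.

    responses: list of per-subject answer lists, each answer True or False.
    An item is keyed when one direction (True or False) is given by fewer
    than `threshold` of the subjects; that rare direction is the scored one.
    """
    n_subjects = len(responses)
    n_items = len(responses[0])
    key = {}
    for item in range(n_items):
        true_rate = sum(1 for subject in responses if subject[item]) / n_subjects
        if true_rate < threshold:
            key[item] = True          # "True" is the rare answer
        elif (1 - true_rate) < threshold:
            key[item] = False         # "False" is the rare answer
    return key


def raw_infrequency_score(answers, key):
    """Count the keyed items a respondent answered in the rare direction."""
    return sum(1 for item, direction in key.items() if answers[item] == direction)


# Hypothetical toy sample: 20 "normal" subjects answering 3 items.
normals = [[False, True, i % 2 == 0] for i in range(20)]
normals[0][0] = True    # item 0: 1 of 20 subjects answers True (5%)
normals[1][1] = False   # item 1: 1 of 20 subjects answers False (5%)

key = select_infrequent_items(normals)
score = raw_infrequency_score([True, False, True], key)
```

In this toy sample, items 0 and 1 are keyed and item 2 (answered True by half the sample) is not, so a respondent giving both rare answers receives a raw score of 2.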


Besides Potassium, what does K stand for?


The development of the K scale was far more complex. The idea of a non-obvious scale to measure the overall tendency to "look good" or "look bad" seemed reasonably straightforward, but the requirements they put on it made it anything but. These included: (1) They wanted a scale that "worked at both ends," i.e., effectively discriminated both "looking good" and "looking bad" (many of their preliminary scales worked reasonably well in one direction but less well in the other). (2) They wanted a rigorous empirical selection of the scale items from a large item pool, partly because, "Those items whose significance would not have been guessed by the test-maker will then be equally mysterious to the testee" (see the "K factor article" in Dahlstrom & Dahlstrom, 1980, p. 86). (3) Since their goal was a scale to be used to correct their clinical scales for ever-present "look good" or "look bad" tilts or dispositions, they wanted the correction scale not to be weighted with or biased by psychopathology - ideally not at all. This combination of requirements - especially the last - set up a major set of hurdles.


In over two years of intensive work, they developed an untold number of experimental scales (too many, they wrote, to report in detail). There were conscious, i.e., by instructions, fake good and fake bad scales, and they also generated presumptively self-negative scales (functioning normals with disturbed profiles) and presumptively self-favorable scales (psychiatric inpatients with normal range profiles), the latter two sets making assumptions as to the direction of distortion but not as to the extent of conscious intentionality. They finally settled on a 22 item scale derived from a group of 50 inpatients, mostly with diagnoses of "psychopathic personality, alcoholism, and allied descriptive terms indicating behavior disorders rather than neuroses" (Dahlstrom & Dahlstrom, 1980, p. 99) as the best of the lot, although they adamantly stressed that it was the performance of the set of items that mattered and not the group of origin. This 22 item scale was designated L6 as a variation that happened to have a requirement of L at or over T-60. (This scale barely beat out scale N, which was Meehl’s own doctoral dissertation, but which scale contained a bit too much loading of psychopathology.)


This preliminary scale L6 still had a serious defect: a subset of (more or less) psychotically disturbed and severely depressed patients consistently got low raw scores reflecting their very low self-esteem. Therefore, L6 still remained undesirably influenced by psychopathology. From their scales for conscious distortion, they set out to identify a set of items that were not influenced by instructions to fake in either direction. From this set they found eight items that nevertheless discriminated the severely disturbed patients from the normals. These eight were scored in the patient response direction. The effect of adding these eight items to the 22 L6 items was to bring the average patient raw score on the 30 items back up to that of the average score of normal subjects. Thus, the final K scale and the K-correction appear to be no more than minimally affected by the presence of psychopathology.


I would see the goal of the K-correction in effect as being to identify an optimal estimation of what the T-score would have been had the person been straightforward. K then is operating as a threshold for the reporting of self-negative feelings and socially problematic attitudes. The basic clinical scale items to which the high K person responded in the scored direction despite a strongly self-favorable bias (whether conscious or not) would then, in effect, carry much more weight per item, that is, reporting distresses and shortcomings despite a strong reluctance to do so. This is compensated for by adding an above average amount of K. Similarly, the basic scale items responded to by someone with a low K score have a much lower threshold for their admission. These should be given less weight, and this occurs as the consequence of adding a smaller than average amount of K. This has a balancing and I believe beneficially homogenizing effect on who is included in which codetype: it becomes the optimal estimation as to which is the person’s appropriate codetype. Note that the non-K-corrected codetypes would quite often be different from the K-corrected (Wooten, 1984). There is little or no research on those non-K codetypes, and their test results would be much more confounded by test-taking-attitudes than are the K-corrected codetypes.
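As a rough illustration of how "adding an above average amount of K" plays out arithmetically, the sketch below applies the conventional K fractions from standard MMPI scoring (0.5 K to Hs, 0.4 K to Pd, 1.0 K to Pt, 1.0 K to Sc, 0.2 K to Ma). Those fractions are not listed in this article and are supplied here from the standard scoring convention; the raw scores are invented, the function name is hypothetical, and the rounding of fractional K used in published scoring tables is deliberately left out.

```python
# Conventional K fractions from standard MMPI scoring; not stated in this article.
K_FRACTIONS = {"Hs": 0.5, "Pd": 0.4, "Pt": 1.0, "Sc": 1.0, "Ma": 0.2}


def k_corrected_raw(raw_scores, k_raw):
    """Add the scale's fraction of the raw K score to each K-corrected scale.

    Scales without a K fraction (for example D) are returned unchanged.
    Rounding of the fractional K, as done in scoring tables, is omitted.
    """
    corrected = dict(raw_scores)
    for scale, fraction in K_FRACTIONS.items():
        if scale in corrected:
            corrected[scale] = raw_scores[scale] + fraction * k_raw
    return corrected


# Two hypothetical respondents with identical raw clinical scores
# but different raw K scores (a guarded and an open test-taking approach).
same_raw = {"Hs": 4, "Pt": 10, "D": 20}
guarded = k_corrected_raw(same_raw, k_raw=22)
open_style = k_corrected_raw(same_raw, k_raw=8)
```

With identical raw answers, the guarded respondent's Hs and Pt come out at 15 and 32, against 8 and 18 for the open respondent: the same admissions count for more when they were made over a higher threshold, while D is left as it was.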


It was not factor derived; why was it called the K factor?


In the "K factor" article Meehl and Hathaway (1946; see Dahlstrom & Dahlstrom, 1980) went on to a factor analysis of a curious group of clinical and arbitrary "variance analyzing" scales. In this analysis the K scale emerged as central to a single factor with negligible residuals. They then went on to argue that there is too much imprecision in our measurement of personality to sacrifice any accuracy for the sake of internal consistency, i.e., factorial purity. Indeed, they argued that, "From both the logical and statistical points of view, the best set of behavior data from which to predict a criterion is the set of data which are among themselves not correlated." (op cit., p.117; see also McGrath, 2005). They fundamentally rejected the construction of personality scales on a factor analytic basis, and they concluded, "Since scales are so very ‘impure’ at best, there does not seem to be any very cogent reason for sacrificing anything in pursuit of the rather illusory purity involved." (op cit., p. 116). To my awareness, this carefully developed argument has never been refuted; instead it has been ignored for decades with endless factor-analytic (high alpha) test construction efforts, up to and including the recent Restructured Clinical or "RC" scales. Note also how few tests based on factorial scales have gained extensive clinical usage in personality assessment. I would urge everyone seriously involved with the MMPI-2 - above all if teaching or supervising its use - to study the K factor article and make their own decisions regarding these arguments. I believe this is the most important article ever written to understand what has made the MMPI so unique (see Dahlstrom & Dahlstrom, 1980, or contact us for a copy).


What correlational properties potentially affect interpretation of the K scale?


The Caldwell clinical data set (1997) is a mixture of clinical cases plus a good scattering of mildly disturbed and relatively normal subjects, a total of 52,543 individual protocols. The sample is significantly overeducated by the census but significantly less so than the MMPI-2 normative sample, of which latter 45% had graduated from college and 18% of the total normative sample had done postgraduate work.


In this data set, the K scale correlated .65 with the socioeconomic status scale (Ss, Nelson, 1952). This K to Ss correlation suggests that approximately 42% of the variance of K is due to SES and similarly, of course, education (note the analyses of this data set in Greene, 2000). The correlation of K with Mp (Malingering Positive, Cofer, Chance & Judson, 1949) was .50, suggesting that about 25% of the K variance can be explained by conscious defensiveness as measured by the Mp scale. Wiggins’ Sd (social desirability, 1959) correlates .28 with K, which might be another 8% except that Mp and Sd correlate .75; their combined contribution to the K variance would be slightly over 25%. The correlations of Mp and Sd with Ss are quite low and thus SES and conscious defensiveness are essentially independent of each other. Curiously, scale R (Welsh, 1965) correlates .30 with K and is negligibly or even negatively correlated with these other three scales, so it is close to a 10% contribution to the variance of K, and it is almost entirely independent of both SES and conscious defensiveness.
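The percentages quoted above are simply the squared correlations. A short sketch of that arithmetic follows; the correlations are the ones reported in the text, while the variable and function names are only illustrative.

```python
# Correlations of each scale with K, as reported in the text.
correlations_with_k = {"Ss": 0.65, "Mp": 0.50, "Sd": 0.28, "R": 0.30}


def shared_variance_percent(r):
    """Percent of variance shared by two measures correlating r (r squared)."""
    return round(100 * r * r)


shares = {scale: shared_variance_percent(r)
          for scale, r in correlations_with_k.items()}
# shares == {"Ss": 42, "Mp": 25, "Sd": 8, "R": 9}
```

These shares cannot simply be summed across scales: because Mp and Sd correlate .75 with each other, much of the 8% for Sd overlaps the 25% for Mp, which is why their combined contribution is only slightly over 25%.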

An almost totally unappreciated point is that K has been correcting for the impact of SES from the time of its invention. The widespread use of reasonably educated and bias-free samples, together with their usually middle class or higher SES, in most of the negative studies on the K-correction has operated to conceal this function of K. In addition, this is consistent with the correlation of Ss with the F scale: an almost startling -.77 (approaching 60% of the variance of F!). This shows how lower SES subjects do not learn what not to say whereas well educated subjects are in effect trained in what not to say as well as what you do say and just how you say it. I believe that understanding these relationships can considerably expand our understanding as to in what ways and how broadly the K factor is crucial to our interpretation of MMPI profiles.


In normal subjects (with no motives to bias their responses) the scores on K are quite stable over time. The longest term of followup of which I am aware is by Leon, Gillum, Gillum, and Gouze (1979) as part of a longitudinal cardiovascular research study. The 30 year reliability of K was .434; of the thirteen basic scales this was exceeded only by 0-Si, 5-Mf, 9-Ma, and 2-D in decreasing order. For the five partial interval retestings, ranging from 6 to 24 year intervals, the reliabilities of K ranged from .502 to .673 with three of the five over .60. Given that these subjects had no incentive to distort, this stability would be attributable to the effects of socioeconomic status and the emotional reserve of the Welsh R scale; both of these are attributes that one would expect to be reasonably stable over longer periods of time. Circumstantial demands to look good or bad would, of course, fluctuate by the occasion, so their impact - had it been present - would have led to much lower long term reliabilities, but the absence of bias here led to correlations that are high considering the lengthy time intervals.


In his discussion of the K-correction, Greene (2000) comments on studies in which K was interpreted as a measure of personality integration and healthy adjustment (p. 95). This is directly consistent with the r of .65 for Ss with K, especially given samples of subjects with no incentive to distort or bias their responses. By many lines of evidence higher levels of SES and education would be expected to be associated with better personality integration. The component of consciously biased responding on K in effect was essentially inactive in these studies. Thus the assertions of an association of K with personality integration and healthy adjustment are validated mainly as a function of SES.


Could they validate the K-correction?


In order to test whether this new 30 item K scale was working as a correction scale, they did several experimental sequences with profiles falling between T-65 and T-80. They regarded this as the problematic or critical range since scores over T-80 would almost always be pathological, and scores below T-65 would be too low to be elevated to an assured level of belonging in the psychopathologic range. They then created four mixed sets of patient plus normal profiles and cut each batch at K over versus K at or below the T-50 average. The hypothesis was that those scoring above T-50 K would disproportionately be defensive patients and those below would more often be self-critical normals. Each of the four sets of data supported this hypothesis; recognizing the major difficulties in fully cross validating such a complex scale, they felt this at least showed that the K scale was working in the direction in which it should.


Is the K-correction still working?


There have been objections to the continued use of the K correction, with statements such as, "The bad news is that the K-correction doesn’t do anything; the good news is that it doesn’t do anything." The argument then is that it should be abandoned. The main anchoring data of such assertions have mostly been in studies of subjects who responded straightforwardly with no identifiable or consistent-across-subjects incentives to bias or distort their responses. These studies define circumstances in which the K-correction indeed has little to do - especially among reasonably educated subjects responding straightforwardly. These are precisely not the groups for whom the K-correction was designed.


A problem in at least a few of these studies was that the "criterion" has been a rating sheet filled out based on a single session with the subject. The ratings then are based very largely on what the person just said. But the non-K-corrected scale is also what the person just said. This operates as an experimental bias in favor of the non-K-corrected scale. For example, Barthlow et al. extended this to three hours (three sessions) which is a significant improvement in discovering the person’s self-deceptions, role playing, and other sources of a potential misfit of self-reporting. Three sessions is still much less "tuning in" to the subject’s potential biases and distortions than would be expected from a month’s admission to a University Hospital inpatient service, which was modal in the origin of the MMPI. Considering this experimental bias, it is perhaps a bit surprising that in Barthlow et al.’s data some of the correlations with the K-corrected scores exceeded at all the correlations with the non-K-corrected.


Putzke, Williams, Daniel, and Boll (1999) tested 61 patients with end-stage lung disease waiting for lung transplantations with considerable uncertainty whether a lung might become available in time to save their lives. The context defined a strong "pull" to appear psychologically healthy and deserving of priority. The 30 patients with higher scores on K (using a median split) obtained significantly lower raw scores on all of the K-corrected scales as well as Si (K-to-scale overlapping items having been deleted). After K-correction there were no significant differences except that Hs appeared possibly to have been a bit over-corrected. In this setting where there was a clear and consistent incentive to "look good," their data strongly supported the use of the K-correction. This was precisely the sort of group for whom the K-correction was designed. As the Putzke et al. study illustrates, the K-correction is working very well when and where it should: when the person has a strong incentive to bias the test responses in order to look too healthy or too disturbed.


In civil forensic actions such as child custody and denied employment, there is an almost universal incentive to appear healthy. In personal injury and workers compensation as well as criminal trials, there can be strong incentives to look damaged or impaired. Thus in contexts where such biasing is so consistently present, the utility of the MMPI would almost invariably be reduced without the K-correction: defensive profiles would be underinterpreted and exaggerated profiles would be overinterpreted. If K went uncorrected, then in the resulting confusion as to who was distorting in what direction and how much, the court’s trust of the MMPI would soon be severely damaged.




