Paper Abstracts

Day 2, April 5, Tuesday Morning

Session 1-A (Plenary Session)
Fellowship Hall   8:30 am to 10:00 am

What should be our “Foundations,” and why should we care anyway?   Mark Wilson

In this presentation, I will examine what has typically been seen as the foundations of Rasch measurement, and suggest that the scope of this should be widened to position our work as a scientific rather than a statistical enterprise.  I will discuss why I think that this sort of “navel gazing” is not only healthy, but particularly important at this juncture in the history of measurement in the arena of social, educational and health sciences.

The validity of validity: A comparative perspective on legal and test theories behind Debra P.    Brent Duckor

Employing a cross-institutional perspective, we explore the concept of validity as it occurs in two different literatures: constitutional legal theory and psychometric test theory. We recast the Hart/Dworkin debate in legal theory and the Shepard/Popham debate (1997) in test theory as mutually reinforcing views of the validity enterprise. We note how the conceptualization of validity in psychometrics, and in the Testing Standards in particular, has been influenced by landmark legal cases, e.g., Debra P. v. Turlington. Through a comparative and historical analysis of these debates/discourses, we hope to arrive at a better understanding of the uses and limits of validity as a theoretical construct within the Rasch community.

Binary Discourses, Validity, and the Assessment of Writing     Nadia Behizadeh, George Engelhard

Writing assessment can be characterized as a community endeavor in which multiple, diverse voices need to be considered. However, the educational measurement community and the writing research/composition community tend to utilize different and potentially contradictory epistemologies, methods, and constructs related to validity. Drawing upon key theoretical research, three overlapping binaries have been identified: positivist-postpositivist epistemologies (including tension over objectivity and invariant measurement); quantitative-qualitative validation methods; and psychometric-sociocultural approaches to validity. The purpose of this study is to analyze these perennial binary discourses related to validity and offer suggestions for developing a framework for conceptualizing validity in writing assessment that reflects the full spectrum of these positions.

Session 2-A: Objectivity and Reliability

Room B102 10:30 am to noon

Specific objectivity and the interpretation of item parameters under three versions of the Rasch Model.    Ernesto San Martin

Three versions of the Rasch model are considered: the fixed-effects model, the random-effects model with normal distribution, and the random-effects model with unspecified distribution. For each of the three, we discuss the meaning of the difficulty parameter, starting each time from the corresponding likelihood and the resulting identified parameters. Because the likelihood and the identified parameters differ depending on the model, the identification of the parameter of interest also differs, with consequences for the meaning of the item difficulty. Following Rasch, specific objectivity deals with the invariance of the difficulty parameters with respect to changes in the population of examinees. In this talk, we will explore which interpretation of difficulty parameters fits with specific objectivity as developed by Rasch.

On the Law of Supply and Demand: Hayek’s True Individual, Models, Measurement, and the Efficient Markets Hypothesis.   William P. Fisher, Jr.

Economic models of supply and demand are increasingly criticized for requiring the assumption that individuals have complete information on costs before making a decision on the price they are willing to pay. Hayek (1948, p. 54) offers an alternative approach, asking, “how can the combination of fragments of knowledge existing in different minds bring about results which, if they were to be brought about deliberately, would require a knowledge on the part of the directing mind which no single person can possess?” Partly in response to Hayek, answers to the question of how efficient markets are constituted can be framed in terms of the institutional roles, rules, and responsibilities through which transaction costs are reduced and through which stakeholders are able to formulate and execute long-range plans of action. Not yet addressed, however, are the needs (1) to reformulate the law of supply and demand as a measurement model informed by substantive explanatory theory and that supports instrument calibration, and (2) to extend this reformulation and integration along with the institutional perspective into human, social, and natural capital markets. Fundamental to this reformulation are clear distinctions between Hayek’s theory of a socially-situated, “true” individualism and isolated, self-contained individuals, and between individual-level measurement models and group-level statistical models. The goal is to integrate these models for application to human, social, and natural capital.

Classical Test Theory and Hilbert Spaces.   Danny Avello,  Ernesto San Martín

IRT models are typically specified in a random-effects framework. By doing so, IRT models are embedded in the Classical Test Theory (CTT) framework and, consequently, standard CTT indicators, such as reliability, as well as the prediction of the true score (ability), are estimated taking into account the IRT model specification. In this talk we analyze the general relationship between the reliability and the prediction of the true score. To do that, we use the formalization of CTT through a Hilbert space framework (see Zimmerman, 1975). Using the dual operator of the true score (namely, the prediction of the ability), we define what we call a "dual reliability" and we prove that it is equivalent to the standard concept of reliability. Using an orthogonal decomposition, we obtain a general Spearman-Brown condition. Using these tools, we prove that the higher the reliability of a test, the better the prediction of the ability.

Philosophical and Practical Approaches to Measure What Psychologists Want to Measure.   Jinho Kim

The purpose of this study is to understand psychological measurement in the light of the philosophy of science and its practice. I consider the philosophy of measurement as common ground between psychological measurement and the philosophy of science. Measurement is an essential tool for our thinking as well as for describing the world. As a starting point for setting up a philosophy of measurement through conceptual analysis, I examined assumptions about the ontology and epistemology of knowledge. I found that I hold ontological realism and epistemological constructivism for both the physical and psychological sciences. On the basis of these assumptions about knowledge, I found “constructive realism” to be a supportable philosophy of measurement. According to the conceptual analysis, constructive realism works well in both the physical and psychological sciences. As for practice supported by constructive realism, I found that the four building blocks of the BEAR Assessment System are a sound practical framework for psychological measurement. They form a cycle of the measurement process, and the whole process is embodied within the nomological network of the conceptual foundations of constructive realism. Constructive realism and the BEAR Assessment System building blocks can give us both coherent and differentiated accounts of the measurement process in the psychological and physical sciences.


Session 2-B: Polytomous Models and Anchoring
Room B103 10:30 am to noon

Georg Rasch and Benjamin Wright’s struggle with the unidimensional polytomous model which has sufficient statistics for its parameters.   David Andrich 

This paper reproduces correspondence between Georg Rasch of The University of Copenhagen and Benjamin Wright of The University of Chicago in the decade from January 1966 to July 1976 as they explain their struggle to operationalize a unidimensional polytomous model with sufficient statistics. The paper then explains the original approach taken by Rasch, Wright and Andersen, and then how, from a different tack originating in 1961 and culminating in 1978, three distinct stages of development led to the current relatively simple form of the model. The paper shows that over this period of almost two decades, the demand for sufficiency of a unidimensional parameter of the object of measurement drove the development of the model.

A Polytomous Rasch Model Guaranteed to Produce Ordered Thresholds.   Chris Bradley 

A major problem with current polytomous Rasch models is that they often estimate disordered thresholds. This can be shown to be inconsistent with the assumption that every person uses ordered thresholds to rate items. Here, we present a polytomous Rasch model that naturally produces ordered thresholds. Conceptually, the innovation is to infer the probabilities of different observations from threshold distributions that we treat as post-hoc distributions. In other words, we never randomly sample from these threshold distributions, in contrast to current models where random sampling from multiple overlapping distributions often leads to disordered thresholds.

JMLE estimation with anchored (fixed) item and/or person Rasch measures.   John Michael Linacre 

Anchored Rasch measures are used in item banking, test equating, CAT and many other situations. JMLE estimation is uniquely flexible in that some or all items and/or persons and/or Andrich thresholds can be anchored. Unanchored items, persons and thresholds can then be estimated. This has been done since 1990 in Winsteps and its predecessor software, and in part earlier.

However, test plans are becoming ever more complex with increasingly sparse data. Consequently earlier JMLE estimation methods with anchored elements are producing less and less satisfactory estimates for unanchored elements.

After considerable thought and experimentation, a better estimation method has been devised, tested and implemented in Winsteps. The method is described in the hope that developers of new software, such as R modules, will incorporate anchoring into their Rasch routines.


Day 2, April 5, Tuesday Afternoon

Session 3-A: Analysis of Fit 

Room B102    1:30 pm to 3:00 pm

Using a Tukey-Hann Procedure to Estimate Empirical Item Response Functions to Evaluate Model-Data Fit of the Rasch Measurement Model.    Jeremy Kyle Jennings

This study describes an approach for examining model-data fit for the dichotomous Rasch Model using a Tukey-Hann (T-H) procedure to estimate empirical Item Response Functions (IRFs).  Tukey proposed an approach for smoothing data called Hanning (Tukey, 1977).  Douglas and Cohen (2001) proposed a root integrated squared error (RISE) statistic to examine model-data fit by comparing nonparametric/empirical with parametric/theoretical IRFs.  Data from undergraduate students (N=1,255) in an introductory statistics course at a large university in the southeast are used to demonstrate this approach.  The test (25 items) covered material dealing with confidence intervals (CI) and hypothesis testing (HT) for single samples.    This study compares the residual based Infit and Outfit statistics with the RISE statistic. Preliminary results suggest that the RISE statistic in conjunction with the T-H procedure provides useful analytical information and diagnostic graphical displays for evaluating item misfit.
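
As a rough illustration of the ingredients named in this abstract, the sketch below applies one pass of Hanning (a running weighted mean with weights 1/4, 1/2, 1/4) to proportions correct and compares the smoothed curve with the Rasch IRF. The end-point rule, the count-weighted discrete form of the RISE computation, and all data values are assumptions made for illustration; they are not taken from the study.

```python
import math

def rasch_irf(theta, b):
    """Dichotomous Rasch model: P(correct | ability theta, difficulty b)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def hann(values):
    """One pass of Hanning: weights 1/4, 1/2, 1/4 on neighbouring points.
    End points are copied unchanged (an assumed convention)."""
    if len(values) < 3:
        return list(values)
    inner = [0.25 * values[i - 1] + 0.5 * values[i] + 0.25 * values[i + 1]
             for i in range(1, len(values) - 1)]
    return [values[0]] + inner + [values[-1]]

def rise(empirical, model, weights):
    """Discrete approximation to a root integrated squared error:
    weighted root mean squared gap between two IRFs on a common grid."""
    total = sum(weights)
    squared = sum(w * (e - m) ** 2 for e, m, w in zip(empirical, model, weights))
    return math.sqrt(squared / total)

# Hypothetical ability groups, observed proportions correct, and group sizes.
thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]
raw_props = [0.10, 0.35, 0.40, 0.80, 0.90]
counts = [50, 100, 200, 100, 50]

smoothed = hann(raw_props)                  # empirical IRF after smoothing
model = [rasch_irf(t, 0.0) for t in thetas] # Rasch IRF with assumed b = 0
rise_value = rise(smoothed, model, counts)  # larger values suggest worse fit
```

In this form a value of zero means the smoothed empirical curve and the model curve coincide on the grid; how large a value signals misfit is left open here.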

Comparison of Bfit and Mantel-Haenszel for Examining Differential Person Functioning in the Rasch Model.  A. Adrienne Walker

This study compares the use of the Between Group (Person) fit statistic, Bfit, and the Mantel-Haenszel procedure on a transformed item-person matrix for detecting differential person functioning (DPF) in the Rasch model. The analysis is based on simulated data and indicates that the Bfit and Mantel-Haenszel procedures have similar accuracy in detecting DPF when the data are designed to fit the model with only small amounts of DPF. The Bfit statistic appears to have more power to detect DPF when larger numbers of persons exhibit DPF. However, in general, the power and accuracy of DPF detection were low for both procedures.
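
For readers unfamiliar with the Mantel-Haenszel procedure, the sketch below computes the generic Mantel-Haenszel common odds ratio across matched strata. It does not reproduce the transformed item-person matrix used in the study; the function name, the 2x2 cell layout, and the counts are assumptions made for illustration.

```python
def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio over 2x2 tables, one table per matching stratum.

    Each stratum is (a, b, c, d):
      a = group 1 correct,  b = group 1 incorrect,
      c = group 2 correct,  d = group 2 incorrect.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum carries no information
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("odds ratio undefined: no stratum has b * c > 0")
    return numerator / denominator

# Hypothetical counts for two strata.
example = [(20, 10, 10, 20), (30, 10, 15, 20)]
alpha_mh = mantel_haenszel_odds_ratio(example)
```

A value near 1 indicates similar odds of success in the two groups after matching; values far from 1 point to differential functioning.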

Evaluation of item fit in rehabilitation outcomes research.   Brittany Hand

Item response theory (IRT) methods have become best practice in developing new patient reported outcome measures in rehabilitation (Velozo et al., 2012). It is unknown, however, how rehabilitation researchers evaluate individual item fit to IRT models. As evaluation of item fit is an integral part of instrument development with IRT models, it is necessary to characterize current trends in the literature (Reeve et al., 2007). We conducted a systematic review to characterize item fit evaluation in rehabilitation literature.

Five databases were searched with relevant search terms. A priori inclusion criteria required that articles were published in a peer-reviewed journal in English between 2005 and 2015 and reported individual item-fit statistics for an IRT model. First, duplicate articles were removed from search results. Next, article titles and abstracts were evaluated for inclusion. Last, full texts of remaining articles were reviewed.

One hundred seventy-five articles met inclusion criteria. Descriptive statistics were computed to determine trends in the type of IRT model used, number of methods of item fit evaluation, and type of item fit evaluation. Most articles used the Rasch/1PL model (n=103). Sixty-seven percent of articles used more than one method of item fit evaluation. The most commonly used measure of item fit was differential item functioning (n=159), followed by residual-based statistics (n=88).

Methods of item fit evaluation were almost exclusively contingent upon the statistical software used, without considering item information. Future research should include multiple types of item quality indications to promote a global understanding of individual item functioning when developing and validating rehabilitation instruments.

A Comparison of Item Fit Indices for Rasch Model.   Jinwang Zou

Classic item fit indices in Item Response Theory all require a grouping strategy, which results in certain limitations in application. This proposed research aims to compare the performance of three goodness-of-fit indices for item fit assessment that do not require a grouping strategy. The three new goodness-of-fit indices are the Standardized Pearson test, the Unweighted Sum of Squares test, and the Information Matrix test. A simulation study will be conducted to compare the performance of the three item fit indices in a Rasch model scenario. The three statistics will be compared first on their control of the Type I error rate and then on their power.


Session 3-B: Testlets and Dimensionality

Room B103    1:30 pm to 3:00 pm

Unidimensional Interpretation of Multidimensional Data.   Steffen Brandt

One of the basic assumptions in IRT is that a test has to be unidimensional in order to be interpreted unidimensionally. This seems to be a redundant statement; however, as a matter of fact, in many cases one and the same test is interpreted unidimensionally as well as multidimensionally. Examples are large-scale assessments such as PISA, TIMSS, PIRLS, and NAEP. This paper discusses the different approaches currently used by these assessments as well as possible alternatives, given by hierarchical models and the Generalized Subdimension Model. Beyond the technical characteristics of the models used and their impact on, e.g., reliability and model fit, the discussion also emphasizes the impact of the different approaches on the interpretation of the results, that is, on their validity.

Polytomous extension of Item Explanatory Rasch Models: An application to the Carbon Cycle assessment data.   Jinho Kim

Item property effects are easily confused with multidimensionality when items and persons interact. For this case, this study addresses how explanatory item response models can be applied to the Carbon Cycle test using polytomous items, by comparing models in terms of goodness-of-fit, parameter agreement, and practical interpretability.

A Combination of Diagnostic Classification Model and IRT Model with Testlet Effect.   Manqian Liao

Diagnostic Classification Models (DCM) use a set of categorical latent variables to indicate the examinees’ mastery state of a set of skills in order to provide useful feedback for test-takers or other stakeholders. However, they do not provide a general ability estimate that can be used for accountability purposes, so they might not fulfill the needs of some high-stakes tests. Nor do they take testlet effects into account, which could lead to violation of the local item independence assumption and to biased diagnostic information.

This study proposes a model that combines a testlet IRT model with a DCM, with two aims. First, the proposed model intends to take into account local item dependence due to the testlet effect. Second, it aims to serve dual purposes (both accountability and diagnostic purposes) within one test administration. In the proposed model, mastery states, the continuous trait, and the testlet effect will be estimated simultaneously. The parameters will be estimated with Maximum Likelihood Estimation, and a simulation study will be conducted to evaluate the estimation quality.

A Multi-Group Cross-Classified Testlet Model for Dual Local Item Dependence in the Presence of DIF Items.   Dandan Liao

The key assumptions of local item independence and local person independence are required when applying standard item response theory (IRT) models. In practice, some assessments might use items that are clustered around the same content area or scenario, which introduces local item dependence. Other assessments might intentionally or unintentionally involve examinees that are clustered in educational units, which introduces local person dependence. Multi-group testlet models have been proposed to account for local item and person dependence concurrently. In addition, when a cross-classified structure is fitted with a hierarchical structure, the standard error estimates associated with the incorrectly modeled clustering variable will be underestimated (Meyers & Beretvas, 2006). This study intends to generalize the multi-group testlet model to accommodate complex cross-classified item clustering structure in testlet-based assessments. A multi-group cross-classified testlet model is evaluated in simulated conditions. Preliminary results using an exemplary dataset are provided to evaluate parameter recovery of the proposed model. This study will provide practitioners with more empirical evidence in analyzing data with cross-classified testlet structures in a multi-group framework. Results of this study could also be helpful for comparing how items function differently across non-equivalent groups without the measurement invariance assumption.


Session 4-A:  Test Construction / Validation A
Room B102     3:30 pm to 4:30 pm

Analysis of a reading comprehension test using the Rasch model.   Nadine Talbot

Reading tests often do not reflect the complexity of reading comprehension (Keenan, Betjemann & Olson, 2008; Sweet, 2005). A test was developed to assess different levels of reading comprehension among fifth-grade students in French elementary schools. This test is currently used by several teachers in their classes. The objective of this research is to determine the psychometric qualities of this test. It consists of one informative text of 587 words with complexity controlled for topic, vocabulary, syntax and length. After reading the text, students answer eleven open-ended questions about text organisation, textual information, main ideas and different inferences. The sample consisted of 163 students in 9 class groups across 4 French elementary schools. The psychometric qualities of the reading comprehension test were analyzed with the Rasch model. The assumption of unidimensionality was assessed with a scree test (Cattell, 1966). In this presentation, the reading comprehension test will first be presented. Then the item characteristic curves, the test information function, the level of difficulty and the discrimination will be discussed.

Lens Models of Human Judgment for Rater-Mediated Assessments.   Jue Wang

Egon Brunswik developed the lens model within the context of probabilistic functionalist psychology (1952, 1955a, 1955b, 1956). Several researchers have adapted the use of lens models to examine human judgments (Hammond et al., 1964; Tucker, 1964; Cooksey, 1996). Previous empirical research based on the lens model has utilized correlational and regression-based statistical analyses. The purpose of this study is to explore the use of lens models with methodological analyses based on Rasch measurement theory within the context of rater-mediated assessments. Data from 50 middle-grade students are used to illustrate how the lens model can improve our understanding of judgments in a high-stakes writing assessment program in Georgia. Rasch measurement theory has already been used to evaluate raters (Engelhard, 1992, 1994, 2013). This study systematically uses the lens model framework to extend this earlier work, and to explore new pathways for understanding the quality of ratings obtained in rater-mediated assessments. Preliminary results indicate that different domains used to evaluate writing vary in judged difficulties, and that the rating-scale structures vary over raters. The lens model provides a new perspective and approach for evaluating rater accuracy and rating quality. The final paper will highlight the implications for research, theory and practice of using lens models to evaluate the psychometric quality of rater-mediated assessments.

New response times based index for detecting aberrant behavior.   Kaiwen Man

Response Times (RTs) have many advantages for detecting aberrant examinee behavior. Several detection methods have been proposed, but these methods require users to choose an appropriate RT model. Here, we propose a new nonparametric discordant analysis method and present the results of a simulation comparing it with existing methods. The RT model is a Rasch-based response time model that can adapt more easily to practical settings.

We propose a new person-fit index based on response times (RTs) rather than item responses to detect aberrant behaviors in educational testing. The new index could improve the accuracy of detecting different kinds of aberrant behavior, with higher sensitivity than other recently proposed indices.

Comparison of classification accuracy based on item response theory (IRT) and measurement decision theory on tests with polytomous items.   Yating Zheng, Hong Jiao, Kecheng Xu  

Recently, measurement decision theory, which derives from Bayes' theorem, has become an attractive alternative to IRT in making classification decisions. Its key idea is to get a best estimate of an examinee’s mastery state based on the examinee’s item responses, item parameters and population classification proportions. Previous research indicates that measurement decision theory requires fewer items and smaller sample sizes to achieve the same level of accuracy as IRT models. However, these studies primarily focused on dichotomously scored items. This study explores the application of measurement decision theory to tests with polytomous items. Classification accuracy is compared between the partial credit model and measurement decision theory at different test lengths with various sample sizes on tests with polytomous items using simulated data. Results from 1000 replications indicate that, in general, measurement decision theory provides higher classification accuracy than the partial credit model. Among the three criteria of measurement decision theory (maximum likelihood, minimum probability of error, and maximum a posteriori), the maximum a posteriori (MAP) criterion provides the highest classification accuracy. The classification accuracy of MAP for tests with very short test length and small sample sizes is even higher than that of the partial credit model for tests with long test length and large sample sizes.
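
The MAP criterion described above can be sketched as a direct application of Bayes' theorem: the posterior for each mastery state is proportional to the population proportion times the product of the conditional probabilities of the observed response categories. This is a minimal sketch under assumed inputs; the state labels, probability tables, and function names are hypothetical, and all probabilities are assumed to be strictly positive.

```python
import math

def posterior(responses, cond_probs, priors):
    """Posterior probability of each mastery state given item responses.

    responses:  observed category index for each item
    cond_probs: cond_probs[state][item][category] = P(category | state)
    priors:     priors[state] = population classification proportion
    """
    log_post = {}
    for state, prior in priors.items():
        value = math.log(prior)
        for item, category in enumerate(responses):
            value += math.log(cond_probs[state][item][category])
        log_post[state] = value
    # Normalise on the log scale for numerical stability.
    peak = max(log_post.values())
    unnormalised = {s: math.exp(v - peak) for s, v in log_post.items()}
    total = sum(unnormalised.values())
    return {s: v / total for s, v in unnormalised.items()}

def map_classify(responses, cond_probs, priors):
    """Return the state with the highest posterior probability (MAP)."""
    post = posterior(responses, cond_probs, priors)
    return max(post, key=post.get)

# Hypothetical two-item test, three score categories per item.
priors = {"master": 0.5, "nonmaster": 0.5}
cond_probs = {
    "master":    [[0.1, 0.3, 0.6], [0.2, 0.3, 0.5]],
    "nonmaster": [[0.6, 0.3, 0.1], [0.5, 0.3, 0.2]],
}
post = posterior([2, 2], cond_probs, priors)
decision = map_classify([2, 2], cond_probs, priors)
```

With equal priors, as in this example, the MAP decision coincides with the maximum likelihood decision; the two criteria can differ once the population proportions are unequal.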


Session 4-B:  Test Construction / Validation B
Room B103     3:30 pm to 4:45 pm

The Test Design Line for modeling the distribution of items.   Agustin Tristan

I present further evidence of the need for a model for the distribution of the items, uniformly distributed in difficulty across the scale from low to high. This idea follows the proposals by Wright & Stone (1988) and other analysts, but completes them with the concept of the "Test design line" (TDL), which provides the expected difficulties for a given set of items, as suggested by Tristan and Vidal (2007).

Several pieces of evidence for the relevance of the TDL have been obtained in the last few years, and the paper presents its use on different tests: national and international, used in a single class or nationwide. On the basis of the TDL, a set of analyses may have a clear interpretation and also a theoretical reference, such as differential test functioning, expected Cronbach’s alpha, and construct validity analysis.

Evaluating the Quality of an Operating Item Bank for Trait Estimation by Assembling Tests with Different Types.  Chi-Chen Chen, Hsiu-Yi Chao, Jyun-Hong Chen

In order to evaluate a given item bank, this study proposes an approach for exploring the strength of the item bank in trait estimation based on assembling five distinct testing formats. These five testing formats are CAT, OAT, RA, CUT, and WAT, with the first two providing the empirical and theoretical upper bounds, the third the baseline, and the last two the empirical and theoretical lower bounds of an item bank’s efficiency, respectively. Three simulation studies, with different degrees of practical considerations in the simulation settings, were planned to demonstrate the implementation of the proposed item bank evaluation approach. The results of Study 1, which has been completed, indicate that the approach can provide critical insight into practical considerations in trait estimation for both test practitioners and test takers. Though more studies are needed to further validate the approach, because of its easy implementation the authors strongly recommend reporting evaluation outcomes for operating item banks based on these five testing formats.

Exploring Adaptive Property with Heuristic Item Selection Procedures for Improving Trait Estimation in Computerized Adaptive Testing.   Jyun-Hong Chen, Hsiu-Yi Chao, Chi-Chen Chen, Shu-Ying Chen

Compared with paper-and-pencil tests, computerized adaptive testing (CAT) possesses two distinct properties: the adaptive property and the sequential property. The adaptive property refers to applying a certain rule (e.g., maximum Fisher information) to select the optimal item for the examinee’s current trait estimate, and is known as the main source of CAT’s efficiency. However, few studies have noted that the interaction between the adaptive property and the sequential property is also beneficial to test efficiency because it prevents perfect response patterns. The purpose of this study, thus, is to examine the adaptive property and the sequential property and to utilize their interaction effect to improve trait estimation for examinees with extreme trait levels. Two simulation studies will be conducted accordingly, the first to verify the adaptive-sequential interaction effect and the second to investigate the proposed item selection procedure. We expect that the trait estimation of CAT is more precise than that of optimal adaptive testing, which assembles the optimal test based on the true trait level, for examinees with trait level close to 1 but is worse for examinees with extreme trait levels, and that the proposed heuristic item selection method can improve the trait estimation for these examinees.

Applying the Mantel-Haenszel Method to Exploratory Differential Item Functioning Assessment in Continuous Items.  Hsiu-Yi Chao, Jyun-Hong Chen, Ching-Lin Shih

Since DIF assessment is critical for test validity and fairness, several DIF assessment methods have been proposed for discrete items. However, detection methods for continuous items have rarely been established. This study proposes a method based on the MH method, which has been widely applied in practice, and combines it with the scale-purification and DIF-free-then-DIF strategies for exploratory DIF assessment in continuous items. According to the simulation study, the continuous MH method with scale-purification can detect DIF items effectively, with better performance in all conditions. Since it can be easily implemented with simple calculations, this method is recommended to test practitioners and test administrators for DIF assessment to increase the quality of tests with continuous items.

Identifying Compromised Test Items Using the Rasch Model and Support Vector Machines.   Sarah Thomas

Each year thousands of people must pass standardized exams to be certified in medical, legal, clinical, and technological fields. Unfortunately, the increasing number of examinees taking such tests seems to have been accompanied by an increase in the instances of reported cheating and the invention of more sophisticated cheating techniques. The stealing and sharing of proprietary test content, such as when items are recorded and then leaked, is associated with legal consequences for the culprit but can also have negative consequences for the integrity, reputation, and budget of a testing program. Leaked items, also called compromised items, represent a significant threat to the integrity of testing. The purpose of the present study is to detect compromised items using a combination of Rasch model analysis and Support Vector Machines (SVMs) from the field of machine learning. In this study, we combined the results of a Rasch model analysis with SVMs to detect items of an international healthcare certification exam (N = 13,584) that were compromised in screenshots or notes. The SVMs classified items as suspected compromised or suspected uncompromised using Rasch model estimates, representing a novel method of combining SVMs with the Rasch model. We will discuss the results in terms of which Rasch model estimates are most important in item classifications, the overall accuracy of our SVMs, and the future of this methodology for classifying examinees.


Day 3, April 6, Wednesday Morning

Session 5 (Plenary Session)
Fellowship Hall   8:30 am to 10:00 am

Noncognitive Indices for the Nation’s Report Card.   Jonas P. Bertling

The core goals for education systems in the 21st century have shifted from teaching clearly defined knowledge and skills to promoting lifelong learners who are able and eager to face the demands and challenges of a truly global society. National and international large-scale assessments (LSA) have started to broaden their focus from achievement results as their only key outcome to also measure skills, strategies, attitudes, and behaviors that are distinct from content knowledge and academic skills, often called “noncognitive factors”. This change reflects the shift from viewing questionnaire variables primarily as background factors that put achievement results into context to viewing these variables as measures of separate constructs in their own right. Measuring noncognitive factors in LSAs faces the challenges of how robust measurement approaches can be implemented while keeping student burden low, maximizing cross-cultural comparability of recorded responses, and mitigating growing concerns of students, parents, teachers, or school administrators about sensitivity and privacy around questions asked in school-based surveys. I will present current efforts for the National Assessment of Educational Progress (NAEP) to develop several new noncognitive modules for the Nation’s Report Card and highlight innovations in two key areas: (1.) expanding the constructs measured in NAEP to include factors such as grit, desire for learning, or self-efficacy; and (2.) creating IRT-based questionnaire indices for enhanced reporting. My talk will close with a discussion of promising directions and challenges for future research and practice on noncognitive measurement in LSAs.

Substantiating Goal Fulfillment in Higher Education: Developing Useful Measures of Non-Cognitive College Success Indicators.   Eric Jenson, Danny Olsen, Richard Sudweeks

During the last few years, accreditation bodies have increasingly tasked higher education institutions with measuring and reporting the degree to which university core objectives are being accomplished. Many universities are using self-report surveys and classical statistics to indirectly measure and document the fulfillment of these objectives. Scale creation and analysis using classical statistics have many limitations and challenges that can obscure the real outcomes. In an effort to provide better measurements of college success indicators, researchers at a large private U.S.-based university developed constructs reflecting the university's core cognitive and non-cognitive objectives. These constructs were then analyzed using a Rasch methodology. Besides illustrating current reporting schemes, the researchers will explain the process followed, the results found, and the challenges experienced in creating reports understandable to audiences not versed in Rasch methods.

Unpacking the Opportunity-to-Learn Measures in PISA 2012 using Multidimensional Partial Credit Model with Latent Regression: A Case of Indonesian Students.   Diah Wihardini

This study investigates students’ opportunity-to-learn (OTL) measures introduced in PISA 2012 to determine how student background influences OTL so that policy makers can orient education policy reforms accordingly. Such reform is especially needed in a developing country with stagnantly low performance on PISA like Indonesia. Using the PISA 2012 dataset on Indonesian students, we examine to what extent (1) the OTL aspects (i.e. Content Exposure, Teaching Practice, and Teaching Quality) can be measured using the given indicators of effective classroom learning practices and environment, (2) the modeled OTL measures can explain math performance, and (3) student/school background variables impact the OTL levels. Multidimensional Rasch models with multiple latent regression were used in data analysis. Preliminary results provide evidence of the multidimensionality of OTL and a low correlation between OTL and math ability (r = 0.2). For these Indonesian students, grade level, school type and program, math learning time per week, and availability of additional math lessons in school had a significantly positive effect on the content exposure aspect of OTL, while getting an out-of-school math lesson had a positive impact across all OTL aspects. The findings show that the differences in local school contexts provide some rationale for the variability of students’ OTL, which in turn affects academic performance. Factors lessening the OTL aspects can be identified and potentially seen as significant hurdles for the implementation of the education reforms for students of diverse backgrounds, and thus seriously addressed in the reform efforts.


Session 6-A: Mind and Body
Fellowship Hall   10:30 am to noon

The application of the “33 cent ruler” for measuring recovery outcomes in psychiatry.   Skye Barbic

Background: Recovery is an overarching goal for mental health service provision and a target for system reform. Surprisingly, very little information is available about how to measure recovery in psychiatry. In 2006, Stenner proposed a definition of a construct theory to be “the story we tell what it means to move up and down a scale for a variable of interest” (p. 308). The application of this definition to rating scale development or testing in psychiatry has yet to be explored. 

Objective: To develop a clinically meaningful measure that robustly captures the recovery story of people with mental illness and mental health problems.

Methods: This study had four phases. In Phase 1a, individuals with mental illness completed three commonly used recovery rating scales. We used Rasch Measurement (RM) methods to assess the extent to which each scale covered the full range of recovery (+/- 3 logits) and reflected a meaningful story of how patients move from low to high. In Phase 1b, we conducted two focus groups with patients to review the results from Phase 1a and propose new items to cover the full range of the concept. In Phase 2, we administered the new items to a new sample, tested their psychometric quality using RM methods, and conducted one more focus group with patients to review results. In Phase 3, the Phase 2 methods were repeated.

Results: A total of 1077 participants filled in the recovery items, and 56 patients participated in the focus groups. Analysis of the final set of items showed good overall fit to the Rasch model (p > .05) and strong construct validity.

Conclusions: This study supports a new approach to measuring outcomes in psychiatry as defined by people with mental illness. The measurement model and approach underpinning this study have the potential to support the clinical relevance of rating scale scores, thereby advancing an evidence-based approach to psychiatry and mental health rehabilitation.

Examining the Psychometric Quality of Household Food Security Estimates Obtained from the U.S. Household Food Security Survey Module.   Emily M. Engelhard

The purpose of this study is to examine the psychometric quality of the U.S. Household Food Security Survey Module (HFSSM) of the Current Population Survey Food Security Supplement (CPS-FSS). The HFSSM is conducted by the U.S. Census Bureau on behalf of the Economic Research Service of the U.S. Department of Agriculture. The HFSSM is the main instrument used to measure food insecurity in U.S. households. The HFSSM consists of 18 items (10 items for all households, and an additional 8 items for households with children under 18). Data from the most recent administration of the CPS-FSS (approximately 45,000 households) are used in this study. This study focuses on model-data fit to the Rasch model with particular emphasis on extending person fit analyses to examine household level fit within the context of measuring food insecurity. The implications of this research for future research, theory and practice related to food insecurity will be included in the final study.

Metrological standards for use in health-care decision making:  An update on research involving the BREAST-Q.   Stefan Cano

Since 2009, BREAST-Q has been used to study health-related quality of life and patient satisfaction in surgical research, clinical practice, and quality improvement initiatives. In this paper, we provide an update on these activities. The two drivers for this paper can be summarised in two quotes. First, when asked about what advice to give aspiring psychologists, Adrian Furnham (Professor of Psychology, University College London, UK) responded: “First, devise and validate a good test (or any sort); second do a few good meta-analyses and/or systematic reviews (and update them)” [1]. Despite the existence of thousands of patient-reported outcome instruments [8], it is rare to find detailed accounts of their use and impact beyond the initial development. The second quote is from a Health Affairs blog by Elizabeth Teisberg (Professor of Health Care and Service Delivery, Dartmouth’s Geisel School of Medicine) who wrote: “The quest for better health care, driven by measuring safety and quality, is well intentioned and has notable achievements. But…the lack of measurement standards [has resulted in] contradictory conclusions” [2]. As such, there is increasing pressure on patient-centred outcome instruments to be fit for purpose and to deliver accurate, consistent, invariant and traceable measurement. This opens up serious considerations about the introduction of metrological standards for use in health-care decision making, and a coordinated regional and international effort to tackle these challenges is recommended.


Session 6-B: Validity Across Disciplines

Room B102   10:30 am to noon

Validating measures of connectedness, empowerment, and meaning: A multi-dimensional item response model study of non-cognitive readiness scales.   Brent Duckor

Examining a sample of 7-12 grade student responses in California public schools, we found a sufficient degree of internal structure validity evidence (APA, AERA, NCME, 2014) to support diagnostic uses of the survey instrument to measure student non-cognitive capacities, particularly among low-income, culturally and linguistically diverse middle and high school students. Using a construct mapping framework, our initial IRT examination of the data found that the unidimensional Partial Credit Model (Wright & Masters, 1982) fits the data very well and results in high reliability (.91). Further multi-dimensional analyses of a sub-sample of the data (n=472) provided evidence for three distinct constructs--connectedness, empowerment, meaning--with only slightly lower reliabilities. Discussion includes focus on alternative paradigms for evaluating non-cognitive outcomes.

Switching Between Models to Measure Metacognition.   George M. Harrison

Efforts to examine variation among learners in their metacognitive awareness—such as for examining intervention effects or correlations with achievement—presuppose that the instruments used to measure metacognition are functioning well. The 52-item Metacognitive Awareness Inventory (MAI) developed by Schraw and Dennison (1994) is a self-report questionnaire that has been extensively used for these purposes. Their intent was to measure eight subcomponents and two overarching components, knowledge and regulation, of metacognition. Previous research examining this intended factor structure is sparse. Taking a pragmatic approach to modeling ordinal data, the authors used both confirmatory factor analysis (CFA) and multidimensional item response modeling (MRCML models) to examine (a) how well the MAI’s intended structure holds up to empirical data, and (b) whether a subset of items can be identified to better fit its intended structure. Using data from 436 undergraduate students, we found that poor global fit indexes and multiple item dependencies indicated that the 52-item MAI functioned somewhat poorly. In seeking a parsimonious instrument, we recursively alternated between CFA and MRCML models, eliminating items with high redundancy, poor fit, category disorder, and item dependence while keeping the eight subcomponents represented. We identified a 20-item shortened version, with good global and item fit, less redundancy, no obvious category disorder, and one instance of item dependence. Though the items might function differently when administered as a shortened set, the results provide empirical evidence for use in validity arguments, particularly when deciding how to best aggregate students’ responses into inventory scores.

Validity of the teaching assessment portfolio used by the Chilean National Teacher Evaluation System (NTES).   Daniela Jiménez, David Torres Irribarra

This paper reports analyses concerning the validity of the teaching assessment portfolio used by the Chilean National Teacher Evaluation System (NTES). 

As part of a validation effort, we look at four facets of the NTES: (a) the dimensionality of the instrument, (b) the performance of the indicators in distinguishing teacher performance levels, (c) the robustness of the teacher classification into performance levels, and (d) evidence of convergence with other instruments of teacher quality.

On the topic of dimensionality, we discuss the complexity of assessing and summarizing the myriad of aspects involved in teacher practice into a single set of performance levels (Unsatisfactory, Basic, Proficient, Outstanding) and compare the NTES model to alternative theoretical models used by other well known teacher assessment instruments. This analysis is used as context to interpret results of a confirmatory factor analysis performed on the NTES data.

Regarding the performance of the indicators used to rate teacher performance in the portfolio, we use the Level Partial Credit Model (L-PCM) to examine the extent to which the indicators support the determination of a set of performance cut-points.

To examine the robustness of teacher classification, we compare the NTES classification to alternative classifications produced by three latent variable models: (i) the L-PCM, (ii) the Latent Class extension to the L-PCM, and (iii) an Ordered Latent Class Model.

Finally, we examine the consistency between teachers’ performance in the NTES and their results under the Classroom Assessment Scoring System (CLASS) and the Learning Mathematics for Teaching (LMT).


Day 3, April 6, Wednesday Afternoon

Session 7-A:  Comparability and Stability
Fellowship Hall   1:30 pm to 3:00 pm

Potential threats to Cross Country Comparisons of Adolescent Mental Health – A Rasch analysis based on HBSC-data from four Nordic countries.   Curt Hagquist


Comparisons of time trends in adolescent mental health between different countries are drawing increasing attention. The purpose of the study is to analyse the psychometric properties of a scale on psychosomatic problems, with a focus on requirements for invariant comparisons across countries and across years of investigations.


The study makes use of data collected from the Health Behavior in School-aged Children (HBSC) study among students 11, 13 and 15 years old in Denmark, Finland, Norway and Sweden in six years of investigation. The Rasch model was used to examine whether the instrument, consisting of eight items with five response categories, met the measurement requirements of invariance and proper categorisation. A particular focus was on Differential Item Functioning (DIF) across countries and time.


The analyses show that the responses to the original item set cannot be summarised and used as a proper measure of psychosomatic health problems. The categorisations of the items do not work as intended, which is manifested by items showing disordered thresholds. Some items do not work in the same way across countries and time, which is evidence of DIF. These problems do affect the person measurement, in particular the time trends for girls in Finland and Sweden.


One option to address the problems would be to collapse two pairs of response categories and to replace the problematic item in the Finnish data, since there are Finnish data available that better correspond to the content of the item Low in the other Nordic countries.

Model fit and item hierarchy stability of repeated measures data: Initial results from Rasch analysis of R-SPQ-2F.  Vernon Mogol, Marcus Henning, Yan Chen, Andy Wearn, Jennifer Weller, Jill Yielder, Warwick Bagg

The Revised Two-Factor Study Process Questionnaire (R-SPQ-2F) was developed using True Score theory to measure students’ deep (DA) and surface (SA) approaches to learning. Using the stacked data, the study investigated the extent to which the DA and SA scales satisfy the Rasch measurement model. Using the racked data, it also examined whether rankings of item locations remained stable across time. Data were fitted to the model using WINSTEPS. Scale assessment included investigation of rating scale functioning, item fit, targeting, reliability, dimensionality and differential item functioning (DIF). Item hierarchy stability was examined graphically. University of Auckland medical students were invited to participate in a longitudinal study. Data were gathered at eight time points from July 2013 to October 2015. A total of 907 students participated in the study; only 24 participated at all time points. Initial results showed that the rating scale functioned well and the items showed acceptable fit for the two scales. The SA scale had poorer targeting than the DA scale; the mean person location (-1.12) was far from the mean item location. The range of person locations (-4.24 to 2.32) was not adequately covered by item thresholds (-4.68 to 3.21), in that there were not enough thresholds to accurately estimate locations of persons with very low SA. Both scales had acceptable reliabilities (> 0.70); they also satisfied the model’s unidimensionality requirement. No gender DIF was detected. The item hierarchy was relatively stable across the eight time points.

Growth in Numeracy proficiency: how guessing on multiple choice items affects a vertical scale.   Ida Marais

The dichotomous Rasch model is often used to analyse assessments with multiple choice items, even though it makes no provision for guessing. In this study a procedure for removing bias due to guessing from Rasch item estimates is applied to vertically scaled numeracy assessments. Results show that the procedure eliminates the bias and that the more difficult items appear more difficult than when there is no correction for guessing. The proficiency estimates that result from unbiased difficulty estimates are increased relative to the standard Rasch analysis proficiency estimates, especially at higher proficiencies, showing that the standard Rasch model analysis underestimates growth in numeracy across the year groups.
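
As a rough illustration of why guessing can make difficult items look easier under the dichotomous Rasch model, the following Python sketch compares a true item difficulty with the difficulty the plain model would imply once guessing is present. It is not the correction procedure used in the study. The guessing mechanism (a person who would otherwise fail guesses correctly with chance c), the function names, and the values of theta, b and c are all assumptions chosen for illustration.

```python
import math


def rasch_p(theta, b):
    """Dichotomous Rasch model: probability of a correct response
    for person location theta and item difficulty b (in logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def p_with_guessing(theta, b, c):
    """Assumed guessing mechanism (illustrative only): a person who
    would otherwise fail still answers correctly with probability c."""
    p = rasch_p(theta, b)
    return p + (1.0 - p) * c


def apparent_difficulty(theta, p_observed):
    """Difficulty the plain Rasch model would need in order to
    reproduce p_observed for a person located at theta."""
    return theta - math.log(p_observed / (1.0 - p_observed))


theta = 0.0  # hypothetical person location
c = 0.25     # hypothetical chance level, e.g. four response options

for b in (-1.0, 0.0, 2.0):
    b_hat = apparent_difficulty(theta, p_with_guessing(theta, b, c))
    print(f"true difficulty {b:+.2f}  apparent difficulty {b_hat:+.2f}")
```

Under these assumptions the apparent difficulty is always below the true difficulty, and the gap is largest for the hardest item, which is consistent with the abstract's statement that the more difficult items appear more difficult once the bias is removed.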


Session 7-B: Rating
Room B102   1:30 pm to 3:00 pm

Exploring the relationship between essay features and rating quality.   Stefanie A. Wind

Although raters are trained to use certain essay characteristics to guide their evaluation of student essays (e.g., those listed on the scoring rubric), additional research is needed to more fully understand the influence of unintended cues in essays that may influence the rating process. The current study explores the relationship between linguistic features of essays, such as syntactic complexity and level of narrativity, and the quality of ratings in the context of large-scale writing assessments. Specifically, a sample of handwritten student essays from a large-scale writing assessment was transcribed and analyzed using the Coh-Metrix Text Easability Assessor (McNamara et al., 2014). Then, correlation analyses were used to explore the relationship between textual characteristics of essays and indices of rating quality based on the Rasch Rating Scale model (Andrich, 1978) and a dichotomous model for rater accuracy (Engelhard, 1996) within domains on an analytic rubric. Preliminary findings suggest variation in the relationship between essay characteristics and Rasch-based indices of rating quality across the Conventions, Ideas, Organization, and Style domains on the analytic rubric. Implications for research, theory and practice will be discussed in the final manuscript.   

Using pairwise comparisons and calibrated exemplars in the standardized assessment of student writing.  Joshua McGrane, Stephen Humphry, Sandy Heldsinger

Standardized assessment programs have increasingly moved toward the inclusion of extended written performances, amplifying the need for reliable, valid and efficient methods for writing assessment. This study explores the extent to which a two-stage method using calibrated exemplars may provide a viable alternative or complement to existing methods of writing assessment.  Written performances were taken from two years of Australia’s standardized assessment program, which included both narrative and persuasive performances from students aged 8 to 15. In Stage 1, markers performed pairwise comparisons using two criteria, authorial choices and writing conventions, on 160 performances to form a scale of 36 calibrated exemplars. Consistent with previous research, the pairwise judgements showed a very high level of reliability and concurrent validity. In Stage 2, markers assessed 2380 new narrative and persuasive performances by matching them to the most similar calibrated exemplar using the same two criteria. These matching judgements showed a generally high level of reliability and concurrent validity, and were more efficient than rubric ratings after a familiarisation period. However, the levels of reliability and concurrent validity were somewhat lower than in previous research for the narrative performances and the conventions criterion. Further research is suggested to enhance Stage 2 by using fewer exemplars and supplementing them with detailed descriptors. Overall, the findings provide preliminary support for the viability of the method in writing assessment across a wide range of ages. The article also discusses the potential application of the method for comparable performance assessments in a range of learning areas.

Examining Rater Judgements in Music Performance Assessment Using Rasch Measurement Theory.   Pey Shin Ooi, George Engelhard, Jr.

The fairness of raters in music performance assessment has become a concern in the field of music. The assessment of students’ music performance mainly depends on rater judgements. Hence, the quality of rater judgements is crucial to provide fair, meaningful and informative results. However, in real life, there are many external factors that can influence the quality of rater judgements. Different analyses have been carried out to examine the quality of rater judgement (e.g., Generalizability Theory). However, there are limitations with the previous analysis methods that are based on Classical Test Theory. Thus, this study proposes the use of modern measurement theory, Rasch measurement theory, to examine the quality of rater judgements. The Many-Facets Rasch Rating Scale Model was employed to investigate the extent of rater-invariant measurement in the context of university degree students’ music performance assessment (N=159) rated by 24 raters. This includes examining the rating scale structure, the severity levels of the raters, the interaction between rater severity levels and musical instrument subgroups, and the interpretations of the items by the raters across musical instrument subgroups. The preliminary results found that there were differences in severity levels among the raters and unexpected rating patterns. The results of this study also suggested that raters had different severity levels when rating different musical instrument subgroups and varied in interpreting the items across musical instrument subgroups. The implications for research, theory and practice in the assessment of music performance will be included in the final manuscript.


Session 7-C:  Growth
Room B103   1:30 pm to 3:00 pm

Reflections on Academic Growth: the North Carolina Precedent.   Gary L. Williamson

There are at least two important advantages of taking a long-term (growth-curve) view of academic growth.  First and foremost, individual growth curves yield understandings and insights about growth that are not available from any other methodology.  Secondly, developmental scales based on conjoint measurement models provide unique interpretive advantages for investigations of academic growth.  In particular, interpretations of student academic growth benefit from the use of Rasch-based measurement scales that have been anchored in a real-world task continuum by means of a construct specification theory.  Two decades ago, one state had the foresight and commitment to utilize such scales with longitudinal data; their precedent continues today.  

Measurement innovations adopted in North Carolina provide interesting, insightful interpretations of academic growth, which are illustrated by a consideration of three detailed examples.  In the first example, fifteen successive statewide average reading growth curves are presented and annotated with historical policy actions related to assessment, accountability and early intervention efforts.  The second example demonstrates that a single common measurement framework can simultaneously address five interpretive perspectives, including: student reading growth; reading achievement level standards; K-12 text complexity standards; postsecondary reading demands; and, reading ability demands for occupations.  The third example makes incremental velocity norms for reading and mathematics average growth a reality, based on parametric mathematical models of individual growth curves.

More Funding in Exchange for Increased School Accountability: Policy Effects Using Quantile School Value-Added Approach.   Maria Veronica Santelices

This article explores the effect of a policy recently implemented in Chile that increases funding for schools based on the number of low-income students they enroll. In exchange for increased funding, schools commit to tighter school accountability. School effectiveness is studied using a novel and detailed methodology: the quantile value-added approach. This approach examines school improvement for groups of students at different achievement levels. Results from two sets of cohorts show that the new law has not increased school effectiveness.

On the Complex Geometry of Individuality and Growth: Cook’s 1914 “Curves of Life” and Reading Measurement.   William P. Fisher, Jr., A. Jackson Stenner

Growth in reading ability varies across individuals in terms of starting points, velocities, and decelerations (Williamson, 2015). The implications for pedagogy and educational policy are complex. Clarifications emerge, perhaps surprisingly, via analogies from scientific, aesthetic, and democratic values. Cook (1914/1979) draws two applicable inferences from his extensive study of the geometry of proportions in art, architecture, and nature. First, unlike most work in this area, Cook focuses less on statistical averages expressing general patterns of logarithmic proportionality than on individual variation. Cook anticipates the point made by Kuhn and Rasch that the discovery of anomalies, and not scientific laws, is the goal of research, as these individual variations are themselves the hallmarks of growth. Second, Cook emphasizes the value of discrepancies between general laws and specific individuals. This value lies in the fact that, at least until the emergence of fractal methods, these discrepancies were not manufacturable or specifiable before being observed, and accordingly are remarkable points of entry to experiencing the mystery of life and the spell of beauty. Cook’s point is extended, following Bluecher (1968; Wonk, 2002), by drawing an analogy between, for instance, the beauty of individual variations in the Parthenon’s pillars and the democratic resilience of the unique citizen soldiers of Pericles’ Athenian army. The beauty and strength of stochastically integrated variations and uniformities in architectural, natural, and democratic principles further inform the use of discrepancies in individual growth for customized approaches to reading instruction.


Session 8 (Plenary Session)
Fellowship Hall   3:30 pm to 5:00 pm

On Rasch's criterion of invariance as a measurement theory.   David Andrich

This presentation argues that the contribution and significance of the work of the Danish mathematician Georg Rasch to the understanding and practice of social measurement, which entails (i) an a-priori criterion of invariance of comparisons within a specified frame of reference, (ii) rendering this criterion in the form of a class of probabilistic models which have sufficient statistics for all parameters, together with a body of knowledge of statistical inference, and (iii) anchoring the models in an empirical paradigm of item and test construction, meets the range of criteria required of a measurement theory. It is suggested that when the construction of an instrument does meet Rasch’s measurement criteria, and the Rasch model is not used merely to model data with the intention of abandoning it if it does not do so, then the terms “Rasch model analysis”, “analysis according to the Rasch model”, and the like, should be replaced by “Rasch Measurement Theory analysis”, “analysis according to the Rasch Measurement Theory”, and so on.

Future Directions of Rasch Measurement.   Jack Stenner, Mark Stone

This paper addresses the foundations of Rasch Measurement by presenting a number of matters for further consideration and future attention.   These include modes for the dissemination of Rasch model information, status of software, new avenues for implementing Rasch models, theory vs. data, addressing causality, greater attention to producing and procuring valid and representative data, individual-centered vs. group-centered statistics, and more clarity in presenting measurement data to the public.