Standard setting

The European Survey on Language Competences (ESLC) is in many ways a ground-breaking project. The language tests which were developed and the planned phases of alignment and standard-setting represent the most concerted attempt yet to produce a valid measure of functional language proficiency which is consistent and comparable across the five most-learned languages of Europe.

The Survey and the CEFR

The results of the Survey are reported in terms of the Common European Framework of Reference (CEFR). Since its publication in 2001, the CEFR has become a key instrument of language policy in Europe and beyond, particularly through the use of its levels as the basis for defining learning targets and curricula. Language testers in Europe are now expected to provide evidence that their exams do actually correspond to the CEFR levels they claim.

However, this assumes a perfect shared understanding of levels across countries and languages: clearly a goal worth pursuing, but in practice far from being achieved. While it would be impractical and undesirable to enforce such an understanding, we can expect a gradual convergence of use across countries and languages, informed by authoritative points of reference. These require an explicitly multilingual focus, and we see the ESLC as the most significant and carefully designed such study yet.

The Council of Europe has produced a Manual for relating language examinations to the CEFR (Council of Europe 2008). Though very useful, it should not be seen as defining a sufficient set of procedures for this purpose. In particular, the Manual does not explicitly deal with the multilingual dimension which is critical for the ESLC. Given that much of the work involved in developing the Survey is either unique or makes new use of existing methods, it is appropriate that we should apply and cross-validate a variety of sources of evidence.

Evidence from test design and implementation

An important source of evidence concerns the comparability of the language tests themselves. This has been a focus of attention from the test design stage.

A single test design was adopted. Sets of testable subskills were developed for Reading, Listening and Writing, drawn from the descriptor scales of the CEFR at levels A1 to B2.
Each subskill was mapped to a specific task type, e.g. multiple choice tasks.
Writing is made directly comparable by using what is essentially the same set of tasks rendered into the different languages: for example, writing a holiday postcard, or a review of a film.
For Reading and Listening a subset of tasks has also been adapted across languages. Trials indicated general comparability in terms of resulting difficulty.
The test developers worked closely together, vetting and comparing their materials in an explicitly cross-language frame of reference.

This approach allows us some confidence not only that language skills are defined and measured in the same way, but that test tasks should be broadly of comparable difficulty, and test scores thus comparable in terms of ability.

Evidence from can-do statements

Sixteen CEFR-linked can-do statements were included in the student questionnaire to elicit students’ self-ratings of proficiency in four skills. These data provide a framework against which standards can be compared. Such self-assessments have been found useful in previous studies on interpreting CEFR-like frameworks (Jones 2002:167-183).

Evidence from cross-language alignment (Writing)

Writing is the easiest skill to align across languages because samples of written performance can be directly compared with each other by raters who are competent in two languages. Comparison should involve performance on similar tasks, as is possible in the case of SurveyLang.

An alignment study involving the judgments of 125 raters, administered via email over a period of a month, was completed. The data were the transcribed performances of students from the Main Study and the task of the raters was not to assign levels, but to rank in order of quality sets of performances in two languages on particular tasks. The ranking data were analysed using a form of the Rasch model to put all the performances in all languages onto a single ability scale. This replicated a procedure successfully applied to speaking performances by SurveyLang partner CIEP in June 2008 (CIEP 2008, Jones 2009:35-44).
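To illustrate how ranking data of this kind can be placed on a single scale, the following is a minimal sketch, not the analysis actually used in the study. It assumes that each ranking is broken down into paired comparisons and that a pairwise form of the Rasch model applies, in which the probability that one performance is ranked above another depends only on the difference between their scale values. The performance labels, the sample rankings, the function names and the small ridge penalty (added only to keep estimates finite for performances that are always ranked first or last) are all hypothetical.

```python
import math
from itertools import combinations


def rankings_to_pairs(rankings):
    """Turn each ranking (best performance first) into (better, worse) pairs."""
    pairs = []
    for ranking in rankings:
        # combinations() preserves order, so the first element is ranked higher.
        for better, worse in combinations(ranking, 2):
            pairs.append((better, worse))
    return pairs


def fit_pairwise_rasch(pairs, n_iter=2000, lr=0.05, ridge=0.01):
    """Estimate one scale value per performance by penalised gradient ascent.

    Assumed model: P(a ranked above b) = 1 / (1 + exp(-(theta_a - theta_b))).
    """
    items = sorted({p for pair in pairs for p in pair})
    theta = {i: 0.0 for i in items}
    for _ in range(n_iter):
        # The ridge term pulls estimates gently towards zero.
        grad = {i: -ridge * theta[i] for i in items}
        for better, worse in pairs:
            p_win = 1.0 / (1.0 + math.exp(-(theta[better] - theta[worse])))
            grad[better] += 1.0 - p_win
            grad[worse] -= 1.0 - p_win
        for i in items:
            theta[i] += lr * grad[i]
    # Centre the scale so that the mean value is zero.
    mean = sum(theta.values()) / len(theta)
    return {i: t - mean for i, t in theta.items()}


# Hypothetical rankings: each rater orders a mixed set of English (EN)
# and French (FR) performances on the same task, best first.
rankings = [
    ["EN-1", "FR-1", "EN-2"],
    ["FR-1", "FR-2", "EN-2"],
    ["EN-1", "FR-2", "EN-2"],
    ["FR-2", "FR-1", "EN-2"],
]
scale = fit_pairwise_rasch(rankings_to_pairs(rankings))
for name in sorted(scale, key=scale.get, reverse=True):
    print(name, round(scale[name], 2))
```

Because every rater compares performances in two languages, the comparisons link the languages together, which is what allows all performances to be expressed on one common scale.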

The outcomes of the alignment study form a further source of evidence for finalising the standards set for each language.

Evidence from Standard Setting Conference: Cambridge 26-30 September 2011

This major event brought together over 70 experts in each of the five languages as well as language professionals to set standards for the test results from the Main Study held earlier this year.

The conference was organised around procedures largely taken from the CEFR Manual mentioned above: a familiarisation phase followed by several iterations of standard setting for each skill. The five language teams worked separately. Reading and Listening made use of the task-based standard setting procedure described in the Manual (the CITO variation on the bookmark method), and were coordinated by Norman Verhelst, former Director of the ESLC, and one of the Manual’s authors. The method requires raters to estimate the scores of a student who just achieves a given CEFR level, given the text of the task and information about its empirical difficulty relative to other tasks. The third iteration of the procedure presented the information in a slightly different way and was included as a check on the previous sessions.
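The relationship between a rater's judged score and a point on the ability scale can be sketched as follows. This is an illustration only, assuming a simple Rasch model for dichotomous items; it is not a description of the computation used at the conference or prescribed by the Manual. The item difficulties, the judged score and the function names are hypothetical.

```python
import math


def expected_task_score(theta, difficulties):
    """Expected number of items correct on a task for ability theta (Rasch)."""
    return sum(1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties)


def ability_for_score(target, difficulties, lo=-6.0, hi=6.0, tol=1e-6):
    """Find by bisection the ability whose expected task score equals target."""
    if not 0 < target < len(difficulties):
        raise ValueError("target must lie strictly between 0 and the item count")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        # Expected score rises with ability, so move the lower bound up
        # whenever the midpoint still scores below the target.
        if expected_task_score(mid, difficulties) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical empirical difficulties (in logits) for a five-item task.
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]

# A rater judges that a student who just achieves the level would score 2.5.
judged_score = 2.5
cut_ability = ability_for_score(judged_score, difficulties)
print(round(cut_ability, 3))
```

Under these assumptions, each judged score for a borderline student corresponds to one point on the ability scale, so judgments made on different tasks can be compared and combined.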

Writing used a novel approach which required the raters to classify, for each task, a set of sample performances as achieving or failing the CEFR level intended by the task. In this it had similarities with the alignment study described above. The third iteration used the more familiar Body of Work method, where raters saw the complete written performances of 30 students and assigned each student to a CEFR level. As with Reading and Listening, this enabled a check on the previous method.

Finalising standards

Reconciling the different sources of evidence following the standard setting conference proved complex. It is important to arrive at outcomes which serve as well as possible the requirements of a language indicator: a replicable picture of proficiency in CEFR terms, maximising coherence across languages and minimising the risk of reporting possibly contingent effects as substantive.
