Selection procedures

The Advisory Board agreed that the materials would at this stage cover the same five languages as in the 2005 CD.

Criteria

It was agreed that the criteria for the collection and selection of items and tasks would be the following:

Openness – The process of conducting and reporting the project should be transparent.
Inclusiveness – All institutions willing to participate in the project would be entitled to contribute material.
Quality – The selection of the material would be based on documented quality.
Balance – The selected material would provide a good coverage of the relevant types and formats of testing and assessing of listening and reading comprehension.
Relevance and usefulness – The selected material should be relevant to several stakeholders who are interested in quality assessment of listening and reading comprehension.

Materials and documentation

The Council of Europe invited institutions with documented expertise in relating examinations to the CEFR to participate in the collection of items and tasks for reading and listening. Contributing institutions were required to provide the following information and material:

A brief general description of the test/examination
Tasks/items for reading/listening (including sound files) to be made publicly available
A description of the submitted tasks/items (content description, statistical information, CEFR level targeted, ...)
One copy of a full reading/listening paper providing context for the submitted tasks.
A brief description of the linking process to the CEFR and any supporting documentation.

The invitation drew considerably more responses than the number of institutions that contributed to the 2005 collection. The Advisory Board assessed the applications, and the final selection was based not only on quality but also on representativeness and variety of formats. It also took into account the different approaches to operationalizing aspects of the reading and listening constructs found in the relevant literature and in the CEFR.

Quality of the Illustrative Tasks

The contributing institutions were asked to report data analysis results as one source of validity evidence in the selection process, so that it could be checked that the items functioned well and were appropriate for their target groups. The psychometric data associated with each item were carefully vetted, and all the selected items met the standards usually referred to in the professional literature. After a critical discussion it was, however, decided not to publish these indices and their values because:

It might be misleading to use these values in themselves as indicators of the items' CEFR levels without a detailed description and consideration of the target populations.
As item calibration is valid only for samples of specific populations (in terms of skills, cognitive profile, cultural and linguistic background ...), lack of adequate psychometric know-how might contribute to questionable practices in using the data.
The role of psychometrics in the determination of item validity, while important, is not decisive on its own. It helps to provide evidence and to build an argument for item quality, reliability and validity. Items are only valid when they are well written, with adequate content, for a specific context, for a well-defined population and for well-identified uses. Items selected here are deemed valid and can be used in the users' own contexts, but the statistical values associated with them can be slightly (or less frequently, even dramatically) different in another context, depending on how similar the population used in the original item calibration is to the population in the new context.
Item difficulty is a complex concept and not easily predictable. Item difficulty is the product of tasks, candidate characteristics and interactions between them. Thus item difficulty is a probabilistic, not deterministic, outcome.
As mentioned in the introduction, the material may, however, be useful to a wide range of users, extending from language teachers to policy makers. The selected items are in line with what current literature and professional expertise require concerning claims about items being related to CEFR levels.
If a subset of these items is used as part of a broader instrument to assist in “anchoring” a test, the statistical values that the used items receive can be considered as approximate statistical indices of their correspondence with CEFR levels.
The item selection should give the users a good idea about what kind of well-designed items are identified by researchers, assessment professionals and examination bodies to be related to A1, A2, B1, B2, and C1 levels.
However, it is advisable to be aware that these items were selected to serve as a reference tool, not to anchor or relate any particular test/examination to the CEFR.
Relating a test/examination to the CEFR, and equating different versions of it, still requires a set of well-implemented and documented procedures as outlined in the professional literature on aligning, equating, linking and standard setting (see bibliography for references), including the Manual for Relating Examinations to the CEFR (2009).
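
The point that item statistics depend on the population can be illustrated with a small sketch. It assumes a Rasch (one-parameter logistic) model, which is only one of several possible measurement models and is not stated in this document as the model used by the contributing institutions. The function names, the ability values and the item difficulty are all hypothetical and chosen purely for illustration.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the Rasch model (assumed here)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def expected_facility(abilities, difficulty):
    """Expected proportion of correct responses to one item in a group."""
    probabilities = [rasch_probability(a, difficulty) for a in abilities]
    return sum(probabilities) / len(probabilities)


# Hypothetical ability values (in logits) for two candidate groups.
group_a = [-1.0, -0.5, 0.0, 0.5, 1.0]
group_b = [0.5, 1.0, 1.5, 2.0, 2.5]

# Hypothetical difficulty of a single item on the same logit scale.
item_difficulty = 0.5

facility_a = expected_facility(group_a, item_difficulty)
facility_b = expected_facility(group_b, item_difficulty)

print(f"Expected facility in group A: {facility_a:.2f}")
print(f"Expected facility in group B: {facility_b:.2f}")
```

Under these assumptions the same item yields a higher expected proportion of correct answers in the more able group, and each individual response remains a probability rather than a certainty. This is why a facility value reported for one population cannot be read directly as an indicator of CEFR level in another.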

Observations on the materials received

Working on the received materials, the Advisory Board observed some trends which are briefly discussed below with a view to inviting further research.

The analysis of the tasks and accompanying documentation showed that the material covered the whole range of each targeted CEFR level rather than, as intended, providing examples at mid-level. Accordingly, the items presented at each CEFR level can be considered good examples of that level but not equivalent in their representation of the construct.
While there was a large degree of similarity across the material, there was also some variation across languages and contributing institutions. For instance, institutions might differ to some extent in the test formats they prefer. This is a topic of interest for research: What variation is there across languages and across contexts of operation, and how might it be explained?
A range of test formats was used frequently: multiple choice, multiple matching, reordering of text units, and short answers. The True-False format (and its True-False-Not Stated variant) was less common.
The inclusion of items assessing lexical and grammatical knowledge was also fairly common, but such items were usually not included in the collection as they were interpreted not to assess comprehension of larger units of meaning.
The number of options in the multiple choice items was usually 3-4.
The input texts (both written and oral) were, as expected, shorter and less complex at the lower CEFR levels.
Pictures, illustrations and graphs were used quite often and more commonly at the lower CEFR levels.
In most cases only the target language was used in the testlets, including instructions. However, there were some exceptions at lower levels.
Differences were observed in the delivery of listening tasks, for example in how instructions were given (reading the questions first, presenting all questions after the whole text or distributing them after passages), whether the text was played once or twice, the speed of delivery and the variety of accents.
C2 examples are not included in the material. There are two main reasons for this decision: 1) Relatively few samples were submitted, so the pool to choose from was rather limited. 2) The difference between C1 and C2 tasks/items did not always appear very clear. It would seem that more effort on constructing C2 tasks and items is advisable and that it would be useful to try to make a clearer distinction between levels C1 and C2.