How to Think About Assessment
Knowledge and Skills Need to be Assessed Separately
Assessment has a number of purposes, and these are often in conflict. For example, if we want to know what students know, we have to give them knowledge-based tests.
But we can’t infer from these what students can do.
So, if we set assessments where students apply what they know, we get some idea of what they can do. But if they do badly on that assessment, is it because they lack the understanding needed to apply their knowledge, or because they simply don't have enough knowledge to apply to the questions?
This difficulty has one straightforward solution:
Have separate tests for knowledge and application, or place them in separate sections of a test.
This presents us with our first training needs:
In your subject, what is the foundational knowledge all students should have?
What sorts of questions can be set to test that knowledge?
What makes effective multiple-choice questions, and can we automate the marking of them? (A minimal marking sketch follows this list.)
What are the application tasks students should do to test their ability in the skills we would value, even if there were no GCSEs?
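On the automation question: once answers are collected electronically, marking multiple-choice responses is simple to script. Below is a toy sketch in Python; the answer key and responses are invented purely for illustration.

```python
# A toy auto-marker for multiple-choice answers, assuming responses arrive
# as letters (for example from a spreadsheet or forms export).
ANSWER_KEY = ["B", "D", "A", "A", "C"]   # hypothetical key

def mark(responses: list[str]) -> int:
    """Count how many responses match the answer key."""
    return sum(given == correct for given, correct in zip(responses, ANSWER_KEY))

print(mark(["B", "D", "C", "A", "C"]))   # prints 4: one question wrong
```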
What is a Valid Assessment?
Next, we need the assessments to be valid. This means that they allow us to draw accurate conclusions about what students do and don’t know.
Here are the training needs associated with that:
Identify the core knowledge and produce knowledge organisers that just include this.
Train on the effects of spaced learning on long-term memory, so that we understand why end-of-topic tests are not valid assessments.
Train on the need for each assessment to include previously taught knowledge, drawn from all previous units, right back to year 7. Because our knowledge organisers contain only core and foundational knowledge, students will not master our subjects if they forget it.
Consider how much of the domain of knowledge needs to be sampled in any one test for us to draw a conclusion about what students do and don't know. (Is it 25%, 30%, 50%, 75%?) The answer will have to be tempered by the availability of time. An easy shorthand will be to look at GCSE papers and make an estimate – for example, my gut reaction is that a GCSE question on the character of Macbeth will probably only sample about 30% of what students know about all the characters and themes, but 80% of what they know about how to write an essay.
Similarly, if our knowledge organisers contain 100 pieces of knowledge, how many would we need to include in the knowledge test before we were confident their result would predict the same percentage they would score if we asked all 100 questions?
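One way to get a feel for this is a quick simulation. The sketch below assumes a 100-item knowledge organiser and a student who genuinely knows 60 of the items, then measures how far a test of 10, 25, 50 or 75 randomly chosen questions tends to drift from that true 60%. All the numbers are invented for illustration.

```python
import random
import statistics

DOMAIN_SIZE = 100   # items on the knowledge organiser (assumption)
TRUE_KNOWN = 60     # suppose the student actually knows 60 of them
TRIALS = 10_000

def average_error(sample_size: int) -> float:
    """Average gap between a sampled test score and the true score of 60%."""
    known = set(random.sample(range(DOMAIN_SIZE), TRUE_KNOWN))
    errors = []
    for _ in range(TRIALS):
        asked = random.sample(range(DOMAIN_SIZE), sample_size)
        score = sum(item in known for item in asked) / sample_size
        errors.append(abs(score - TRUE_KNOWN / DOMAIN_SIZE))
    return statistics.mean(errors)

for n in (10, 25, 50, 75):
    print(f"{n} questions: average error ~ {average_error(n):.1%}")
```

Runs of this kind show the gap shrinking as the sample grows, which is exactly the confidence-versus-time trade-off described above.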
What is a Reliable Assessment?
Then we want our assessment to test our curriculum. If most students do badly on our test, it may be that the test was poorly designed, or that the curriculum was poorly designed or poorly delivered.
This introduces the reliability of the test: is it possible for all groups to achieve in the test, and would a similar cohort taking it next year get similar results?
To design the test we can look to Bloom's taxonomy. Bloom developed it to help universities write their assessment criteria – test design was its primary intended use.
So, we need to separate the kinds of questions which assess each level of the taxonomy.
1. X number of questions which test the facts you need to know in order to weave baskets.
2. Y number of questions which test understanding of the importance of facts to how you weave baskets.
3. Z number of questions testing if you can apply what you know to a problem basket weavers face in designing and weaving baskets.
4. AA number of questions testing if students can analyse and evaluate a basket design or a basket and consider the pros and cons of weaving this basket.
5. BB number of questions seeing if students can design and weave a quality basket themselves.
This will test the full range of what it means to be good at basket weaving. If our test only contains questions from 3, 4 and 5, up to 50% of the year group will find they get very low scores. If our test only tests 1–4, then we might get great scores, but no assessment of the one thing we want our students to be able to do: weave wonderful baskets.
So, we need to train our departments in constructing good assessments using Bloom's taxonomy.
Weighting of Questions in an Assessment
In practical subjects, some of these questions will have very low value. For example, if I want to produce an excellent chef, I might only assess 1–3 during lessons. But my real goal is to have them produce excellent meals, so my real end-of-topic and summative assessments will pretty much only focus on 4 and 5, or just 5.
If I want to see how great my students are at art, I just want them to produce art in the genres I have taught.
In other subjects, like history, I would place a much higher value on students knowing hundreds of facts and being able to place them chronologically.
So we need to train our departments in how to weight the marks from different parts of their assessments.
For example, you might ask 50 knowledge-based questions to sample enough of the domain, but not weight them as 50% of the assessment.
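As a rough illustration of that weighting, here is a sketch with invented section names, raw marks and weights; the actual split is a departmental decision, not something this example prescribes.

```python
# Hypothetical mark scheme: the 50 knowledge questions sample the domain
# thoroughly but count for only 30% of the final grade.
sections = {
    # section name: (raw score, raw maximum, weight in final grade)
    "knowledge (Bloom 1-2)":         (38, 50, 0.30),
    "application (Bloom 3)":         (12, 20, 0.30),
    "analysis/evaluation (Bloom 4)":  (7, 10, 0.20),
    "creation (Bloom 5)":             (6, 10, 0.20),
}

final = sum(score / maximum * weight for score, maximum, weight in sections.values())
print(f"Weighted result: {final:.0%}")   # the 50 knowledge marks count for 30%, not 50%
```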
Assessing the Curriculum
Assessments are our only way of judging the effectiveness of what we have taught.
Students are only in year 8 once. The year 8 curriculum will probably last 5-10 years before it is overhauled.
So, assessing the curriculum is even more important than giving feedback to the students.
So, we need to train departments in how to use the results of the assessments, and the five parts of those assessments (one for each level of the taxonomy), to judge how well the curriculum is doing what we want it to.
Then we need to train departments in how they can collaborate so that the most successful teachers are identified, and their expertise in delivering the curriculum is shared.
This will also reveal the teachers who are less successful, so we also need to train departments on how to do this collaboratively and supportively, rather than exposing those teachers.
Feedback of Next Steps to the Students
Designing assessment questions of the five types makes it very easy for teachers and students to understand where a student most needs to improve.
However, reporting this is an administrative burden, and this has to be weighed up against the most likely benefits, not the theoretical benefits.
Senior leaders are likely to be attracted to a model which says:
Based on questions 43, 47, and 50, we can see that you need to check your understanding of the properties of willow, and the successful treatment of it before use.
However, it is highly likely that this student also scored terribly on the knowledge questions 1-40. The feedback should be "you just don't know enough", not "you need to know more about planting, harvesting, preparing cut willow, seasonality, the properties of dried willow, the properties of growing willow" and so on.
It is also highly unlikely that this student will be able to weave any kind of basket, so the best feedback is likely to be: you just don’t know enough.
If they can weave a basket, but imperfectly, the feedback should be: these are the 3 things you need to be better at, in order to weave a good basket.
In other words, the fine-grained feedback promised by question-level analysis is, I think, a distraction. It takes up far too much of the teacher's time, and its value to the student is small.
This obviously raises a further question: even if our feedback is perfect, how will we make sure students actually improve?
So we need to train teams on:
How to give meaningful feedback after an assessment so that it has the most impact on what students can produce.
How to structure the curriculum following an assessment so that students produce something better.
How Will Assessments Lead to Students Knowing and Remembering More?
Departments need to think about the testing effect and spaced learning.
The timing of a knowledge-based assessment should be based on the forgetting curve. We know that this means the gaps between testing a specific piece of knowledge should roughly double. A student learning personification in September of year 7 should be tested on it a week later, then two weeks after that, both still in September. Then a month later, at the end of October. Then two months later, at the end of December. Then four months later, in April. Then roughly eight months later, in November of year 8. Then roughly sixteen months later, in the spring of year 9.
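A quick way to sanity-check such a schedule is to compute it. The sketch below assumes an illustrative first-teaching date of 1 September and a first retrieval check one week later, then doubles the gap each time.

```python
from datetime import date, timedelta

# Doubling review gaps, assuming the item is first taught on 1 September
# of year 7 and the first retrieval check comes a week later.
first_taught = date(2025, 9, 1)   # illustrative start date
gap = timedelta(weeks=1)

review = first_taught
for n in range(1, 8):             # gaps of 1, 2, 4, 8, 16, 32, 64 weeks
    review += gap
    print(f"review {n}: {review.isoformat()}")
    gap *= 2
```

Doubling from one week like this, the seventh review falls roughly two and a half years after first teaching.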
If you are assessing only three times a year, a schedule like this would be impossible.
But if you separate assessment of knowledge and perhaps understanding – levels 1 and 2 of Bloom's – from assessments testing levels 3-5, it is much easier to achieve. For example, you might have multiple-choice and gap-fill assessments every month, to include prior learning like the example above.
These scores could be averaged to arrive at the knowledge and understanding score for your assessment point, and combined with the score from the termly assessment which tests application, analysis, evaluation and creation.
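For instance, a minimal sketch of that combination, with invented percentages and an assumed 40/60 split between the two strands (the split itself is a department's call):

```python
import statistics

monthly_knowledge_checks = [72, 65, 80]   # monthly multiple-choice / gap-fill scores (%)
applied_assessment = 58                   # termly task covering Bloom levels 3-5 (%)

knowledge_score = statistics.mean(monthly_knowledge_checks)
assessment_point = 0.4 * knowledge_score + 0.6 * applied_assessment

print(f"Knowledge & understanding: {knowledge_score:.0f}%")
print(f"Assessment point:          {assessment_point:.0f}%")
```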
A further consideration is reporting dates. Departments need to be trained to calendar their assessments so that a teacher who teaches several year groups is never marking the assessments of more than one year group at a time.
So, we need to train departments on how to calendar assessments both to exploit the forgetting curve and to minimise workload.
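A draft calendar can be checked for marking clashes mechanically. The sketch below uses made-up teacher initials, year-group allocations and assessment weeks, and flags any teacher whose year groups are assessed in the same week.

```python
from collections import defaultdict

# Made-up teacher initials, year-group allocations and assessment weeks.
teacher_years = {
    "AB": [7, 8, 10],
    "CD": [8, 9],
    "EF": [7, 11],
}
assessment_week = {7: 12, 8: 12, 9: 14, 10: 16, 11: 18}   # draft calendar (week numbers)

marking = defaultdict(list)
for teacher, years in teacher_years.items():
    for year in years:
        marking[(teacher, assessment_week[year])].append(year)

for (teacher, week), years in sorted(marking.items()):
    if len(years) > 1:
        print(f"Clash: {teacher} marks years {years} in week {week}")
# With this draft, AB would be marking years 7 and 8 in the same week.
```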
Leading on Assessment
Look at the calendar of middle leader meetings.
Work out what dates each of these 17 training points will be brought to the team leaders. Calendar these.
Work out which of these will require team leaders to bring examples of knowledge organisers, knowledge assessments, Bloom's assessments, their recording of results to give meaningful feedback, their department's approach to feeding back after an assessment, and so on. Calendar these.
Share the calendar with middle leaders so that they know what is coming and when.