Lost in the Supermarket
This week I was in a branch of a major supermarket trying to find some new swimming shorts for my 2-year-old son. Amongst the cartoon-violence-film-franchise trunks and unicorn-loveheart child bikinis I chanced upon three pairs of day-glo knee-length shorts – one pink, one yellow and one green, each in eye-watering neon with a black stripe across them. The vividness of the solid blocks of acid colour pitched me straight back to the summers of my youth. I would have seriously coveted these shorts in my pre-teen years and I felt a pang of nostalgia for a time when I didn’t yet know that the answers to life’s great mysteries were, very often, themselves mysterious.
It turns out that I am not the only one who has been wondering what it would be like to return to 1991…
An Introduction From a Trusted Friend
The Historical Association recently told me that they were “pleased” to bring me a message sent via email from the publishing/examination behemoth Pearson touting a tool to help me “find new ways to track and report on your students’ progress in History after the removal of National Curriculum Levels”.
However, when I looked at what was on offer my heart sank like Cambridge United’s dreams of winning the Second Division playoffs. 
Let me explain my disappointment. The package on offer claimed it would allow me access to a ‘Progression Map’ that “builds on our 12 step scale, breaking down the curriculum and providing clear progress descriptors, prior knowledge requirements and boosters for additional challenge.”
At first glance this might look like Pearson have just replaced the National Curriculum level descriptors with Pearson-defined level descriptors. However, unlike the National Curriculum level descriptors, the Pearson ‘Steps’ are designed to be applied to individual pieces of work and are designed to be divided ‘horizontally’ to allow fine grading.
Also, unlike every National Curriculum since 1991 that has had level descriptors that deliberately interwove their different elements, Pearson has taken the trouble of dividing the descriptors ‘vertically’ as well; divorcing ‘Cause and Consequence’, ‘Change and Continuity’, ‘Evidence’, ‘Interpretations’, ‘Structuring and Organising Knowledge’, ‘Using historical vocabulary’ and ‘Chronological Understanding’ into different ‘sub-strands’. This would, I suppose, allow you to clearly separate your assessment of a student’s understanding of ‘Using historical vocabulary’ from their understanding of ‘Cause and Consequence’ etc.
However, there is more. In order to keep track of my students’ progress through this 12-step programme they had, “developed a straightforward, time-saving and reliable approach to monitor learning throughout KS3 and KS4.”
What Pearson is selling is a re-write of the 1991 National Curriculum and an Excel spreadsheet.
Why I Won’t Be Buying
It is sad that an international organisation such as Pearson with its abundance of resources and huge influence is peddling something that is so conceptually flawed. The criticisms levelled at the (mis-)use of the NC Levels are exactly applicable to this system and while retro seems always to be the order of the day, the memories of those arguments are too fresh and the scars too raw to revive them here. However, what is worth saying is that Pearson are selling this conceptually-flawed product without having taken the trouble to even address the flaws in its execution.
It would be churlish to cherry-pick and isolate phrases from the Step Sub-Strand descriptors to challenge or question. I’m not going to spend time here making jokes about flux-capacitors and the phrase “Learners are able to manoeuvre within their own chronological framework with ease”. I’m not going to ask you whether “starting to make judgements about sources and how they can be used for a specified enquiry” is more or less difficult than making “supported inferences about the past by using a source and the detail contained within it.” Nor am I going to point out that a huge publishing house has published documents that use both spellings of judg(e)ment on the same page. It is precisely because writing these generalised descriptors is so hard that their creation is meaningless. It is for precisely these reasons that the application of these generalised statements to individual pieces of work is meaningless.
I would, however, like to draw your attention to the Baseline Test that Pearson invites teachers to set Year 7 after a brief topic on the Norman Conquest.
The idea of a baseline test for the beginning of a Key Stage is a good one – it gives the teacher some idea of the strengths and weaknesses of their students. It can be used to tailor support, intervention, extension etc. etc. With appropriate caveats, it is not unreasonable to compare the results of this test with later ones to help inform some judgements about students’ progress and, perhaps, the efficacy of some aspects of a teacher’s performance.
However, in order that a baseline test is effective it must be a fair test. The Pearson Year 7 Baseline test is flawed in many, many ways: 
3 – Are ‘Romans’ an era? Shouldn’t this read ‘Roman Britain’, ‘the Roman period’, ‘the Roman era’, ‘The period of the ascendency of Romano-British culture in South Eastern England’…?
4 – If ‘The Dark Ages’ is an era, don’t at least two of these labels also require the definite article?
5 – An emperor or empress would be the ruler of an empire (and an Empire?) and be just as much of a monarch as a king or queen.
6 – I looked at the mark scheme and realised that I got this one wrong.
7a – It would not be unreasonable to describe a way of explaining a set of historical facts as a ‘cause’. A cause is identified (constructed?) by a historian, therefore it is a way of explaining historical facts. An ‘interpretation’ is not a way of explaining historical facts; it is a construction made from (the selection of those things that the historian determines are pertinent) facts.
7b – Interpretations happen because of something else. That is in the nature of ‘interpretations’ of history: a historian’s Marxist beliefs will cause them to have a Marxist interpretation etc. Long-term causes of historical events are, in turn, caused by other things.
7c – A short-term cause of William’s victory at Hastings didn’t happen a short while ago. Things that happened a short while ago and had an impact can also be consequences of something else.
8 – I’m not even going to start to pretend that I understand the subtleties of what (bastard) feudalism is/was/whether it ever existed… but I do know that it would be perfectly reasonable to offer, “Because it wasn’t a feudal society,” as an answer to 8b. Would this count as an explanation?
10 – Wouldn’t it be more useful to phrase the question as the difference between what the historians are saying in Interpretations 1 and 2?
13 – This implies that the historian’s questioning itself is evidence of why William won as if William ushered in a new era of evidential thinking in the discipline of history. I think they mean ‘usefulness’.
Does It Really Matter?
Okay, so some of the questions are clumsy in their execution and some suggest some clumsy thinking. Again, this wouldn’t be terrible if you cooked this up with a colleague in the last week of term because you needed an end-of-year test but if you are one of the world’s largest educational publishers it’s probably a bit embarrassing. However, I would argue (and I know that some friends and colleagues will roll their eyes at this point and suggest that I have spent too long in the company of Mr. Hyperbole) that, much more importantly, the system that supports this test is dangerous and unhelpful.
It is not unreasonable to give numerical scores to questions on a history test. What is unreasonable is to use those numbers to draw unsupportable conclusions.
According to Pearson’s Baseline Test Markbook, all elements of question 7 are at a Step 4 level of difficulty but each answer is worth only 1 mark. This is the same value as question 1 which is only rated as Step 3 level. This happens all over the test and this causes problems.
While I appreciate that the screenshot of the fake data in the markbook is probably illegible, please take my word that students Joseph Bloggs and Anne Nother have the same overall score: Step 2 Developing. This is despite the fact that Joseph Bloggs had failed to get right any of the simpler questions (i.e. those rated at Step 2) but aced those rated 4 and got somewhere with those rated 7. Anne Nother got all of the simpler questions right but did less well on the more difficult ones. Are these students at the same level?
Well, yes and no. The data about how each student performed on each individual question is interesting and can be useful and pertinent. However, this system is designed to smooth out all of the nuance and produce a summative grade. This in itself is still not necessarily a problem. So long as everybody is clear that the grade given refers only to the performance of that student on that day on that particular test, this average can have some meaning. However, the problem is that Pearson are implying that the score on that test has some relation to a student’s capacity to perform according to complex level descriptors.
It does not.
The fact that the students scored 13 marks on the test tells us that they got 13 on that test. It is fair to say that Bloggs scored below the class average on that test. It is fair to say that Nother scored 26% on that test. It is fair to say that one of them probably doesn’t know what ‘a decade’ is because they got question 2 wrong.
It is not fair to use that score to describe either of them as Developing Step 2. The score in no way relates to the descriptors. If a student is doing some parts of Step 7 but not Step 2, it doesn’t mean that they are doing the things described in Step 4.
Creating mean averages and then extrapolating judgements about a student’s capabilities is a gross over-simplification and, while it does allow people to generate pretty line graphs, those graphs cannot carry any meaningful information. They generate a lot of noise but very little signal.
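For anyone who would like to see the arithmetic, here is a minimal sketch of how two very different sets of answers collapse into one identical number. Every figure in it is invented for illustration: the marks available at each Step rating and the per-Step marks for each student are my assumptions, not values from Pearson's markbook. They are chosen only so that both students total 13 marks, which is 26% of a hypothetical 50.

```python
# Hypothetical marks available at each Step rating (sums to 50).
# These are invented for illustration, not taken from Pearson's markbook.
AVAILABLE = {2: 8, 4: 5, 7: 37}

# Invented per-Step marks matching the two profiles described above:
# Bloggs misses the simpler questions but does well on harder ones;
# Nother gets the simpler questions right but struggles further up.
SCORES = {
    "Bloggs": {2: 0, 4: 5, 7: 8},
    "Nother": {2: 8, 4: 3, 7: 2},
}


def total(marks):
    """Overall mark: the single number a summative grade is built on."""
    return sum(marks.values())


def percentage(marks):
    """Overall mark as a percentage of all marks available."""
    return 100 * total(marks) / sum(AVAILABLE.values())


def profile(marks):
    """Fraction of the available marks gained at each Step rating."""
    return {step: marks[step] / AVAILABLE[step] for step in AVAILABLE}


for name, marks in SCORES.items():
    print(name, total(marks), percentage(marks), profile(marks))
```

The totals and percentages come out the same for both students, while the per-Step profiles are almost opposites. The summary number simply cannot tell them apart.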
The Illusion of Reliability
So what, you ask? So, the system is imperfect; it’s better than nothing. It gives heads of department/heads of year/heads some rough-and-ready data to help them out. I would strongly argue that it is much, much worse than nothing. The problem lies in the illusion of reliability that numbers give information – if you put together a system that generates numbers, it won’t be long before some idiot assumes that they mean something. After that, it won’t be long before people are judged on whether those numbers appear next to particular students’ names. After that, it won’t be long before sets/rewards/trips/badges or promotions/pay awards/professional reputation/the ability to put food in your child’s mouth are dependent on those numbers. After that, it won’t be long before the stakes for not getting the numbers are so high that teaching is to the test and marking is done with one eye on self-preservation. After that, the numbers obscure the things that they are supposed to be measuring. After that, habit, fear and exhaustion will lead us to a place where we are teaching students how to get numbers rather than get excited about the past.
The abolition of the National Curriculum Level Descriptors has provided us as professionals such a wonderful opportunity. Paying money for a system like Pearson’s Progression Services is just Stockholm Syndrome – a self-defeating desire for the comfort of our previous imprisonment – don’t succumb.
 Okay, so technically that match was in 1992 but that season started in 1991.
 A student can be ‘beginning’ step 4, ‘developing’ step 4, ‘securing’ step 4 or ‘excelling’ step 4… no, hang on… ‘beginning’ to understand step 4, ‘developing’ to understand… no, hang on… ‘beginning’ to perform at step 4-level, ‘developing’ to perform at… no, hang on… ‘securing’ their understanding of step 4 before ‘developing’ their… no, hang on… 4a, 4b, 4c… or was it 4c, 4b, 4a…?
 For example, Burnham and Brown in Teaching History 115 & 157, Fordham in Teaching History Supplement 153, Ofsted in History for All, 2011, Final report of the Commission on Assessment without Levels, 2015.
 I am prepared to admit that at least some of these criticisms verge on pedantry. However, had any of my colleagues suggested these questions I would ask them to consider the following changes. If we expect accuracy and clarity of thought from our students, shouldn’t we expect it from ourselves? However, if you have a low tolerance for smug nit-picking please feel free to skip on to the section entitled “Does It Really Matter?”
 They are developing Step 2? Their understanding of the historical thinking required to achieve Step 2 is developing from slight to comprehensive?? They are developing Step 2 into Step 3???
 Step 2 Descriptors:
Cause and Consequence Step descriptor: Learners show a basic comprehension of causes and understand that things happen in the past for more than one reason. However, they view these relationships as unmoving or definite, i.e. X was always going to cause Y. They may display a simple understanding of consequence.
Change and Continuity Step descriptor: Learners can identify basic differences between our lives and the lives of people in the past, but will often see the present as a time when problems of the past have been solved or sorted out.
Evidence Step descriptor: Learners have a sense that historians need to look at evidence about the past to find out what happened, but they see this evidence as independent and able to speak for itself. For example, they may believe that a report or relic has its own truth without any interrogation.
Interpretations Step descriptor: Learners can decide what they think about the past (e.g. I think that King John was bad) but cannot link this idea to the way in which history is constructed. They may be able to repeat stories that they have been told about the past, but cannot see that these stories are interpretations.
Knowledge Step descriptors: Learners begin to use simple historical terms, such as years, and understand that some things happened a long time ago. However, they are unable to distinguish between different lengths of time. They may be able to talk about periods that they have studied (e.g. Ancient Greeks, Romans) but cannot fit these into their existing knowledge. Learners can remember historical vocabulary with some relevance within a given period (e.g. Roman emperors, Viking longships) but struggle to use it to describe the period or features of the period.
Learners can recount simple stories about the past (e.g. myths, battles) but are unable to move beyond what they have already been told or to combine knowledge together.