How a score is produced
Every answer goes through the same pipeline, the same way each time:
- You answer. You speak or write a response to an exam-style task.
- Your speech is measured. For speaking tasks, your audio is transcribed and analyzed for acoustic features: speaking rate, pauses, rhythm and clarity.
- It's graded on the official rubric. The answer is assessed criterion by criterion using the real exam's rubric: task response, coherence, grammar, vocabulary, and pronunciation or fluency.
- It's mapped to the real scale. The result is placed on the actual exam scale (IELTS bands, TOEFL's 1–6 CEFR-aligned bands, or the Duolingo English Test's 10–160 scale), with a calibration step so the numbers line up with official scoring.
Then you get the score, a breakdown for each criterion, and sentence-by-sentence feedback.
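To make the measurement step more concrete, here is a minimal Python sketch of how pace and pause features could be derived from word-level timestamps. It is an illustration only, not Aflo's production code: the `Word` structure, the `timing_features` function and the 0.25-second pause threshold are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One transcribed word with timestamps in seconds (hypothetical structure)."""
    text: str
    start: float
    end: float


def timing_features(words, pause_threshold=0.25):
    """Compute simple pace and pause features from word timestamps.

    pause_threshold is an assumed cut-off: a silent gap between two
    consecutive words counts as a pause only if it is at least this long.
    """
    if not words:
        return {"words_per_minute": 0.0, "pause_count": 0, "mean_pause_s": 0.0}

    # Total speaking time, from the first word's start to the last word's end.
    duration = words[-1].end - words[0].start

    # Silent gaps between each word and the next one.
    gaps = [nxt.start - cur.end for cur, nxt in zip(words, words[1:])]
    pauses = [g for g in gaps if g >= pause_threshold]

    wpm = len(words) / duration * 60.0 if duration > 0 else 0.0
    mean_pause = sum(pauses) / len(pauses) if pauses else 0.0
    return {
        "words_per_minute": wpm,
        "pause_count": len(pauses),
        "mean_pause_s": mean_pause,
    }
```

A real system would take these timestamps from a speech recognizer and would measure rhythm and clarity with additional signals; this sketch covers only speaking rate and pauses.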
How we measure it
We don't grade by feel. The scorer is benchmarked against official, examiner-scored sample answers published by the test makers — IELTS partners, ETS (TOEFL), and Duolingo. Each one is a real response with the official band or score attached.
We run every sample through the exact production scoring pipeline and compare Aflo's score to the official one, using the same measures examiners use to check agreement:
- MAE (mean absolute error): how far Aflo's score is from the official score, on average.
- Exact and adjacent agreement: how often Aflo lands on the official band, or within one band of it.
- QWK (quadratic weighted kappa): the standard statistic for agreement between two raters.
Every change to the scorer is re-measured against these references and kept only if agreement improves. The benchmark runs the same code that runs in production, so the numbers reflect what you actually get.
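For readers who want to see what these three measures compute, here is a minimal Python sketch using only the standard library. It is not Aflo's benchmark code: the function names, the `tolerance` parameter (what counts as "adjacent") and the `labels` argument (the ordered list of possible scores) are assumptions made for this example.

```python
from collections import Counter


def mae(pred, gold):
    """Mean absolute error between predicted and official scores."""
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(gold)


def agreement_rates(pred, gold, tolerance=1.0):
    """Return (exact, adjacent) agreement as fractions between 0 and 1.

    tolerance is an assumed definition of "adjacent": the prediction is
    within this many score points of the official score.
    """
    n = len(gold)
    exact = sum(p == g for p, g in zip(pred, gold)) / n
    adjacent = sum(abs(p - g) <= tolerance for p, g in zip(pred, gold)) / n
    return exact, adjacent


def quadratic_weighted_kappa(pred, gold, labels):
    """Quadratic weighted kappa over an ordered list of possible labels.

    labels must contain at least two values and every score that occurs.
    """
    index = {label: i for i, label in enumerate(labels)}
    k, n = len(labels), len(gold)

    # Observed counts for each (predicted, official) pair, plus the marginals.
    observed = Counter((index[p], index[g]) for p, g in zip(pred, gold))
    pred_hist = Counter(index[p] for p in pred)
    gold_hist = Counter(index[g] for g in gold)

    num = den = 0.0
    for i in range(k):
        for j in range(k):
            # Disagreements are penalized by their squared distance.
            weight = (i - j) ** 2 / (k - 1) ** 2
            num += weight * observed[(i, j)]
            # Count expected by chance if the two raters were independent.
            den += weight * pred_hist[i] * gold_hist[j] / n
    return 1.0 - num / den if den else 1.0
```

As a toy example, predictions `[6, 7, 5, 8]` against official scores `[6, 6, 5, 7]` with labels `[5, 6, 7, 8]` give an MAE of 0.5, exact agreement of 0.5, adjacent agreement of 1.0 and a QWK of 0.75.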
How close it lands
Measured against the official examiner scores:
| Exam or skill | Agreement with official scores |
|---|---|
| IELTS (writing & speaking) | within about 0.5–0.7 of a band, on average |
| Duolingo English Test (speaking) | QWK ≈ 0.90 — strong agreement |
| Pronunciation & fluency | ρ ≈ 0.8 vs human ratings |
On IELTS, that is close to the level of agreement between two trained examiners.
Pronunciation and fluency, grounded in real speech
Speaking isn't graded from the transcript alone. Aflo measures the sound of your speech (pace, pauses, rhythm and clarity), and those measurements are validated against public research corpora of human-rated learner speech: the ICNALE corpus (Kobe University) and SpeechOcean. Aflo's fluency signal correlates with human fluency ratings at about ρ = 0.8.
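The sketch below shows one common way a figure like ρ could be computed: Spearman's rank correlation between an automatic fluency signal and human ratings. The text above does not say which correlation ρ refers to, so treating it as Spearman's is an assumption, and the function names are made up for this example.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(signal, human_ratings):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(signal), average_ranks(human_ratings))
```

For instance, a signal that orders five speakers as `[1, 2, 3, 4, 5]` against human ratings of `[2, 1, 4, 3, 5]` gives ρ = 0.8: the two orderings mostly agree, with two neighboring pairs swapped.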
A score you can act on
A number on its own doesn't help you improve. For every answer, Aflo shows the score for each criterion, then goes sentence by sentence — what's working, what's holding the score down, and how to fix it. You can see exactly why a score is what it is, and what to change to raise it.
Frequently asked questions
What scale does Aflo use?
The real scale for each exam: IELTS bands (0–9), TOEFL's 1–6 CEFR-aligned bands, and the Duolingo English Test's 10–160 scale.
What does the score compare against?
Official sample answers that the test makers published with the examiner's own band or score. Aflo runs each one through the production pipeline and compares its score to the official one.
What is QWK?
Quadratic weighted kappa — the standard statistic for how closely two raters agree, used widely in exam-scoring research. Higher is better; around 0.9 means strong agreement.