Defying Data Scarcity: High-Performance Indonesian Short Answer Grading via Reasoning-Guided Language Model Fine-Tuning
DOI:
https://doi.org/10.62146/ijecbe.v3i3.148

Keywords:
Reasoning-Guided Fine-Tuning, Regularization, Automated Short Answer Grading, Knowledge Distillation, Parameter-Efficient Fine-Tuning, Natural Language Processing, Data Scarcity, Indonesian Language, Educational Technology

Abstract
Automated Short Answer Grading (ASAG) is crucial for scalable feedback, but applying it to low-resource languages such as Indonesian is challenging: modern Large Language Models (LLMs) severely overfit small, specialized educational datasets, limiting their utility. This study compares nine traditional machine learning models against two fine-tuning strategies for Gemma-3-1b-it on an expanded Indonesian ASAG dataset (n = 220): (a) standard fine-tuning, in which the model predicts only scores, and (b) a proposed reasoning-guided approach, in which the model first generates a score rationale obtained via knowledge distillation before predicting the score. The reasoning-guided model (Gemma-3-1b-ASAG-ID-Reasoning) achieved state-of-the-art performance (QWK 0.7791; Spearman's ρ 0.8276), significantly surpassing the best traditional model in this study (SVR, QWK 0.6952). This work advances foundational LSA-based approaches to the task by introducing a more robust methodology and evaluation framework. Crucially, standard fine-tuning (Gemma-3-1b-ASAG-ID) suffered catastrophic overfitting (QWK 0.7279), indicated by near-perfect training scores but markedly weaker test performance. While the reasoning-guided LLM showed superior accuracy, it required over 35 times more inference time. The results demonstrate that distilled reasoning acts as a powerful regularizer, compelling the LLM to learn the underlying grading logic rather than memorizing answer-score pairs, and establish a viable method for high-performance ASAG in data-scarce environments despite the computational trade-offs.
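To make the comparison concrete, the following minimal Python sketch illustrates the two fine-tuning target formats described above and the two reported metrics. It is not the authors' implementation: the prompt wording and field names (question, reference, answer, rationale, score) are hypothetical, and only the metric calls (scikit-learn's quadratically weighted Cohen's kappa and SciPy's Spearman correlation) reflect standard library usage.

    # Minimal sketch of the two fine-tuning target formats compared in the
    # paper and of the reported evaluation metrics. NOT the authors' code:
    # field names and prompt wording are hypothetical illustrations.
    from scipy.stats import spearmanr
    from sklearn.metrics import cohen_kappa_score


    def score_only_example(ex: dict) -> dict:
        """(a) Standard fine-tuning: the model learns to emit the score alone."""
        prompt = (
            f"Question: {ex['question']}\n"
            f"Reference answer: {ex['reference']}\n"
            f"Student answer: {ex['answer']}\n"
            "Score:"
        )
        return {"prompt": prompt, "completion": f" {ex['score']}"}


    def reasoning_guided_example(ex: dict) -> dict:
        """(b) Reasoning-guided fine-tuning: a rationale distilled from a
        larger teacher model precedes the score, acting as a regularizer."""
        prompt = (
            f"Question: {ex['question']}\n"
            f"Reference answer: {ex['reference']}\n"
            f"Student answer: {ex['answer']}\n"
            "Explain your grading, then give the score."
        )
        completion = f" Rationale: {ex['rationale']}\nScore: {ex['score']}"
        return {"prompt": prompt, "completion": completion}


    def evaluate(y_true: list[int], y_pred: list[int]) -> dict:
        """Quadratic Weighted Kappa and Spearman's rho, the two reported metrics."""
        qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
        rho, _ = spearmanr(y_true, y_pred)
        return {"QWK": qwk, "Spearman": rho}


    if __name__ == "__main__":
        # Toy integer scores on a 0-5 scale; the paper's grading scale may differ.
        print(evaluate([5, 3, 0, 4, 2, 1], [4, 3, 1, 4, 2, 0]))

The key design difference is that variant (b) forces the completion to pass through an explicit rationale before the score token, which is the regularization effect the abstract credits for the reduced overfitting.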
License
Copyright (c) 2025 International Journal of Electrical, Computer, and Biomedical Engineering

This work is licensed under a Creative Commons Attribution 4.0 International License.