Defying Data Scarcity: High-Performance Indonesian Short Answer Grading via Reasoning-Guided Language Model Fine-Tuning
DOI:
https://doi.org/10.62146/ijecbe.v3i3.148

Keywords:
Reasoning-Guided Fine-Tuning, Regularization, Automated Short Answer Grading, Knowledge Distillation, Parameter-Efficient Fine-Tuning, Natural Language Processing, Data Scarcity, Indonesian Language, Educational Technology

Abstract
Automated Short Answer Grading (ASAG) is crucial for scalable feedback, but applying it to low-resource languages such as Indonesian is challenging: modern Large Language Models (LLMs) severely overfit small, specialized educational datasets, limiting their utility. This study compares nine traditional machine learning models against two fine-tuning strategies for Gemma-3-1b-it on an expanded Indonesian ASAG dataset (n = 220): (a) standard fine-tuning, in which the model predicts only scores, and (b) a proposed reasoning-guided approach, in which the model first generates a score rationale obtained via knowledge distillation before predicting the score. The reasoning-guided model (Gemma-3-1b-ASAG-ID-Reasoning) achieved state-of-the-art performance (QWK 0.7791; Spearman's ρ 0.8276), significantly surpassing the best traditional model in this study (SVR, QWK 0.6952). This work advances foundational LSA-based approaches to the task by introducing a more robust methodology and evaluation framework. Crucially, standard fine-tuning (Gemma-3-1b-ASAG-ID) suffered catastrophic overfitting (QWK 0.7279), indicated by near-perfect training scores but markedly weaker test performance. While the reasoning-guided LLM showed superior accuracy, it required over 35 times more inference time. The results demonstrate that distilled reasoning acts as a powerful regularizer, compelling the LLM to learn the underlying grading logic rather than memorizing answer-score pairs, and establish a viable method for high-performance ASAG in data-scarce environments despite the computational trade-offs.
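To make the comparison concrete, the following minimal Python sketch illustrates the two fine-tuning target formats described above and the two reported metrics. It is not the authors' implementation: the prompt wording and field names (question, reference, answer, rationale, score) are hypothetical, and only the metric calls (scikit-learn's quadratically weighted Cohen's kappa and SciPy's Spearman correlation) reflect standard library usage.

    # Minimal sketch of the two fine-tuning target formats compared in the
    # paper and of the reported evaluation metrics. NOT the authors' code:
    # field names and prompt wording are hypothetical illustrations.
    from scipy.stats import spearmanr
    from sklearn.metrics import cohen_kappa_score


    def score_only_example(ex: dict) -> dict:
        """(a) Standard fine-tuning: the model learns to emit the score alone."""
        prompt = (
            f"Question: {ex['question']}\n"
            f"Reference answer: {ex['reference']}\n"
            f"Student answer: {ex['answer']}\n"
            "Score:"
        )
        return {"prompt": prompt, "completion": f" {ex['score']}"}


    def reasoning_guided_example(ex: dict) -> dict:
        """(b) Reasoning-guided fine-tuning: a rationale distilled from a
        larger teacher model precedes the score, acting as a regularizer."""
        prompt = (
            f"Question: {ex['question']}\n"
            f"Reference answer: {ex['reference']}\n"
            f"Student answer: {ex['answer']}\n"
            "Explain your grading, then give the score."
        )
        completion = f" Rationale: {ex['rationale']}\nScore: {ex['score']}"
        return {"prompt": prompt, "completion": completion}


    def evaluate(y_true: list[int], y_pred: list[int]) -> dict:
        """Quadratic Weighted Kappa and Spearman's rho, the two reported metrics."""
        qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
        rho, _ = spearmanr(y_true, y_pred)
        return {"QWK": qwk, "Spearman": rho}


    if __name__ == "__main__":
        # Toy integer scores on a 0-5 scale; the paper's grading scale may differ.
        print(evaluate([5, 3, 0, 4, 2, 1], [4, 3, 1, 4, 2, 0]))

The key design difference is that variant (b) forces the completion to pass through an explicit rationale before the score token, which is the regularization effect the abstract credits for the reduced overfitting.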
License
Copyright (c) 2025 International Journal of Electrical, Computer, and Biomedical Engineering

This work is licensed under a Creative Commons Attribution 4.0 International License.