Unpacking the Reading Subtest of ProTEFL: A 4PL IRT Model Study

Lovieanta Arriza, Mark Lester B. Garcia, Aulia Noor Rizarni

Abstract


Reading comprehension is one of the key skills assessed in English language proficiency tests. However, test items used in schools often fail to meet expected psychometric standards, which can compromise the accuracy of ability measurement. This study addresses the issue by evaluating the psychometric characteristics of the reading items in the ProTEFL instrument using an Item Response Theory (IRT) approach, thereby contributing to the development of a high-quality standardized item bank for reading comprehension assessment and highlighting the usefulness of the four-parameter logistic (4PL) model for identifying problematic items. A quantitative descriptive method was employed to analyze 50 multiple-choice reading items from the ProTEFL. Responses from 8,038 test-takers were analyzed by checking sample adequacy, testing the IRT assumptions, selecting the best-fitting model, and estimating item parameters. The results showed that the 4PL model provided the best fit (AIC = 468506.9; BIC = 469905.3; logLik = -234053.4), with 43 items satisfying the assumptions of unidimensionality, local independence, and parameter invariance. About 70% of the items showed good discrimination, item difficulties were well distributed between 0 and 2 logits, and pseudo-guessing values were mostly low, but 33% of the items had upper asymptotes well below 1, indicating that even high-ability test-takers risked answering them incorrectly. These findings underscore the value of the 4PL model for detecting flawed items and improving item quality. The study contributes to the refinement of the ProTEFL reading item bank and offers implications for developing valid and reliable language assessment instruments. Further research is recommended to apply multidimensional IRT models covering English language domains beyond reading.
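For readers unfamiliar with the model referenced above, the minimal Python/NumPy sketch below illustrates the four-parameter logistic (4PL) item response function, with discrimination a, difficulty b, pseudo-guessing (lower asymptote) c, and upper asymptote d, together with the AIC and BIC formulas used for model comparison. The item parameter values and the parameter count are illustrative assumptions, not results taken from the ProTEFL analysis.

```python
import numpy as np

def p_correct_4pl(theta, a, b, c, d):
    """4PL probability of a correct response:
    P(theta) = c + (d - c) / (1 + exp(-a * (theta - b)))."""
    return c + (d - c) / (1.0 + np.exp(-a * (theta - b)))

def aic(log_lik, n_params):
    """Akaike information criterion: AIC = 2k - 2*logLik."""
    return 2 * n_params - 2 * log_lik

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: BIC = k*ln(N) - 2*logLik."""
    return n_params * np.log(n_obs) - 2 * log_lik

if __name__ == "__main__":
    # Illustrative (hypothetical) item: good discrimination, moderate
    # difficulty, low pseudo-guessing, upper asymptote below 1.
    a, b, c, d = 1.4, 1.0, 0.15, 0.90
    for theta in (-2.0, 0.0, 2.0):
        print(f"theta={theta:+.1f}  P(correct)={p_correct_4pl(theta, a, b, c, d):.3f}")

    # Model-selection arithmetic consistent with the abstract, assuming
    # 50 items x 4 free parameters = 200 and N = 8,038 test-takers;
    # reproduces the reported AIC/BIC up to rounding of logLik.
    k, n, log_lik = 50 * 4, 8038, -234053.4
    print(f"AIC={aic(log_lik, k):.1f}  BIC={bic(log_lik, k, n):.1f}")
```

With d = 0.90 in this illustration, even a test-taker two logits above the mean answers the item correctly only about 75% of the time, which is the kind of pattern the abstract flags for roughly a third of the items.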


Keywords


Item Response Theory; Four-Parameter Logistic Model; Reading Comprehension; ProTEFL



DOI: https://doi.org/10.59247/jtped.v2i2.27



Copyright (c) 2025 Lovieanta Arriza, Mark Lester B. Garcia, Aulia Noor Rizarni

This work is licensed under a Creative Commons Attribution 4.0 International License.

