Assessing Variations in Open Datasets for Training Large Language Models: Biases and Benchmarking

Authors

  • Vincent Koc, Hyperthink

Keywords

Open Datasets, Large Language Models, Biases, Benchmarking, Dataset Variations, NLP, Dataset Evaluation

Abstract

Open datasets are critical to the development and training of large language models (LLMs). However, variations in dataset composition often introduce biases that can affect model performance and reliability. This article investigates the nature and extent of these variations, categorizes biases inherent in datasets, and examines their implications for LLM training. We also evaluate benchmarking standards currently employed to measure LLM performance and propose enhancements for a fairer and more inclusive evaluation framework. Through extensive experiments and analyses, we reveal the consequences of dataset heterogeneity and demonstrate practical strategies for mitigating biases. Our findings emphasize the importance of transparent dataset curation and robust benchmarking practices to ensure the ethical development of LLMs.
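The abstract refers to quantifying variation in dataset composition and its connection to bias. The sketch below is an illustrative example only, not a method taken from the paper: it compares the relative frequencies of a small, hypothetical set of occupation terms in two toy text samples using the Jensen-Shannon divergence. The term list, sample texts, and function names are all assumptions made for illustration.

```python
# Illustrative sketch (not from the paper): compare term-frequency
# distributions of two small text samples as a crude proxy for
# differences in dataset composition.
from collections import Counter
import math

def term_distribution(texts, vocab):
    """Relative frequency of each vocabulary term across a list of documents."""
    counts = Counter()
    for doc in texts:
        tokens = doc.lower().split()
        counts.update(tok for tok in tokens if tok in vocab)
    total = sum(counts.values()) or 1
    return {term: counts[term] / total for term in vocab}

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions over the same keys."""
    def kl(a, b):
        return sum(a[k] * math.log2(a[k] / b[k]) for k in a if a[k] > 0 and b[k] > 0)
    m = {k: 0.5 * (p[k] + q[k]) for k in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical probe terms and toy corpus snippets, for illustration only.
vocab = {"nurse", "engineer", "teacher", "doctor"}
dataset_a = ["the nurse and the teacher met the doctor", "an engineer fixed the pump"]
dataset_b = ["the engineer and the doctor reviewed the design", "another engineer joined"]

dist_a = term_distribution(dataset_a, vocab)
dist_b = term_distribution(dataset_b, vocab)
print("JS divergence between term distributions:", round(jensen_shannon(dist_a, dist_b), 4))
```

In an actual study of this kind, such probes would be run over full open corpora rather than toy snippets, and over richer bias dimensions than a single term list.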

Published

2025-01-26
