Comparative Study of Feature Engineering Techniques for Predictive Data Analytics

Authors

  • Lukman Santoso, Universitas Sains dan Teknologi Komputer, Semarang, Indonesia, 50192
  • Priyadi, Universitas Sains dan Teknologi Komputer, Semarang, Indonesia, 50192

DOI:

https://doi.org/10.51903/jtie.v3i2.225

Keywords:

Feature Engineering, Machine Learning, Predictive Analytics

Abstract

In the era of big data, predictive analytics has become central to data-driven decision-making in sectors such as finance, healthcare, and marketing. The effectiveness of a predictive model, however, depends heavily on the quality of the features used to train it. This study evaluates and compares feature engineering techniques for improving the accuracy of predictive models built on the Random Forest (RF) and Extreme Gradient Boosting (XGBoost) algorithms. It follows a quantitative experimental approach, applying SHAP-based feature importance, Principal Component Analysis (PCA), and categorical variable encoding. SHAP-based feature importance yields the best results, with a Mean Squared Error (MSE) of 0.150 and a Root Mean Squared Error (RMSE) of 0.387 for the XGBoost model, compared with an MSE of 0.230 and an RMSE of 0.479 without feature engineering. Combining PCA with encoding also improves performance markedly, to an MSE of 0.160 and an RMSE of 0.400. XGBoost consistently outperforms RF across the testing scenarios. The study contributes recommendations on which feature engineering techniques improve the predictive quality of Machine Learning (ML) models, offers researchers and practitioners guidance for developing more effective feature engineering strategies, and opens opportunities for exploring advanced techniques in more complex data domains.
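The error figures quoted in the abstract can be related to one another with a little arithmetic. The sketch below is illustrative only: the helper names (`mse`, `rmse`, `relative_mse_reduction`) and the scenario labels are assumptions introduced here, not part of the study, and the only data used are the MSE and RMSE values stated in the abstract. It shows how RMSE follows from MSE as its square root and how large each reported MSE reduction is relative to the run without feature engineering.

```python
import math

# MSE and RMSE values exactly as reported in the abstract.
# The scenario labels are shorthand chosen for this sketch.
reported_mse = {
    "no feature engineering": 0.230,
    "SHAP-based feature importance": 0.150,
    "PCA + categorical encoding": 0.160,
}
reported_rmse = {
    "no feature engineering": 0.479,
    "SHAP-based feature importance": 0.387,
    "PCA + categorical encoding": 0.400,
}


def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the MSE, expressed in the units of the target."""
    return math.sqrt(mse(y_true, y_pred))


def relative_mse_reduction(baseline, improved):
    """Fraction by which `improved` lowers the MSE relative to `baseline`."""
    return (baseline - improved) / baseline


baseline = reported_mse["no feature engineering"]
for name, value in reported_mse.items():
    implied_rmse = math.sqrt(value)  # RMSE implied by the reported MSE
    reduction = relative_mse_reduction(baseline, value)
    print(
        f"{name}: MSE={value:.3f}, implied RMSE={implied_rmse:.4f}, "
        f"reported RMSE={reported_rmse[name]:.3f}, "
        f"MSE reduction vs. baseline={reduction:.1%}"
    )
```

Under these definitions, each reported RMSE agrees with the square root of its MSE to within 0.001, and the SHAP-based result corresponds to roughly a one-third reduction in MSE relative to the run without feature engineering.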

Published

2024-08-21

How to Cite

Santoso, L., & Priyadi. (2024). Comparative Study of Feature Engineering Techniques for Predictive Data Analytics. Journal of Technology Informatics and Engineering, 3(2), 417–435. https://doi.org/10.51903/jtie.v3i2.225