Comparative Study of Feature Engineering Techniques for Predictive Data Analytics
DOI:
https://doi.org/10.51903/jtie.v3i2.225
Keywords:
Feature Engineering, Machine Learning, Predictive Analytics
Abstract
In the era of big data, predictive analytics has become a crucial tool for data-driven decision-making in sectors such as finance, healthcare, and marketing. The effectiveness of a predictive model, however, depends heavily on the quality of the features used to train it. This study evaluates and compares feature engineering techniques for improving the accuracy of predictive models built with the Random Forest (RF) and Extreme Gradient Boosting (XGBoost) algorithms. It follows a quantitative experimental approach, applying SHAP-based feature importance, Principal Component Analysis (PCA), and categorical variable encoding. SHAP-based feature importance yields the best results, with a Mean Squared Error (MSE) of 0.150 and a Root Mean Squared Error (RMSE) of 0.387 for the XGBoost model, compared with an MSE of 0.230 and an RMSE of 0.479 without feature engineering. Combining PCA with encoding also improves performance markedly, to an MSE of 0.160 and an RMSE of 0.400. XGBoost consistently outperforms RF across the testing scenarios. The contribution of this study is a recommendation of suitable feature engineering techniques for improving the predictive quality of Machine Learning (ML) models. The findings offer guidance to researchers and practitioners developing feature engineering strategies and open opportunities for exploring more advanced techniques in more complex data domains.
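As a reading aid, the short sketch below checks how the reported error figures relate to one another: RMSE is the square root of MSE, and the gain from a technique can be expressed as the relative reduction in MSE against the baseline without feature engineering. Only the six numbers come from the abstract; the dictionary layout and the helper names `rmse_from_mse` and `relative_mse_reduction` are illustrative assumptions and not part of the study's code.

```python
import math

# (MSE, RMSE) pairs as reported in the abstract for the XGBoost model.
# The labels and this data layout are illustrative, not taken from the study.
reported = {
    "no feature engineering": (0.230, 0.479),
    "SHAP-based feature importance": (0.150, 0.387),
    "PCA + categorical encoding": (0.160, 0.400),
}


def rmse_from_mse(mse: float) -> float:
    """RMSE is defined as the square root of MSE."""
    return math.sqrt(mse)


def relative_mse_reduction(mse: float, baseline_mse: float) -> float:
    """Fraction by which MSE drops relative to the baseline."""
    return (baseline_mse - mse) / baseline_mse


baseline_mse = reported["no feature engineering"][0]

for name, (mse, rmse) in reported.items():
    recomputed = rmse_from_mse(mse)
    reduction = relative_mse_reduction(mse, baseline_mse)
    print(
        f"{name}: reported RMSE {rmse:.3f}, "
        f"sqrt(MSE) {recomputed:.3f}, "
        f"MSE reduction vs. baseline {reduction:.1%}"
    )
```

Under these figures, the recomputed RMSE values agree with the reported ones to within 0.001 (the MSE values are themselves rounded to three decimals), and the MSE reductions relative to the baseline are roughly 35% for SHAP-based feature importance and 30% for PCA combined with encoding.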
License
Copyright (c) 2024 Journal of Technology Informatics and Engineering

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

