AN INVESTIGATION INTO THE PERFORMANCES OF SUPERVISED LEARNING ALGORITHMS IN DIFFERENT PHISHING DATASETS
Keywords: Phishing Classification, Phishing Datasets, Cyber Space Attacks, Learning Algorithms
Phishing techniques are employed by attackers to obtain sensitive and confidential information from unsuspecting internet users. To stem the tide of phishing-based attacks in cyberspace, different machine learning techniques have been proposed as better alternatives to signature-based approaches. This study takes a comparative approach to phishing detection, investigating the performance of the Random Forest, K-Nearest Neighbour and ExtraTree algorithms on three different phishing datasets. The first two datasets, which differ in size and features, were obtained from the UCI Machine Learning Repository, while the third was collected from Mendeley. Exploratory Data Analysis was carried out to understand the nature of the features in the three datasets, after which minimal pre-processing was applied to the features. A filter-based feature selection method, the ANOVA F-test, was used to select promising features that can improve the classification performance of the selected learning algorithms. Across all the experimental analyses, Random Forest had the best overall performance of the three classifiers. Moreover, all three algorithms achieved their best average performance across the four evaluation metrics on the dataset provided by Tan (2018). The approach used in this study provides further insight for phishing detection research.
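The filter-based selection step can be illustrated with a minimal sketch of the one-way ANOVA F statistic, computed per feature and used to rank features. This is not the study's code: the function names `anova_f_score` and `select_top_k`, the toy data and the choice of k = 2 are all assumptions made for illustration, and only the Python standard library is used. In practice, a library routine such as scikit-learn's `f_classif` combined with `SelectKBest` would normally take the place of the hand-written scoring.

```python
from statistics import fmean


def anova_f_score(values, labels):
    """One-way ANOVA F statistic for one feature, grouped by class label.

    Assumes at least two classes and more samples than classes.
    """
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)

    n_samples, n_classes = len(values), len(groups)
    grand_mean = fmean(values)

    # Variation of the class means around the overall mean.
    ss_between = sum(
        len(g) * (fmean(g) - grand_mean) ** 2 for g in groups.values()
    )
    # Variation of the samples around their own class mean.
    ss_within = sum(
        (v - fmean(g)) ** 2 for g in groups.values() for v in g
    )

    if ss_within == 0:
        # Feature is constant within each class.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (n_classes - 1)) / (ss_within / (n_samples - n_classes))


def select_top_k(X, y, k):
    """Return (sorted indices of the k highest-scoring features, all scores)."""
    n_features = len(X[0])
    scores = [
        anova_f_score([row[j] for row in X], y) for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k]), scores


# Hypothetical toy data: 6 samples, 3 features, binary label
# (0 = legitimate, 1 = phishing). Not taken from any of the three datasets.
X = [
    [1.0, 5.0, 2.0],
    [1.2, 3.0, 2.5],
    [0.9, 4.0, 1.5],
    [3.0, 4.5, 2.8],
    [3.1, 3.5, 3.0],
    [2.9, 4.0, 3.5],
]
y = [0, 0, 0, 1, 1, 1]

selected, scores = select_top_k(X, y, k=2)
print("Selected feature indices:", selected)
```

A higher F score means the class means of a feature differ more relative to the spread within each class, so the feature is more likely to help a classifier separate phishing from legitimate samples. In the toy data, the second feature has identical class means and is therefore ranked last.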