Enhancing Urdu hate speech detection through differential transfer learning and adaptive loss functions.
Ijaz Hussain, Muhammad Mahr Ali Arshad, Ammara Nawaz Cheema, Ibrahim M Almanjahie
Abstract
Hate speech detection is a challenging task due to complexities such as language ambiguity, limited context, cultural nuances, and situational factors. This challenge is amplified in low-resource languages such as Urdu. Most research on hate speech detection has focused on resource-rich languages such as English, leaving Urdu significantly understudied. This paper presents a novel approach to enhance Urdu hate speech detection by combining differential transfer learning with adaptive loss functions. We use models pre-trained on resource-rich languages to capture semantic features relevant to hate speech, and implement a differential transfer mechanism to adapt these models to the linguistic and cultural characteristics of Urdu. We address cultural and linguistic differences by including culture-specific datasets, using multilingual embeddings, and applying contextualization approaches that account for culturally specific language use. We created a Nastaliq Urdu dataset of 18,058 YouTube comments labeled as hate, offensive, or neutral. To address class imbalance in the dataset, we propose an adaptive loss function that assigns higher penalties to misclassifications of hate speech, thereby improving model sensitivity toward this minority class. Our experiments cover a range of machine learning algorithms, including random forests, support vector machines, decision trees, recurrent neural networks, long short-term memory networks, and transfer learning methods. The results indicate that transfer learning outperforms conventional machine learning and deep learning techniques, improving F1 scores from 81% to above 89%. Notably, our proposed DAmBERT model achieved a weighted F1 score of 91.49% by incorporating pre-trained embeddings, outperforming all other classifiers.
These findings highlight the potential of combining differential transfer learning and customized loss functions to develop robust hate speech detection systems for Urdu.
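The abstract does not give the exact form of the adaptive loss function. As a rough illustration only, the idea of penalizing minority-class errors more heavily can be sketched as a class-weighted cross-entropy, assuming weights inversely proportional to class frequency. The class counts, the function names, and the weighting scheme below are hypothetical and are not taken from the paper.

```python
import math

def inverse_frequency_weights(class_counts):
    """Assumed scheme: weight = total / (num_classes * count), so rarer classes weigh more."""
    total = sum(class_counts.values())
    num_classes = len(class_counts)
    return {label: total / (num_classes * count)
            for label, count in class_counts.items()}

def weighted_cross_entropy(probs, true_label, weights):
    """Cross-entropy for one example, scaled by the weight of its true class."""
    return -weights[true_label] * math.log(probs[true_label])

# Hypothetical label counts for illustration; NOT the paper's dataset statistics.
counts = {"hate": 100, "offensive": 300, "neutral": 600}
weights = inverse_frequency_weights(counts)

# The same predicted probability (0.2) for the true class costs more
# when the true class is the rarer "hate" label.
loss_hate = weighted_cross_entropy(
    {"hate": 0.2, "offensive": 0.3, "neutral": 0.5}, "hate", weights)
loss_neutral = weighted_cross_entropy(
    {"hate": 0.5, "offensive": 0.3, "neutral": 0.2}, "neutral", weights)

print(weights)
print(loss_hate, loss_neutral)
```

Under this assumed scheme, an equally confident mistake on a hate example contributes a larger loss than one on a neutral example, which pushes training toward higher sensitivity on the minority class.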