TY - JOUR
T1 - A Multi-Modal Assessment Framework for Comparison of Specialized Deep Learning and General-Purpose Large Language Models
AU - Nadeem, Mohammad
AU - Sohail, Shahab Saquib
AU - Madsen, Dag Oivind
AU - Alzahrani, Ahmed Ibrahim
AU - Ser, Javier Del
AU - Muhammad, Khan
N1 - Publisher Copyright:
© 2015 IEEE.
PY - 2025
Y1 - 2025
N2 - Recent years have witnessed tremendous advancements in AI tools (e.g., ChatGPT, GPT-4, and Bard), driven by the growing power, reasoning, and efficiency of Large Language Models (LLMs). LLMs have been shown to excel in tasks ranging from poem writing and coding to essay generation and puzzle solving. Despite their proficiency in general queries, specialized tasks such as metaphor understanding and fake news detection often require fine-tuned models, posing a comparison challenge with specialized Deep Learning (DL). We propose an assessment framework to compare task-specific intelligence with general-purpose LLMs on suicide and depression tendency identification. For this purpose, we trained two DL models on a suicide and depression detection dataset and then tested their performance on a test set. Afterward, the same test set was used to evaluate the performance of four LLMs (GPT-3.5, GPT-4, Google Bard, and MS Bing) using four classification metrics. The BERT-based DL model performed best overall, with a testing accuracy of 94.61%, while GPT-4 was the runner-up with an accuracy of 92.5%. The results demonstrate that LLMs do not outperform specialized DL models but achieve comparable performance, making them a decent option for downstream tasks without specialized training. However, LLMs outperformed the specialized models on the reduced dataset.
AB - Recent years have witnessed tremendous advancements in AI tools (e.g., ChatGPT, GPT-4, and Bard), driven by the growing power, reasoning, and efficiency of Large Language Models (LLMs). LLMs have been shown to excel in tasks ranging from poem writing and coding to essay generation and puzzle solving. Despite their proficiency in general queries, specialized tasks such as metaphor understanding and fake news detection often require fine-tuned models, posing a comparison challenge with specialized Deep Learning (DL). We propose an assessment framework to compare task-specific intelligence with general-purpose LLMs on suicide and depression tendency identification. For this purpose, we trained two DL models on a suicide and depression detection dataset and then tested their performance on a test set. Afterward, the same test set was used to evaluate the performance of four LLMs (GPT-3.5, GPT-4, Google Bard, and MS Bing) using four classification metrics. The BERT-based DL model performed best overall, with a testing accuracy of 94.61%, while GPT-4 was the runner-up with an accuracy of 92.5%. The results demonstrate that LLMs do not outperform specialized DL models but achieve comparable performance, making them a decent option for downstream tasks without specialized training. However, LLMs outperformed the specialized models on the reduced dataset.
KW - assessment framework
KW - deep learning
KW - generative artificial intelligence
KW - Large language models
UR - https://www.scopus.com/pages/publications/85217059020
U2 - 10.1109/TBDATA.2025.3536937
DO - 10.1109/TBDATA.2025.3536937
M3 - Article
AN - SCOPUS:85217059020
SN - 2332-7790
VL - 11
SP - 1001
EP - 1012
JO - IEEE Transactions on Big Data
JF - IEEE Transactions on Big Data
IS - 3
ER -