TY - JOUR
T1 - MLm5C
T2 - A high-precision human RNA 5-methylcytosine sites predictor based on a combination of hybrid machine learning models
AU - Kurata, Hiroyuki
AU - Harun-Or-Roshid, Md
AU - Mehedi Hasan, Md
AU - Tsukiyama, Sho
AU - Maeda, Kazuhiro
AU - Manavalan, Balachandran
N1 - Publisher Copyright: © 2024 Elsevier Inc.
PY - 2024/7/1
Y1 - 2024/7/1
N2 - RNA modification serves as a pivotal component in numerous biological processes. Among the prevalent modifications, 5-methylcytosine (m5C) significantly influences mRNA export, translation efficiency, and cell differentiation and is also associated with human diseases, including Alzheimer's disease, autoimmune disease, cancer, and cardiovascular diseases. Identification of m5C is critical for understanding RNA modification mechanisms and the epigenetic regulation of associated diseases. However, large-scale experimental identification of m5C presents significant challenges because it is labor-intensive and time-consuming. Several computational tools based on machine learning have been developed to supplement experimental methods, but they lack accuracy and efficiency in identifying these sites. In this study, we introduce a new predictor, MLm5C, for precise prediction of m5C sites using sequence data. Briefly, we evaluated eleven RNA sequence-derived features with four basic machine learning algorithms to generate baseline models. We ranked the resulting 44 models by performance and stacked the top 20 baseline models to form the best model, named MLm5C. MLm5C outperformed the state-of-the-art predictors. Notably, optimizing the sequence length surrounding the modification sites significantly improved prediction performance. MLm5C is an invaluable tool for accelerating the detection of m5C sites within the human genome, thereby facilitating the characterization of their roles in post-transcriptional regulation.
AB - RNA modification serves as a pivotal component in numerous biological processes. Among the prevalent modifications, 5-methylcytosine (m5C) significantly influences mRNA export, translation efficiency, and cell differentiation and is also associated with human diseases, including Alzheimer's disease, autoimmune disease, cancer, and cardiovascular diseases. Identification of m5C is critical for understanding RNA modification mechanisms and the epigenetic regulation of associated diseases. However, large-scale experimental identification of m5C presents significant challenges because it is labor-intensive and time-consuming. Several computational tools based on machine learning have been developed to supplement experimental methods, but they lack accuracy and efficiency in identifying these sites. In this study, we introduce a new predictor, MLm5C, for precise prediction of m5C sites using sequence data. Briefly, we evaluated eleven RNA sequence-derived features with four basic machine learning algorithms to generate baseline models. We ranked the resulting 44 models by performance and stacked the top 20 baseline models to form the best model, named MLm5C. MLm5C outperformed the state-of-the-art predictors. Notably, optimizing the sequence length surrounding the modification sites significantly improved prediction performance. MLm5C is an invaluable tool for accelerating the detection of m5C sites within the human genome, thereby facilitating the characterization of their roles in post-transcriptional regulation.
KW - Baseline model
KW - Bioinformatics
KW - RNA 5-methylcytosine
KW - Sequence analysis
KW - Sequential forward search
UR - https://www.scopus.com/pages/publications/85192860237
U2 - 10.1016/j.ymeth.2024.05.004
DO - 10.1016/j.ymeth.2024.05.004
M3 - Article
C2 - 38729455
AN - SCOPUS:85192860237
SN - 1046-2023
VL - 227
SP - 37
EP - 47
JO - Methods
JF - Methods
ER -