TY - JOUR
T1 - Empirical Comparison and Analysis of Web-Based DNA N4-Methylcytosine Site Prediction Tools
AU - Manavalan, Balachandran
AU - Hasan, Md Mehedi
AU - Basith, Shaherin
AU - Gosu, Vijayakumar
AU - Shin, Tae Hwan
AU - Lee, Gwang
N1 - Publisher Copyright: © 2020 The Author(s)
PY - 2020/12/4
Y1 - 2020/12/4
AB - DNA N4-methylcytosine (4mC) is a crucial epigenetic modification involved in various biological processes. Accurate genome-wide identification of these sites is critical for improving our understanding of their biological functions and mechanisms. As experimental methods for 4mC identification are tedious, expensive, and labor-intensive, several machine learning-based approaches have been developed for genome-wide detection of such sites in multiple species. However, the predictions projected by these tools are difficult to quantify and compare. To date, no systematic performance comparison of 4mC tools has been reported. The aim of this study was to compare and critically evaluate 12 publicly available 4mC site prediction tools according to species specificity, based on a huge independent validation dataset. The tools 4mCCNN (Escherichia coli), DNA4mC-LIP (Arabidopsis thaliana), iDNA-MS (Fragaria vesca), DNA4mC-LIP and 4mCCNN (Drosophila melanogaster), and four tools for Caenorhabditis elegans achieved excellent overall performance compared with their counterparts. However, none of the existing methods was suitable for Geoalkalibacter subterraneus, Geobacter pickeringii, and Mus musculus, thereby limiting their practical applicability. Model transferability to five species and non-transferability to three species are also discussed. The presented evaluation will assist researchers in selecting appropriate prediction tools that best suit their purpose and provide useful guidelines for the development of improved 4mC predictors in the future.
KW - Bioinformatics
KW - DNA N4-methylcytosine site
KW - Epigenetic modification
KW - Sequence-based features
KW - Systematic evaluation
KW - machine learning
KW - performance validation
KW - prediction model
KW - sequence analysis
KW - web server
UR - https://www.scopus.com/pages/publications/85092368140
U2 - 10.1016/j.omtn.2020.09.010
DO - 10.1016/j.omtn.2020.09.010
M3 - Article
AN - SCOPUS:85092368140
SN - 2162-2531
VL - 22
SP - 406
EP - 420
JO - Molecular Therapy Nucleic Acids
JF - Molecular Therapy Nucleic Acids
ER -