MLSeq: Machine learning interface for RNA-sequencing data

dc.authoridZARARSIZ, GOKMEN/0000-0001-5801-1835
dc.authoridZararsız, Gözde/0000-0002-5495-7540
dc.authoridZararsız, Gökmen/0000-0001-5801-1835
dc.authoridGOKSULUK, DINCER/0000-0002-2752-7668
dc.authoridOzturk, Ahmet/0000-0002-7130-5624
dc.authoridKorkmaz, Selçuk/0000-0003-4632-6850
dc.authoridOZCETIN, Erdener/0000-0002-6079-3159
dc.authorwosidZARARSIZ, GOKMEN/ABH-7959-2020
dc.authorwosidELDEM, VAHAP/A-9160-2018
dc.authorwosidÖzçetin, Erdener/JCE-4183-2023
dc.authorwosidZararsız, Gözde/AAH-2073-2019
dc.authorwosidZararsız, Gökmen/E-8818-2013
dc.authorwosidGOKSULUK, DINCER/E-9175-2013
dc.authorwosidÖzçetin, Erdener/AAP-8037-2021
dc.contributor.authorGoksuluk, Dincer
dc.contributor.authorZararsiz, Gokmen
dc.contributor.authorKorkmaz, Selcuk
dc.contributor.authorEldem, Vahap
dc.contributor.authorZararsiz, Gozde Erturk
dc.contributor.authorOzcetin, Erdener
dc.contributor.authorOzturk, Ahmet
dc.date.accessioned2024-06-12T11:13:43Z
dc.date.available2024-06-12T11:13:43Z
dc.date.issued2019
dc.departmentTrakya Üniversitesien_US
dc.description.abstractBackground and Objective: In the last decade, RNA-sequencing technology has become the method of choice, preferred over microarray technology for gene-expression-based classification and differential expression analysis because it produces less noisy data. Although many algorithms have been proposed for microarray data, few algorithms and programs are available for classifying RNA-sequencing data. For this reason, we developed MLSeq to bring together both frequently used classification algorithms and novel approaches and to make them available for classifying RNA-sequencing data. The package is developed in the R language environment and distributed through the BIOCONDUCTOR network. Methods: Classification of RNA-sequencing data is not straightforward because raw data must be preprocessed before downstream analysis. With the MLSeq package, researchers can easily preprocess (normalization, filtering, transformation, etc.) and classify raw RNA-sequencing data using two strategies: (i) applying algorithms proposed directly for the RNA-sequencing data structure, or (ii) transforming RNA-sequencing data to bring it distributionally closer to the microarray data structure and then applying algorithms developed for microarray data. Moreover, we propose novel algorithms through MLSeq, such as voom-based (an acronym for variance modelling at the observational level) nearest shrunken centroids (voomNSC) and diagonal linear discriminant analysis (voomDLDA). Materials: Three real RNA-sequencing datasets (i.e., cervical cancer, lung cancer and aging datasets) were used to evaluate model performance. Poisson linear discriminant analysis (PLDA) and negative binomial linear discriminant analysis (NBLDA) were selected as algorithms based on discrete distributions, and voomNSC, nearest shrunken centroids (NSC) and support vector machines (SVM) were selected as algorithms based on continuous distributions for model comparisons. The algorithms were compared using classification accuracy and sparsity on an independent test set. Results: The algorithms based on discrete distributions performed better on the cervical cancer and aging data, with accuracies above 0.92. On the lung cancer data, most algorithms performed similarly, with accuracies of 0.88, except that SVM achieved an accuracy of 0.94. Our voomNSC algorithm was the sparsest, selecting 2.2% and 6.6% of all features for the cervical cancer and lung cancer datasets, respectively. However, on the aging data, sparse classifiers were not able to select an optimal subset of features. Conclusion: MLSeq is a comprehensive and easy-to-use interface for classification of gene expression data. It allows researchers to perform both preprocessing and classification tasks through a single platform. With this property, MLSeq can be considered a pipeline for the classification of RNA-sequencing data. (C) 2019 Elsevier B.V. All rights reserved.en_US
dc.description.sponsorshipResearch Fund of Erciyes University [TDK-2015-5468]en_US
dc.description.sponsorshipWe would like to thank S. Anders, M. I. Love and the BIOCONDUCTOR team for their useful comments, suggestions and support to improve the usefulness of the package. This study was supported by the Research Fund of Erciyes University [TDK-2015-5468]. We also would like to thank Sumeet Pal Singh for sharing the aging dataset which is used in the illustration of MLSeq package. Finally, we would like to thank anonymous reviewers for their contribution and valuable comments.en_US
dc.identifier.doi10.1016/j.cmpb.2019.04.007
dc.identifier.endpage231en_US
dc.identifier.issn0169-2607
dc.identifier.issn1872-7565
dc.identifier.pmid31104710en_US
dc.identifier.scopus2-s2.0-85064937612en_US
dc.identifier.scopusqualityQ1en_US
dc.identifier.startpage223en_US
dc.identifier.urihttps://doi.org/10.1016/j.cmpb.2019.04.007
dc.identifier.urihttps://hdl.handle.net/20.500.14551/23665
dc.identifier.volume175en_US
dc.identifier.wosWOS:000468033700021en_US
dc.identifier.wosqualityQ1en_US
dc.indekslendigikaynakWeb of Scienceen_US
dc.indekslendigikaynakScopusen_US
dc.indekslendigikaynakPubMeden_US
dc.language.isoenen_US
dc.publisherElsevier Ireland Ltden_US
dc.relation.ispartofComputer Methods And Programs In Biomedicineen_US
dc.relation.publicationcategoryArticle - International Peer-Reviewed Journal - Institutional Faculty Memberen_US
dc.rightsinfo:eu-repo/semantics/closedAccessen_US
dc.subjectRNA-Sequencingen_US
dc.subjectClassificationen_US
dc.subjectNegative Binomialen_US
dc.subjectPoissonen_US
dc.subjectLinear Discriminant Analysisen_US
dc.subjectShrunken Centroidsen_US
dc.subjectSeqen_US
dc.subjectRevealsen_US
dc.subjectPackageen_US
dc.titleMLSeq: Machine learning interface for RNA-sequencing dataen_US
dc.typeArticleen_US
