A New Method for Short Text Compression

dc.authorid: ASLANYUREK, MURAT/0000-0002-3296-4395
dc.authorid: Mesut, Altan/0000-0002-1477-3093
dc.authorwosid: ASLANYUREK, Murat/HGD-8836-2022
dc.contributor.author: Aslanyurek, Murat
dc.contributor.author: Mesut, Altan
dc.date.accessioned: 2024-06-12T11:07:57Z
dc.date.available: 2024-06-12T11:07:57Z
dc.date.issued: 2023
dc.department: Trakya Üniversitesi [en_US]
dc.description.abstract: Short texts cannot be compressed effectively with general-purpose compression methods. Methods developed to compress short texts often use static dictionaries. Achieving high compression ratios requires a static dictionary suited to the text being compressed, and obtaining such a dictionary is an important problem that needs to be solved. This study proposes a method called WSDC (Word-based Static Dictionary Compression), which can compress short texts at a high ratio, together with a model that uses iterative clustering to create the static dictionaries used by this method. Because the k-Means clustering algorithm is run iteratively according to a set of rules, the number of static dictionaries created can vary. A method called DSWF (Dictionary Selection by Word Frequency) is also presented to determine which of the created dictionaries compresses the source text at the best ratio. Wikipedia article abstracts in six different languages were used as the dataset in the experiments. The developed WSDC method is compared with both general-purpose compression methods (Gzip, Bzip2, PPMd, Brotli and Zstd) and special-purpose methods for compressing short texts (shoco, b64pack and smaz). According to the test results, although WSDC is slower than some other methods, it achieves the best compression ratios for short texts smaller than 200 bytes and, for short texts smaller than 1000 bytes, better ratios than every other method except Zstd. [en_US]
dc.identifier.doi: 10.1109/ACCESS.2023.3340436
dc.identifier.endpage: 141035 [en_US]
dc.identifier.issn: 2169-3536
dc.identifier.scopus: 2-s2.0-85179820259 [en_US]
dc.identifier.scopusquality: Q1 [en_US]
dc.identifier.startpage: 141022 [en_US]
dc.identifier.uri: https://doi.org/10.1109/ACCESS.2023.3340436
dc.identifier.uri: https://hdl.handle.net/20.500.14551/22253
dc.identifier.volume: 11 [en_US]
dc.identifier.wos: WOS:001127417900001 [en_US]
dc.identifier.wosquality: N/A [en_US]
dc.indekslendigikaynak: Web of Science [en_US]
dc.indekslendigikaynak: Scopus [en_US]
dc.language.iso: en [en_US]
dc.publisher: IEEE-Inst Electrical Electronics Engineers Inc [en_US]
dc.relation.ispartof: IEEE Access [en_US]
dc.relation.publicationcategory: Article - International Refereed Journal - Institutional Faculty Member [en_US]
dc.rights: info:eu-repo/semantics/openAccess [en_US]
dc.subject: Machine Learning [en_US]
dc.subject: Text Categorization [en_US]
dc.subject: Text Compression [en_US]
dc.subject: K-Means [en_US]
dc.subject: Clustering [en_US]
dc.subject: Language Identification [en_US]
dc.title: A New Method for Short Text Compression [en_US]
dc.type: Article [en_US]

Files