A New Method for Short Text Compression

Aslanyurek, Murat; Mesut, Altan

dc.authorid	ASLANYUREK, MURAT/0000-0002-3296-4395
dc.authorid	Mesut, Altan/0000-0002-1477-3093
dc.authorwosid	ASLANYUREK, Murat/HGD-8836-2022
dc.contributor.author	Aslanyurek, Murat
dc.contributor.author	Mesut, Altan
dc.date.accessioned	2024-06-12T11:07:57Z
dc.date.available	2024-06-12T11:07:57Z
dc.date.issued	2023
dc.department	Trakya Üniversitesi	en_US
dc.description.abstract	Short texts cannot be compressed effectively with general-purpose compression methods. Methods developed to compress short texts often use static dictionaries. To achieve high compression ratios, the static dictionary must suit the text to be compressed, and choosing such a dictionary is an important problem to solve. This study proposes a method called WSDC (Word-based Static Dictionary Compression), which compresses short texts at a high ratio, together with a model that uses iterative clustering to create the static dictionaries used by the method. The number of static dictionaries created can vary, since the k-Means clustering algorithm is run iteratively according to a set of rules. A method called DSWF (Dictionary Selection by Word Frequency) is also presented to determine which of the created dictionaries compresses the source text at the best ratio. Wikipedia article abstracts in 6 different languages were used as the dataset in the experiments. WSDC is compared with both general-purpose compression methods (Gzip, Bzip2, PPMd, Brotli and Zstd) and special-purpose methods for short text compression (shoco, b64pack and smaz). According to the test results, although WSDC is slower than some other methods, it achieves the best compression ratios for short texts smaller than 200 bytes, and for short texts smaller than 1000 bytes it outperforms every other method except Zstd.	en_US
dc.identifier.doi	10.1109/ACCESS.2023.3340436
dc.identifier.endpage	141035	en_US
dc.identifier.issn	2169-3536
dc.identifier.scopus	2-s2.0-85179820259	en_US
dc.identifier.scopusquality	Q1	en_US
dc.identifier.startpage	141022	en_US
dc.identifier.uri	https://doi.org/10.1109/ACCESS.2023.3340436
dc.identifier.uri	https://hdl.handle.net/20.500.14551/22253
dc.identifier.volume	11	en_US
dc.identifier.wos	WOS:001127417900001	en_US
dc.identifier.wosquality	N/A	en_US
dc.indekslendigikaynak	Web of Science	en_US
dc.indekslendigikaynak	Scopus	en_US
dc.language.iso	en	en_US
dc.publisher	IEEE-Inst Electrical Electronics Engineers Inc	en_US
dc.relation.ispartof	IEEE Access	en_US
dc.relation.publicationcategory	Article - International Refereed Journal - Institutional Faculty Member	en_US
dc.rights	info:eu-repo/semantics/openAccess	en_US
dc.subject	Machine Learning	en_US
dc.subject	Text Categorization	en_US
dc.subject	Text Compression	en_US
dc.subject	K-Means	en_US
dc.subject	Clustering	en_US
dc.subject	Language Identification	en_US
dc.title	A New Method for Short Text Compression	en_US
dc.type	Article	en_US

Collection

WoS Indexed Publications Collection
Scopus Indexed Publications Collection
