A New Method for Short Text Compression

Aslanyurek, Murat; Mesut, Altan

Date

2023

Authors

Aslanyurek, Murat

Mesut, Altan

Publisher

IEEE-Inst Electrical Electronics Engineers Inc

Access Rights

info:eu-repo/semantics/openAccess

Abstract

Short texts cannot be compressed effectively with general-purpose compression methods. Methods developed for short texts often rely on static dictionaries, and achieving a high compression ratio depends on using a static dictionary that suits the text being compressed. This study proposes a method called WSDC (Word-based Static Dictionary Compression), which compresses short texts at a high ratio, together with a model that uses iterative clustering to create the static dictionaries the method relies on. The number of static dictionaries created can vary, because the k-Means clustering algorithm is run iteratively according to a set of rules. A method called DSWF (Dictionary Selection by Word Frequency) is also presented to determine which of the created dictionaries compresses the source text at the best ratio. Wikipedia article abstracts in six languages were used as the dataset in the experiments. WSDC is compared with general-purpose compression methods (Gzip, Bzip2, PPMd, Brotli and Zstd) and with methods designed specifically for short texts (shoco, b64pack and smaz). According to the test results, although WSDC is slower than some of the other methods, it achieves the best compression ratios for short texts smaller than 200 bytes, and better ratios than every other method except Zstd for short texts smaller than 1000 bytes.
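
The general idea behind word-based static dictionary compression and frequency-based dictionary selection can be illustrated with a minimal sketch. This is not the authors' WSDC or DSWF implementation, whose details are not given in this record; the function names (`build_dictionary`, `select_dictionary`, `encode`, `decode`), the coverage-count selection rule, and the token-list output format are all assumptions chosen for illustration. The k-Means clustering step is omitted, and the dictionaries are assumed to be given.

```python
from collections import Counter

def build_dictionary(texts, size=4):
    """Hypothetical helper: keep the `size` most frequent words of a sample corpus.
    A word's position in the returned list serves as its code."""
    counts = Counter(word for text in texts for word in text.split())
    return [word for word, _ in counts.most_common(size)]

def select_dictionary(text, dictionaries):
    """Assumed selection rule: pick the dictionary whose vocabulary
    covers the most word occurrences in the source text."""
    words = text.split()

    def coverage(name):
        vocab = set(dictionaries[name])
        return sum(1 for word in words if word in vocab)

    return max(dictionaries, key=coverage)

def encode(text, dictionary):
    """Replace each dictionary word with its integer index;
    words not in the dictionary are kept as literal strings."""
    index = {word: i for i, word in enumerate(dictionary)}
    return [index.get(word, word) for word in text.split()]

def decode(tokens, dictionary):
    """Invert encode(). Exact only for text whose words are separated
    by single spaces, since split() discards the original whitespace."""
    return " ".join(dictionary[t] if isinstance(t, int) else t for t in tokens)

# Illustrative dictionaries (made up for this sketch, not from the paper).
dictionaries = {
    "en": ["the", "of", "and", "is"],
    "tr": ["ve", "bir", "bu", "ile"],
}
text = "the cat is on the mat"
chosen = select_dictionary(text, dictionaries)
tokens = encode(text, dictionaries[chosen])
restored = decode(tokens, dictionaries[chosen])
```

A real codec would additionally pack the indices and literal words into a compact byte stream and record which dictionary was chosen so the decoder can use the same one; the sketch stops at the token level to keep the dictionary lookup and selection steps visible.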

Keywords

Machine Learning, Text Categorization, Text Compression, K-Means, Clustering, Language Identification

Source

IEEE Access

WoS Q Value

N/A

Scopus Q Value

Q1

Volume

11

Link

https://doi.org/10.1109/ACCESS.2023.3340436
https://hdl.handle.net/20.500.14551/22253

Collection

WoS Indexed Publications Collection
Scopus Indexed Publications Collection
