Short text compression using static dictionaries obtained by machine learning

Aslanyürek, Murat

dc.contributor.advisor	Mesut, Altan
dc.contributor.author	Aslanyürek, Murat
dc.date.accessioned	2024-06-11T20:39:22Z
dc.date.available	2024-06-11T20:39:22Z
dc.date.issued	2021
dc.department	Institutes, Institute of Natural Sciences, Department of Computational Sciences	en_US
dc.description	Doctorate	en_US
dc.description.abstract	This thesis proposes the Static Dictionary Compression (SDC) method, which uses static dictionaries to compress short texts at a high ratio, together with a model that performs recursive clustering to create the static dictionaries used by the method. In this model, the number of static dictionaries to be created is determined by a classification algorithm and a set of rules. To identify the most suitable clustering and classification methods for the proposed model, texts consisting of Wikipedia article abstracts in 6 languages were divided into 5 size groups per language. Among the clustering methods tested (BIRCH, k-Means, Average Linkage, Complete Linkage, Single Linkage and Ward), k-Means was found to be the most suitable in terms of both clustering speed and clustering success. It is shown that the compression ratio of the texts can be used to measure clustering performance, and a new clustering performance metric, the Compression Ratio Index (CRI), is proposed. To determine the most suitable language classification method, a range of machine learning methods, the Word Based Statistical Method (WBSM), fasttext and langdetect were tested. The tests showed that WBSM is the most suitable and the fastest classification method for the first stage of the developed compression method, which classifies short texts by language. SDC was compared with the Gzip, Bzip2, Zstd and PPMd data compression methods on datasets made up of the 5 size groups. The effect on the compression ratio of using SDC together with these methods was also investigated. For short texts in the '0-199' and '200-499' byte size groups, SDC gave better compression ratios than the other methods; in the '500-999', '1000-1999' and 'over 2000' size groups, it improved the compression ratio of the other methods when used together with them.
SDC was also compared with shoco, b64pack and smaz, which are methods designed specifically for compressing short texts, with Brotli, a general-purpose compression algorithm that performs well on short texts because it uses a static dictionary, and with a version of Zstd that uses a static dictionary created by training. In terms of compression ratio, SDC outperformed all of these methods except Zstd.	en_US
dc.identifier.endpage	158	en_US
dc.identifier.startpage	1	en_US
dc.identifier.uri	https://tez.yok.gov.tr/UlusalTezMerkezi/TezGoster?key=tqUiYt63sTQLTpozMJ92QvetTV_AlOrd_OmtQ057xIVzvw0ViMmLiOrj_Cbo92IL
dc.identifier.uri	https://hdl.handle.net/20.500.14551/9609
dc.identifier.yoktezid	696090	en_US
dc.institutionauthor	Aslanyürek, Murat
dc.language.iso	tr	en_US
dc.publisher	Trakya Üniversitesi	en_US
dc.relation.publicationcategory	Thesis	en_US
dc.rights	info:eu-repo/semantics/openAccess	en_US
dc.subject	Computer Engineering and Computer Science and Control	en_US
dc.title	Makine öğrenmesi ile elde edilen statik sözlükleri kullanarak kısa metin sıkıştırma	en_US
dc.title.alternative	Short text compression using static dictionaries obtained by machine learning	en_US
dc.type	Doctoral Thesis	en_US
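
The abstract's central claim, that a static dictionary helps most on very short texts, can be illustrated with a minimal sketch. This is not the thesis's SDC method: it swaps in the preset-dictionary feature of Python's standard `zlib` module (the `zdict` argument), and the dictionary contents, the sample text and the helper names `compress_plain`, `compress_with_dict` and `decompress_with_dict` are all hypothetical, chosen only for illustration.

```python
import zlib

# Hypothetical static dictionary: phrases assumed to be frequent in the
# target texts. In the thesis the dictionaries are learned by clustering;
# here the content is simply hard-coded for illustration.
STATIC_DICT = (
    b" is a species of plant in the family"
    b" was born in and died in"
    b" is a city in the north of the country"
)


def compress_plain(text: bytes) -> bytes:
    """Compress with zlib alone (no shared dictionary)."""
    return zlib.compress(text, 9)


def compress_with_dict(text: bytes, zdict: bytes = STATIC_DICT) -> bytes:
    """Compress with zlib primed by a preset (static) dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(text) + comp.flush()


def decompress_with_dict(blob: bytes, zdict: bytes = STATIC_DICT) -> bytes:
    """Decompress; the same dictionary must be available to the decoder."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()


# A short text that shares a long phrase with the dictionary.
sample = b"Edirne is a city in the north of the country"
plain = compress_plain(sample)
primed = compress_with_dict(sample)

# Sizes in bytes: original, zlib alone, zlib with the static dictionary.
print(len(sample), len(plain), len(primed))
```

Because the dictionary is shared in advance rather than stored in the output, the benefit is largest when the input is too short for an adaptive compressor to learn its own statistics, which is consistent with the size groups where the abstract reports SDC doing best.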

Collection

Institute of Natural Sciences Thesis Collection
