An effective and efficient Web content extractor for optimizing the crawling process

dc.authoridUzun, Erdinç/0000-0003-4351-2244
dc.authoridAgun, Hayri Volkan/0000-0002-4253-8920
dc.authoridYerlikaya, Tarik/0000-0002-9888-0151
dc.authoridGUNER, EDIP SERDAR/0000-0002-7284-7513
dc.authorwosidYerlikaya, Tarık/AGP-6489-2022
dc.authorwosidUzun, Erdinç/AAG-5529-2019
dc.authorwosidAgun, Hayri Volkan/P-5002-2019
dc.authorwosidGUNER, EDIP SERDAR/A-1759-2016
dc.contributor.authorUzun, Erdinc
dc.contributor.authorGuener, Edip Serdar
dc.contributor.authorKilicaslan, Yilmaz
dc.contributor.authorYerlikaya, Tarik
dc.contributor.authorAgun, Hayri Volkan
dc.date.accessioned2024-06-12T11:17:14Z
dc.date.available2024-06-12T11:17:14Z
dc.date.issued2014
dc.departmentTrakya Üniversitesien_US
dc.description.abstractClassical Web crawlers make use of only hyperlink information in the crawling process, whereas focused crawlers are intended to download only Web pages that are relevant to a given topic by utilizing word information before downloading the page. Yet Web pages contain additional information that can be useful for the crawling process. We have developed a crawler, iCrawler (intelligent crawler), the backbone of which is a Web content extractor that automatically pulls content out of seven different blocks: menus, links, main texts, headlines, summaries, additional necessaries, and unnecessary texts from Web pages. The extraction process consists of two steps, which invoke each other to obtain information from the blocks. The first step learns which HTML tags refer to which blocks using a decision tree learning algorithm. Guided by these numerous sources of information, the crawler becomes considerably effective, achieving a relatively high accuracy of 96.37% in our block extraction experiments. In the second step, the crawler extracts content from the blocks using string matching functions. These functions, together with the tag-to-block mapping learned in the first step, give iCrawler considerable time and storage efficiency. More specifically, iCrawler performs 14 times faster in the second step than in the first step. Furthermore, iCrawler significantly decreases storage costs, by 57.10%, when compared with the texts obtained through classical HTML stripping. Copyright (c) 2013 John Wiley & Sons, Ltd.en_US
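As a rough illustration of the two-step idea described in the abstract, the Python sketch below (not the authors' implementation) trains a decision tree that maps tag-level features to block labels and then reuses the learned tag-to-block mapping for cheap string-matching extraction on new pages; the feature set, block labels, and training examples are illustrative assumptions.

    # Minimal sketch of the two-step extraction idea; features, labels, and
    # training data are illustrative assumptions, not taken from the paper.
    from html.parser import HTMLParser
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.feature_extraction import DictVectorizer

    # Step 1: learn which HTML elements belong to which blocks with a decision tree.
    train_features = [
        {"tag": "nav", "class": "menu",     "link_density": 0.9},
        {"tag": "a",   "class": "nav-item", "link_density": 1.0},
        {"tag": "h1",  "class": "title",    "link_density": 0.0},
        {"tag": "p",   "class": "body",     "link_density": 0.1},
        {"tag": "p",   "class": "summary",  "link_density": 0.0},
        {"tag": "div", "class": "ads",      "link_density": 0.7},
    ]
    train_labels = ["menu", "links", "headline", "main_text", "summary", "unnecessary"]

    vectorizer = DictVectorizer(sparse=False)
    X = vectorizer.fit_transform(train_features)
    tree = DecisionTreeClassifier().fit(X, train_labels)

    # Cache the learned mapping as plain (tag, class) -> block rules so that the
    # second step never has to re-run the classifier on pages from the same template.
    learned_rules = {
        (f["tag"], f["class"]): label
        for f, label in zip(train_features, tree.predict(X))
    }

    # Step 2: extract block content with cheap string matching against the cached rules.
    class BlockExtractor(HTMLParser):
        def __init__(self, rules):
            super().__init__()
            self.rules = rules
            self.current_block = None
            self.blocks = {}

        def handle_starttag(self, tag, attrs):
            cls = dict(attrs).get("class", "")
            self.current_block = self.rules.get((tag, cls))

        def handle_data(self, data):
            if self.current_block and data.strip():
                self.blocks.setdefault(self.current_block, []).append(data.strip())

    extractor = BlockExtractor(learned_rules)
    extractor.feed('<h1 class="title">iCrawler</h1><p class="body">Main article text.</p>')
    print(extractor.blocks)  # {'headline': ['iCrawler'], 'main_text': ['Main article text.']}

In this sketch, the learned classifier is consulted only once per page template; subsequent pages are processed by dictionary lookups and string matching, which loosely mirrors why the second step can be much faster than the first.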
dc.identifier.doi10.1002/spe.2195
dc.identifier.endpage1199en_US
dc.identifier.issn0038-0644
dc.identifier.issn1097-024X
dc.identifier.issue10en_US
dc.identifier.scopus2-s2.0-84908473420en_US
dc.identifier.scopusqualityQ2en_US
dc.identifier.startpage1181en_US
dc.identifier.urihttps://doi.org/10.1002/spe.2195
dc.identifier.urihttps://hdl.handle.net/20.500.14551/24629
dc.identifier.volume44en_US
dc.identifier.wosWOS:000341875200002en_US
dc.identifier.wosqualityQ3en_US
dc.indekslendigikaynakWeb of Scienceen_US
dc.indekslendigikaynakScopusen_US
dc.language.isoenen_US
dc.publisherWileyen_US
dc.relation.ispartofSoftware-Practice & Experienceen_US
dc.relation.publicationcategoryArticle - International Peer-Reviewed Journal - Institutional Faculty Memberen_US
dc.rightsinfo:eu-repo/semantics/closedAccessen_US
dc.subjectWeb Content Extractionen_US
dc.subjectWeb Crawlingen_US
dc.subjectClassificationen_US
dc.subjectIntelligent Systemsen_US
dc.subjectSearching Strategiesen_US
dc.titleAn effective and efficient Web content extractor for optimizing the crawling processen_US
dc.typeArticleen_US