An effective and efficient Web content extractor for optimizing the crawling process

dc.authoridUzun, Erdinç/0000-0003-4351-2244
dc.authoridAgun, Hayri Volkan/0000-0002-4253-8920
dc.authoridYerlikaya, Tarik/0000-0002-9888-0151
dc.authoridGUNER, EDIP SERDAR/0000-0002-7284-7513
dc.authorwosidYerlikaya, Tarık/AGP-6489-2022
dc.authorwosidUzun, Erdinç/AAG-5529-2019
dc.authorwosidAgun, Hayri Volkan/P-5002-2019
dc.authorwosidGUNER, EDIP SERDAR/A-1759-2016
dc.contributor.authorUzun, Erdinc
dc.contributor.authorGuener, Edip Serdar
dc.contributor.authorKilicaslan, Yilmaz
dc.contributor.authorYerlikaya, Tarik
dc.contributor.authorAgun, Hayri Volkan
dc.date.accessioned2024-06-12T11:17:14Z
dc.date.available2024-06-12T11:17:14Z
dc.date.issued2014
dc.departmentTrakya Üniversitesien_US
dc.description.abstractClassical Web crawlers make use of only hyperlink information in the crawling process, whereas focused crawlers are intended to download only Web pages that are relevant to a given topic by utilizing word information before downloading the page. Yet Web pages contain additional information that can be useful for the crawling process. We have developed a crawler, iCrawler (intelligent crawler), the backbone of which is a Web content extractor that automatically pulls content out of seven different blocks: menus, links, main texts, headlines, summaries, additional necessaries, and unnecessary texts from Web pages. The extraction process consists of two steps, which invoke each other to obtain information from the blocks. The first step learns which HTML tags refer to which blocks using a decision tree learning algorithm. Guided by these numerous sources of information, the crawler becomes considerably effective, achieving a relatively high accuracy of 96.37% in our block extraction experiments. In the second step, the crawler extracts content from the blocks using string matching functions. These functions, together with the tag-to-block mapping learned in the first step, give iCrawler considerable time and storage efficiency. More specifically, iCrawler performs 14 times faster in the second step than in the first step. Furthermore, iCrawler significantly decreases storage costs, by 57.10%, when compared with the texts obtained through classical HTML stripping. Copyright (c) 2013 John Wiley & Sons, Ltd.en_US
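As a rough illustration of the two-step idea described in the abstract, the Python sketch below (not the authors' implementation) trains a decision tree that maps tag-level features to block labels and then reuses the learned tag-to-block mapping for cheap string-matching extraction on new pages; the feature set, block labels, and training examples are illustrative assumptions.

    # Minimal sketch of the two-step extraction idea; features, labels, and
    # training data are illustrative assumptions, not taken from the paper.
    from html.parser import HTMLParser
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.feature_extraction import DictVectorizer

    # Step 1: learn which HTML elements belong to which blocks with a decision tree.
    train_features = [
        {"tag": "nav", "class": "menu",     "link_density": 0.9},
        {"tag": "a",   "class": "nav-item", "link_density": 1.0},
        {"tag": "h1",  "class": "title",    "link_density": 0.0},
        {"tag": "p",   "class": "body",     "link_density": 0.1},
        {"tag": "p",   "class": "summary",  "link_density": 0.0},
        {"tag": "div", "class": "ads",      "link_density": 0.7},
    ]
    train_labels = ["menu", "links", "headline", "main_text", "summary", "unnecessary"]

    vectorizer = DictVectorizer(sparse=False)
    X = vectorizer.fit_transform(train_features)
    tree = DecisionTreeClassifier().fit(X, train_labels)

    # Cache the learned mapping as plain (tag, class) -> block rules so that the
    # second step never has to re-run the classifier on pages from the same template.
    learned_rules = {
        (f["tag"], f["class"]): label
        for f, label in zip(train_features, tree.predict(X))
    }

    # Step 2: extract block content with cheap string matching against the cached rules.
    class BlockExtractor(HTMLParser):
        def __init__(self, rules):
            super().__init__()
            self.rules = rules
            self.current_block = None
            self.blocks = {}

        def handle_starttag(self, tag, attrs):
            cls = dict(attrs).get("class", "")
            self.current_block = self.rules.get((tag, cls))

        def handle_data(self, data):
            if self.current_block and data.strip():
                self.blocks.setdefault(self.current_block, []).append(data.strip())

    extractor = BlockExtractor(learned_rules)
    extractor.feed('<h1 class="title">iCrawler</h1><p class="body">Main article text.</p>')
    print(extractor.blocks)  # {'headline': ['iCrawler'], 'main_text': ['Main article text.']}

In this sketch, the learned classifier is consulted only once per page template; subsequent pages are processed by dictionary lookups and string matching, which loosely mirrors why the second step can be much faster than the first.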
dc.identifier.doi10.1002/spe.2195
dc.identifier.endpage1199en_US
dc.identifier.issn0038-0644
dc.identifier.issn1097-024X
dc.identifier.issue10en_US
dc.identifier.scopus2-s2.0-84908473420en_US
dc.identifier.scopusqualityQ2en_US
dc.identifier.startpage1181en_US
dc.identifier.urihttps://doi.org/10.1002/spe.2195
dc.identifier.urihttps://hdl.handle.net/20.500.14551/24629
dc.identifier.volume44en_US
dc.identifier.wosWOS:000341875200002en_US
dc.identifier.wosqualityQ3en_US
dc.indekslendigikaynakWeb of Scienceen_US
dc.indekslendigikaynakScopusen_US
dc.language.isoenen_US
dc.publisherWileyen_US
dc.relation.ispartofSoftware-Practice & Experienceen_US
dc.relation.publicationcategoryArticle - International Peer-Reviewed Journal - Institutional Faculty Memberen_US
dc.rightsinfo:eu-repo/semantics/closedAccessen_US
dc.subjectWeb Content Extractionen_US
dc.subjectWeb Crawlingen_US
dc.subjectClassificationen_US
dc.subjectIntelligent Systemsen_US
dc.subjectSearching Strategiesen_US
dc.titleAn effective and efficient Web content extractor for optimizing the crawling processen_US
dc.typeArticleen_US