What Is A Crawler or Web Spider?
A crawler (also called a web spider or web crawler) is a program that automatically scans the Web. It is generally designed to collect resources (web pages, images, videos, Word, PDF, or PostScript documents, etc.) so that a search engine can index them. Operating on the same principle, some malicious robots (spambots) are used to archive resources or to collect email addresses to which spam is then sent.
Principles Of Indexing
To index new resources, a crawler recursively follows the links found from a starting page. Afterwards, it is advantageous to store the URL of each resource retrieved and to adapt the frequency of visits to the observed update frequency of the resource. However, many resources escape this recursive exploration, because they are reachable only through hyperlinks generated on demand, which a robot does not find. This set of unexplored resources is sometimes called the deep web.
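The recursive exploration described above can be sketched as a breadth-first traversal. This is only a minimal illustration, not how any particular search engine works: the `crawl` function, the `fetch` callback, and the in-memory `PAGES` dictionary with its `example.test` URLs are all hypothetical stand-ins for real HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from `seed`; `fetch(url)` returns HTML or None."""
    seen = {seed}            # every URL ever queued, so none is fetched twice
    queue = deque([seed])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:     # resource could not be retrieved
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)   # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical in-memory "site" used instead of the network.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="/c">C</a>',
    "http://example.test/c": "no links here",
    "http://example.test/hidden": "reachable only through a search form",
}

order = crawl("http://example.test/", PAGES.get)
print(order)
```

The page at `/hidden` is never visited because no static link points to it, which is the situation the term deep web refers to.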
An exclusion file (robots.txt) placed in the root of a website can give robots a list of resources to ignore. This convention reduces the load on the web server and spares the robot from retrieving uninteresting resources. However, some robots do not respect this file.
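As an illustration, a well-behaved robot can check the exclusion file before each request. The sketch below uses `urllib.robotparser` from the Python standard library; the rules in `ROBOTS_TXT`, the user agent name `ExampleBot`, and the `example.test` URLs are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real robot would download it
# from the root of the site it is about to visit.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="ExampleBot"):
    """Return True if the exclusion rules let `agent` fetch `url`."""
    return parser.can_fetch(agent, url)


print(allowed("http://example.test/index.html"))
print(allowed("http://example.test/private/report.html"))
```

Nothing enforces these rules on the server side: the check is voluntary, which is why a robot that skips it can still retrieve the excluded resources.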
Two features of the web complicate the work of the crawler: the volume of data and bandwidth. The processing and storage capacity of computers and the number of Internet users have grown significantly. Combined with the development of Web 2.0 publishing tools, which allow anyone to easily put content online, this significantly increased the number and complexity of available pages and multimedia items, as well as how often they are modified, during the first decade of the twenty-first century.
The throughput allowed by bandwidth has not grown at an equivalent rate, so the problem is to process an ever-increasing volume of information with relatively limited throughput. Robots therefore need to prioritize their downloads.
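One common way to prioritize downloads is to keep the pending URLs in a priority queue and fetch the highest-scoring ones first. The sketch below is an assumption-laden illustration: the `Frontier` class and the numeric scores are hypothetical, and the text does not say how a real robot computes importance.

```python
import heapq
import itertools


class Frontier:
    """Queue of pending URLs, ordered by descending score."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # tie-breaker: insertion order
        self._queued = set()

    def add(self, url, score):
        """Queue `url` once; a higher score means it is fetched earlier."""
        if url in self._queued:
            return
        self._queued.add(url)
        # heapq is a min-heap, so the score is negated.
        heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self):
        """Remove and return the URL with the highest score."""
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)


frontier = Frontier()
frontier.add("http://example.test/", 1.0)          # hypothetical scores
frontier.add("http://example.test/archive", 0.2)
frontier.add("http://example.test/news", 0.8)

budget = 2   # limited throughput: only two downloads fit in this round
selected = [frontier.pop() for _ in range(min(budget, len(frontier)))]
print(selected)
```

With a budget of two downloads, the low-scoring archive page stays in the queue for a later round.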
The behavior of a crawler results from a combination of the following principles:
- A selection principle that determines which pages to download.
- A re-visit principle that defines when to check whether pages have changed.
- A politeness principle that defines how to avoid overloading websites.
- A parallelism principle that defines how to coordinate distributed indexing robots.
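Two of these principles, politeness and re-visit, can be sketched in a few lines. Everything here is an assumption chosen for illustration: the `PolitenessScheduler` class, the two-second delay, and the halve-or-double rule in `next_revisit_interval` are not taken from any specific robot. Time is passed in explicitly so the example runs without waiting.

```python
from urllib.parse import urlsplit


class PolitenessScheduler:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, min_delay=2.0):
        self.min_delay = min_delay    # seconds between hits on one host (assumed)
        self.next_allowed = {}        # host -> earliest permitted request time

    def schedule(self, url, now):
        """Return when `url` may be fetched and reserve that slot."""
        host = urlsplit(url).netloc
        start = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = start + self.min_delay
        return start


def next_revisit_interval(interval, changed,
                          shortest=3600.0, longest=30 * 86400.0):
    """Halve the interval if the page changed, double it otherwise,
    keeping the result between `shortest` and `longest` seconds."""
    interval = interval / 2 if changed else interval * 2
    return min(max(interval, shortest), longest)


scheduler = PolitenessScheduler(min_delay=2.0)
urls = [
    "http://a.test/1",
    "http://a.test/2",
    "http://b.test/1",
    "http://a.test/3",
]
times = [scheduler.schedule(url, now=0.0) for url in urls]
print(times)  # [0.0, 2.0, 0.0, 4.0]

print(next_revisit_interval(86400.0, changed=True))   # 43200.0
print(next_revisit_interval(86400.0, changed=False))  # 172800.0
```

Requests to `a.test` are spaced two seconds apart, while the request to `b.test` is not delayed, because the limit applies per host rather than globally.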
Robots of Web 3.0
Web 3.0 refers to advanced technologies and new principles of Internet search that will rely in part on Semantic Web standards. Robots of Web 3.0 will exploit indexing methods involving smarter human-machine associations than those practiced today. The Semantic Web should not be confused with semantics applied to languages: the term designates the architecture of the data, namely the architecture of the relationships and content present on the web.
Free Robots
- GNU Wget is a free command-line program written in C that automates HTTP transfers.
- Heritrix is the archiving robot of the Internet Archive. It is written in Java.
- HTTrack is an offline browser that creates mirrors of websites for offline use. It is distributed under the GPL.
- Open Search Server is a website crawler. Published under the GPL, it builds on Lucene for indexing.
- Methabot is a crawler with a configuration system. It is published under the ISC license.
- Nutch is a crawler written in Java and released under the Apache License. It can be used with the Lucene project of the Apache Foundation.
Proprietary Robots
- Googlebot of Google
- Scooter of AltaVista
- OptimalSearch_Bot of Optimal Search
- MSNBot of MSN
- Slurp of Yahoo!
- KB Crawl of KB CRAWL SAS
- OmniExplorer_Bot of OmniExplorer
- TwengaBot of Twenga
- ExaBot of Exalead
- MooveOnBot of mooveon.net
- GloObotBot of gloObot.com
- VerticrawlBot of Verticrawl
Source: Wikipedia, the free encyclopedia. The text is available under a Creative Commons license.