SEO Israel - Professional SEO Services

SEO & Digital Marketing

 
 

Crawlers

 
A robot friendly website
GoogleBot - Google's scan robot
Website scan levels
Limiting robot access

Search engines collect information about your site using scan robots (also called crawlers, spiders, or simply robots) that constantly scan the Internet. These robots are programs (fairly primitive, although they keep improving) whose purpose is to download web pages into a database, search those pages for links to new pages, download the new pages, and so on.
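The download-and-follow-links cycle described above can be sketched in a few lines of Python. This is only an illustration, not how any real search engine is built: the `crawl` function, the `LinkParser` class, and the in-memory `SITE` dictionary (standing in for the network) are all hypothetical names invented for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Breadth-first crawl; returns {url: html} for every page reached."""
    database = {}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in database:
            continue  # already downloaded
        html = fetch(url)
        if html is None:
            continue  # page could not be downloaded
        database[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links against the page they appear on.
            queue.append(urljoin(url, href))
    return database


# Hypothetical three-page site held in memory instead of on the network.
SITE = {
    "http://example.com/": '<a href="/about.html">About</a>',
    "http://example.com/about.html": '<a href="contact.html">Contact</a>',
    "http://example.com/contact.html": '<a href="/">Home</a>',
}

pages = crawl("http://example.com/", SITE.get)
print(sorted(pages))
```

Starting from the homepage alone, the sketch reaches all three pages because each one is linked from a page that was already downloaded. A page that no other page links to would never be found this way.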

A tool for checking how many of your site's pages appear in search results:
Search Engine Saturation

A robot friendly website

Since robots are fairly primitive, they handle simple sites best. Websites built on advanced technologies risk not being understood by the robots. Technologies that you should avoid:

  • Flash: Today's robots can read only a few of the things that appear inside a Flash object. A website built entirely as one Flash object will appear in search results as a single page, without most of the site's content. Flash is the number 1 enemy of search robots today.
  • Frames: A method that is gradually disappearing. The problem is that the main URL stays the same while the content switches inside the frame. As a result, no internal page of your website can be reached directly, and only the main page will be displayed. If an internal page is displayed, it will appear without the surrounding frame. Either way, the results will not be good.
  • IFrames: This is a more up-to-date technology, but it creates the same problem. The content switches inside the IFrame, so the robot can't see the different contents. Strongly discouraged. If the goal is an internal scrolling area (containing code that is not imported from other pages), you should use a DIV or SPAN tag and define the scrolling through CSS.
  • Dynamic pages with Session ID: Many websites put a Session ID variable in the URL of their dynamic pages in order to track visitors. Because the session ID is new on every visit, the robot thinks it has found a new page that doesn't exist in its database. Pages of this type will eventually disappear from the search results completely.
  • Requiring Cookies: Some websites require cookies in order to view the site. Search engine robots don't know how to produce cookies and therefore cannot read pages that require them. This doesn't mean you shouldn't use cookies; just don't force the visitor to accept them.
  • Using only JavaScript Links: Robots recognize only standard HTML links (the <a href> kind), and they do not follow JavaScript links. If only JavaScript links lead to a certain page, that page will not be displayed.

Let's repeat the basics - simple is good.

GoogleBot - Google's scan robot

Here are some points that you should know about Google's scan robot, GoogleBot:

Scanning frequency: GoogleBot scans pages and websites at varying frequencies. The parameters that GoogleBot uses to decide which pages should be scanned more often are the PageRank (PR), the number of links leading to the page, and a number of URL parameters (e.g. whether it is a dynamic PHP or ASP page). There are obviously additional parameters, but these are hard to figure out.

Id parameter: GoogleBot may not scan dynamic pages whose URL includes a parameter called "id", since this variable is often used only for saving the session ID. It may be wise to avoid these two letters even inside a longer name (e.g. catid), but this has not been verified.

Website scan levels

Robots that scan the Internet use three main scan levels:

Scanning for new pages: This scan is performed in order to locate new pages that do not exist in the search engine database. The robot can "discover" a page if it was submitted through the search engine's "add a website" page, or if it encounters a link to the new page on one of the pages that already exist in its database.

Shallow scan of important pages: This scan includes only the most important pages on the website (usually the homepage), and is performed more often.

Deep Scan: This scan includes all of the site pages that appear in the database in order to locate new pages and alterations to existing pages. This type of scan is not performed frequently.
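The three levels can be pictured as three different ways of choosing pages from the robot's database. The Python sketch below is purely illustrative: the `select_pages` function, the level names, and the structure of `DATABASE` are hypothetical and say nothing about how a real search engine stores or schedules its scans.

```python
def select_pages(level, database, submitted=()):
    """Return the URLs a scan of the given level would visit."""
    if level == "new":
        # Pages submitted by hand or linked from known pages,
        # minus everything already in the database.
        linked = {link for page in database.values() for link in page["links"]}
        return sorted((linked | set(submitted)) - set(database))
    if level == "shallow":
        # Only the pages marked as important (usually the homepage).
        return sorted(url for url, page in database.items() if page["important"])
    if level == "deep":
        # Every page already in the database.
        return sorted(database)
    raise ValueError(f"unknown scan level: {level!r}")


# Hypothetical database: two known pages and the links found on them.
DATABASE = {
    "/": {"important": True, "links": ["/products.html", "/news.html"]},
    "/products.html": {"important": False, "links": ["/", "/new-product.html"]},
}

print(select_pages("new", DATABASE, submitted=["/landing.html"]))
print(select_pages("shallow", DATABASE))
print(select_pages("deep", DATABASE))
```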

Limiting robot access

Sometimes you will want to prevent search robots from accessing a certain area of your site. A basic example is a folder that you don't want to expose by mistake, or a page that is no longer updated.

There are two ways to block robots' access to certain areas of your site:

Robots.txt File

You will often be interested in blocking a certain search engine robot's access to your site (or to a part of it), or blocking access to a certain area from all robots. This is what the robots.txt file is used for.

Note: blocking a search engine from accessing a certain page does prevent the robot from collecting that page's content. However, if pages that are not blocked link to it, the page may still appear in the search results, only without its information (title, description, etc.). If you want to keep the page out of the results completely, you should use the other method (the robots meta tag).

The robots.txt file should be placed in your site's root folder. It usually does not exist there by default, so you need to create it. Each part of the file names a type of robot and lists the limitations that apply to it. In addition, there may be limitations that apply to all robots.
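As a small illustration of that structure, the sketch below feeds a robots.txt file to the `urllib.robotparser` module from Python's standard library and asks which pages each robot may fetch. The file contents, the paths, and the "OtherBot" name are invented for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one part for a specific robot,
# one part that applies to all other robots.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /old-page.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot follows its own part of the file.
print(parser.can_fetch("Googlebot", "http://example.com/private/report.html"))  # False
print(parser.can_fetch("Googlebot", "http://example.com/old-page.html"))        # True

# Any other robot falls back to the "*" part.
print(parser.can_fetch("OtherBot", "http://example.com/old-page.html"))         # False
print(parser.can_fetch("OtherBot", "http://example.com/private/report.html"))   # True
```

Note that with this parser a robot that has its own part of the file ignores the "*" part entirely, so a limitation meant for every robot has to be repeated inside each robot-specific part.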

Additional information regarding blocking access of search robots:
Building a robots.txt file

Robots Meta Tag

To control the way search robots process certain pages on your site, you can use the robots meta tag. This tag controls the following:

Whether or not to include the page in search engine results
Whether or not to follow outgoing links from this page
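The two settings correspond to the `noindex` and `nofollow` values of the tag's `content` attribute. The Python sketch below shows how a robot might read them from a page; the `RobotsMetaParser` class and the sample page are hypothetical, and the sketch handles only these two values.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Reads <meta name="robots"> and records the index/follow settings."""

    def __init__(self):
        super().__init__()
        self.index = True   # include the page in search results
        self.follow = True  # follow outgoing links from the page

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        directives = {item.strip().lower() for item in content.split(",")}
        if "noindex" in directives:
            self.index = False
        if "nofollow" in directives:
            self.follow = False


# Hypothetical page that asks robots not to list it but to follow its links.
PAGE = '<html><head><meta name="robots" content="noindex, follow"></head></html>'

meta = RobotsMetaParser()
meta.feed(PAGE)
print(meta.index, meta.follow)
```

A page without the tag keeps both defaults, which means it may be included in the results and its links may be followed.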

Additional information regarding blocking access of search robots:
Using the robots meta tag

SEO Israel, 1'st Hamada St., Rehovot 76703, Israel. (+972)-73-2240000