ProxyPy’s crawl process starts with a list of webpage URLs. When ProxyPy visits these URLs, it saves hyperlinks from the page for further crawling. This list, also known as the "crawl frontier", is repeatedly visited according to a set of ProxyPy policies to effectively map a site for updates: content changes, new pages, and dead links.
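The frontier loop described above can be sketched in a few lines. This is only an illustration of the general technique, not ProxyPy's actual implementation: the function names, the `max_pages` limit, and the toy link graph are all assumptions made for the example, and the revisit policies are left out.

```python
from collections import deque


def crawl(seed_urls, fetch_links, max_pages=100):
    """Visit pages breadth-first, growing the frontier with discovered links.

    fetch_links is a hypothetical callable that maps a URL to the list of
    hyperlinks found on that page.
    """
    unique_seeds = list(dict.fromkeys(seed_urls))  # drop duplicate seeds, keep order
    frontier = deque(unique_seeds)                 # URLs waiting to be visited
    seen = set(unique_seeds)                       # URLs already queued
    visited = []                                   # order in which pages were visited

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:                   # queue each URL only once
                seen.add(link)
                frontier.append(link)
    return visited


# Toy link graph standing in for real HTTP fetches (illustrative data only).
SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/c": [],
}

order = crawl(["https://example.com/"], lambda url: SITE.get(url, []))
print(order)
```

A real crawler would also re-queue known URLs on a schedule to detect content changes and dead links; that part depends on policies not described here.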
How To Block ProxyPy From Crawling Your Site
Bots crawl your web pages to parse your site content, so that the relevant information on your site is easily indexed and more readily available to users searching for the content you provide.
Although most bots are harmless and even quite beneficial, you may still want to prevent them from crawling your site (keep in mind, however, that not every bot on the web is there to help index your site). The easiest and quickest way to do this is to use the robots.txt file. This text file contains instructions on how a bot should process your site data.
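To see how a bot might interpret such instructions, here is a minimal sketch using Python's standard `urllib.robotparser` module. The robots.txt content is made up for the example, and `ExampleBot` is a placeholder token, not ProxyPy's actual user-agent.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: block one bot entirely, and keep every other
# bot out of /private/ only. "ExampleBot" is a placeholder token.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/page"))
print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/page"))
print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/private/x"))
```

Running a check like this against your own robots.txt before deploying it is a quick way to confirm that the rules block what you intend and nothing more.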
Important: The robots.txt file must be placed in the top directory of the website host to which it applies. Otherwise, it will have no effect on the ProxyPy behavior.
To stop ProxyPy from crawling your site, add the following rules to your robots.txt file:
To block ProxyPy from crawling your site for a webgraph of links:
If you have subdomains, you need to place a robots.txt file on each subdomain. ProxyPy does not apply a robots.txt file from elsewhere in your domain to a subdomain; if a subdomain has no robots.txt of its own, ProxyPy will assume that it is allowed to crawl everything on that subdomain.
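Because robots.txt is looked up per host, each subdomain resolves to its own file at the top directory. The helper below is a small sketch of that lookup; `robots_url` is a hypothetical name introduced for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL that governs page_url.

    The file always sits at the top directory of the same scheme and host,
    so the path, query, and fragment of the page are discarded.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://example.com/blog/post?id=1"))
print(robots_url("https://shop.example.com/cart"))
```

The two calls yield different URLs, which is why rules placed on example.com have no effect on shop.example.com.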
The robots.txt file should return an HTTP 200 status code. If a 4xx status code is returned, ProxyPy will assume that no robots.txt exists and that there are no crawl restrictions. Returning a 5xx status code for your robots.txt file will prevent ProxyPy from crawling your entire site. Our crawler can also handle robots.txt files served with a 3xx status code.
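The status-code rules above can be summarized as a small decision function. This is a sketch for checking your own server's behavior, not ProxyPy's code; the function name and the returned labels are assumptions, and how redirects are handled internally is not specified.

```python
def robots_policy(status_code):
    """Map the HTTP status of a robots.txt response to a crawl decision."""
    if status_code == 200:
        return "parse_rules"       # file found: obey the rules it contains
    if 300 <= status_code < 400:
        return "handle_redirect"   # redirects are handled; details not specified
    if 400 <= status_code < 500:
        return "allow_all"         # treated as "no robots.txt, no restrictions"
    if 500 <= status_code < 600:
        return "disallow_all"      # server error: the whole site is not crawled
    return "unspecified"           # anything else is not covered by the rules above


for code in (200, 301, 404, 503):
    print(code, robots_policy(code))
```

Note the asymmetry: a missing file (4xx) opens the site to crawling, while a failing server (5xx) closes it.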
Please note that it may take up to one hour or 100 requests for ProxyPy to discover changes made to your robots.txt.
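One way to picture this delay is a cached copy of robots.txt that is refetched once it is an hour old or has been used for 100 requests. The class below is a sketch under that assumption; ProxyPy's actual caching logic is not documented here, and the class name, method names, and the "whichever comes first" rule are hypothetical.

```python
import time


class RobotsCache:
    """Tracks when a cached robots.txt should be fetched again."""

    def __init__(self, max_age_seconds=3600, max_requests=100, clock=time.monotonic):
        self.max_age_seconds = max_age_seconds  # assumed limit: one hour
        self.max_requests = max_requests        # assumed limit: 100 requests
        self.clock = clock                      # injectable clock, useful for testing
        self.fetched_at = None                  # None until the first fetch
        self.requests_since_fetch = 0

    def needs_refresh(self):
        """Return True if robots.txt was never fetched or a limit was reached."""
        if self.fetched_at is None:
            return True
        too_old = self.clock() - self.fetched_at >= self.max_age_seconds
        too_many = self.requests_since_fetch >= self.max_requests
        return too_old or too_many

    def mark_fetched(self):
        """Record a fresh fetch and reset the request counter."""
        self.fetched_at = self.clock()
        self.requests_since_fetch = 0

    def record_request(self):
        """Count one page request made under the cached rules."""
        self.requests_since_fetch += 1
```

Under this model, a rule you add now takes effect only after the next refetch, so a short wait after editing robots.txt is expected.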
Do not try to block ProxyPy by IP address, as we do not use any consecutive IP blocks.