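This crawler does not honor the robots protocol at all. According to their own website, what robots.txt text is supposed to block this crawler?
Blocking Googlebot also blocks ImageSift Bot. Judging by the domain registration information, however, this crawler data company, founded in 2023, has no connection to Google.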
User-Agent: *
Disallow: /

User-Agent: Googlebot
Allow: /
Disallow: /private/
Does ImageSiftBot follow robots.txt rules?
Standard directives in robots.txt that target ImagesiftBot are respected. For example, the following will allow ImagesiftBot to crawl all pages, except those under /private/:
User-Agent: ImagesiftBot
Allow: /
Disallow: /private/
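The effect of this rule group can be sketched in Python. This is only a minimal sketch, assuming the longest-matching-prefix precedence described in RFC 9309; the FAQ states the outcome but not which precedence ImagesiftBot actually applies. The names `is_allowed` and `IMAGESIFT_RULES` are hypothetical, and wildcard patterns (`*`, `$`) are not handled.

```python
def is_allowed(rules, path):
    """Return True if `path` may be fetched under `rules`.

    `rules` is a list of (directive, path_prefix) pairs from one group.
    Assumption: the longest matching prefix wins, and Allow wins a tie.
    """
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        # An empty prefix matches nothing; skip non-matching rules.
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# Hypothetical encoding of the group shown above.
IMAGESIFT_RULES = [("Allow", "/"), ("Disallow", "/private/")]

print(is_allowed(IMAGESIFT_RULES, "/gallery/cat.jpg"))
print(is_allowed(IMAGESIFT_RULES, "/private/cat.jpg"))
```

Under this assumption the order of the `Allow` and `Disallow` lines does not matter. A parser that instead applies the first matching rule in file order would let `Allow: /` match every path, so the example only blocks /private/ when longest-match semantics are in use.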
ImagesiftBot also supports the crawl-delay directive in robots.txt files. It interprets the value as the minimum duration, in seconds, between the start of consecutive requests. For example, assume you have specified the following in your robots.txt file:
User-Agent: ImagesiftBot
Crawl-delay: 5
ImagesiftBot will split each day into 5-second intervals and issue at most one request to your domain inside each interval.
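The interval scheme described here can be illustrated with a short Python sketch. It is not ImagesiftBot's scheduler, only a model under the stated description: the day is cut into fixed buckets of `crawl_delay` seconds and at most one request is admitted per bucket. The function names and the sample timestamps (seconds since midnight) are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def intervals_per_day(crawl_delay):
    """Number of whole intervals a day is split into for a given delay."""
    return SECONDS_PER_DAY // crawl_delay


def interval_index(seconds_since_midnight, crawl_delay):
    """Index of the fixed interval that a timestamp falls into."""
    return int(seconds_since_midnight // crawl_delay)


def admit_requests(timestamps, crawl_delay):
    """Keep at most one request per interval, earliest first."""
    used = set()
    admitted = []
    for t in sorted(timestamps):
        idx = interval_index(t, crawl_delay)
        if idx not in used:
            used.add(idx)
            admitted.append(t)
    return admitted


# Hypothetical request times, in seconds since midnight.
wanted = [0.0, 1.2, 4.9, 5.0, 9.9, 12.0]
print(intervals_per_day(5))
print(admit_requests(wanted, 5))
```

With a delay of 5 the day has 17280 intervals, which caps the crawler at 17280 requests per day to one domain. Note that fixed buckets alone would let two requests straddle a boundary less than 5 seconds apart; the FAQ does not say how that case is reconciled with the stated minimum duration between requests.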
If there is no rule targeting ImagesiftBot, but there is a rule targeting Googlebot, then ImagesiftBot will follow the Googlebot directives. For example, ImagesiftBot will fetch all pages, except those under /private/ with the following robots.txt:
User-Agent: *
Disallow: /

User-Agent: Googlebot
Allow: /
Disallow: /private/
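The fallback described above amounts to choosing a rule group by preference. The Python sketch below shows one way to model it, assuming the order ImagesiftBot, then Googlebot, then `*`; the final fallback to `*` is an assumption based on common robots.txt behavior and is not stated in the FAQ. `parse_groups`, `select_group`, and `ROBOTS_TXT` are hypothetical names, and only `Allow` and `Disallow` lines are collected.

```python
ROBOTS_TXT = """\
User-Agent: *
Disallow: /

User-Agent: Googlebot
Allow: /
Disallow: /private/
"""


def parse_groups(text):
    """Map each lowercased user-agent token to its (directive, value) rules."""
    groups = {}
    current_agents = []
    in_rules = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:
                # A user-agent line after rules starts a new group.
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            in_rules = True
            for agent in current_agents:
                groups[agent].append((field, value))
    return groups


def select_group(groups, preference=("imagesiftbot", "googlebot", "*")):
    """Return the first group found in preference order (assumed order)."""
    for agent in preference:
        if agent in groups:
            return agent, groups[agent]
    return None, []


agent, rules = select_group(parse_groups(ROBOTS_TXT))
print(agent, rules)
```

For the file above no ImagesiftBot group exists, so the Googlebot group is selected and the blanket `Disallow: /` under `*` is ignored. That is why a site that blocks every crawler except Googlebot still ends up admitting ImagesiftBot unless it adds a group naming it explicitly.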
What information does ImageSiftBot save?
Along with images, ImageSiftBot saves the following information:
Host URL and text on the page
Alt text associated with image
How do we use this information?
Once images and text are downloaded from a webpage, ImageSift analyzes this data from the page and stores the information in an index. Our web intelligence products use this index to enable search and retrieval of similar images.