
Blocking Web Bots from Crawling Your Site

Networking · 奇跡の海 · 2017-12-07 · 1,759 views · 2 comments


Early one weekend morning I received an alert email: either the site was under attack, or there was a cache/log memory problem. A glance at access.log showed that, during that window, a wave of bots (automated programs that perform the same task over and over) had been hitting my site:

http://ltx71.com
http://mj12bot.com
http://www.bing.com/bingbot.htm
http://ahrefs.com/robot/
http://yandex.com/bots
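As a first diagnostic step, the access.log can be tallied by User-Agent to see which bots dominate the traffic. A minimal sketch using only Python's standard library; the sample lines below are abbreviated versions of the log excerpts in this post:

```python
import re
from collections import Counter

# Sample lines in Apache combined log format (abbreviated from this post).
SAMPLE_LOG = '''\
199.58.86.206 - - [25/Mar/2017:01:26:54 +0000] "GET / HTTP/1.1" 200 129989 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"
163.172.65.40 - - [25/Mar/2017:02:13:36 +0000] "GET / HTTP/1.1" 200 129989 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)"
163.172.65.40 - - [25/Mar/2017:02:13:42 +0000] "GET / HTTP/1.1" 200 129989 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)"
'''

# In the combined log format the User-Agent is the last quoted field.
UA_RE = re.compile(r'"([^"]*)"$')

def count_user_agents(lines):
    """Tally requests per User-Agent string."""
    counts = Counter()
    for line in lines:
        m = UA_RE.search(line.rstrip())
        if m:
            counts[m.group(1)] += 1
    return counts

counts = count_user_agents(SAMPLE_LOG.splitlines())
for ua, n in counts.most_common():
    print(n, ua)
```

Running this against the real access.log (e.g. `count_user_agents(open("access.log"))`) quickly shows which agents are responsible for the spike.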

website.com (AWS) – Monitor is Down

Down since: Mar 25, 2017 1:38:58 AM CET
Site monitored: http://www.website.com
Resolved IP: 54.171.32.xx
Reason: Service Unavailable.
Monitor group: XX Applications

Outage details (Location / Resolved IP / Reason):

London – UK (5.77.35.xx) 54.171.32.xx Service Unavailable.
Headers:
HTTP/1.1 503 Service Unavailable: Back-end server is at capacity
Content-Length : 0
Connection : keep-alive

GET / HTTP/1.1
Cache-Control : no-cache
Accept : */*
Connection : Keep-Alive
Accept-Encoding : gzip
User-Agent : Site24x7
Host : xxx

Seattle – US (104.140.20.xx) 54.171.32.xx Service Unavailable.
Headers:
HTTP/1.1 503 Service Unavailable: Back-end server is at capacity
Content-Length : 0
Connection : keep-alive

GET / HTTP/1.1
Cache-Control : no-cache
Accept : */*
Connection : Keep-Alive
Accept-Encoding : gzip
User-Agent : Site24x7
Host : xxx

 

A quick search showed that many webmasters have run into the same problem: a short burst of dense bot traffic creates a load spike that leaves the server unable to serve other clients. From the analysis in that write-up, there are several ways to block these bots.

1. robots.txt

Many crawlers fetch robots.txt first, as shown here:

"199.58.86.206" - - [25/Mar/2017:01:26:50 +0000] "GET /robots.txt HTTP/1.1" 404 341 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"
"199.58.86.206" - - [25/Mar/2017:01:26:54 +0000] "GET / HTTP/1.1" 200 129989 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"
"162.210.196.98" - - [25/Mar/2017:01:39:18 +0000] "GET /robots.txt HTTP/1.1" 404 341 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"

 

Many bot operators also document how to opt out of being crawled. Take MJ12bot as an example:

How can I block MJ12bot?

MJ12bot adheres to the robots.txt standard. If you want the bot to prevent website from being crawled then add the following text to your robots.txt:

User-agent: MJ12bot
Disallow: /

Please do not waste your time trying to block the bot via IP in htaccess – we do not use any consecutive IP blocks so your efforts will be in vain. Also please make sure the bot can actually retrieve robots.txt itself – if it can't then it will assume (this is the industry practice) that it's okay to crawl your site.

If you have reason to believe that MJ12bot did NOT obey your robots.txt commands, then please let us know via email: [email protected]. Please provide URL to your website and log entries showing bot trying to retrieve pages that it was not supposed to.

How can I slow down MJ12bot?

You can easily slow down bot by adding the following to your robots.txt file:

User-Agent: MJ12bot
Crawl-Delay: 5

Crawl-Delay should be an integer number and it signifies number of seconds of wait between requests. MJ12bot will make an up to 20 seconds delay between requests to your site – note however that while it is unlikely, it is still possible your site may have been crawled from multiple MJ12bots at the same time. Making high Crawl-Delay should minimise impact on your site. This Crawl-Delay parameter will also be active if it was used for * wildcard.

If our bot detects that you used Crawl-Delay for any other bot then it will automatically crawl slower even though MJ12bot specifically was not asked to do so.
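The Crawl-Delay handling described above can be checked locally with Python's stock urllib.robotparser before you deploy the file; this is only a simulation of what a well-behaved crawler would read, and has no effect on bots that ignore robots.txt:

```python
from urllib import robotparser

# A robots.txt fragment using the Crawl-Delay directive discussed above.
ROBOTS_TXT = """\
User-agent: MJ12bot
Crawl-delay: 5
Disallow: /wp-admin
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# What a compliant MJ12bot instance would see:
delay = rp.crawl_delay("MJ12bot")            # 5 seconds between requests
allowed = rp.can_fetch("MJ12bot", "/wp-admin")  # False: path is disallowed
print(delay, allowed)
```

Note that `crawl_delay()` requires Python 3.6 or later.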

 

So we can write a robots.txt like the following:

User-agent: YisouSpider
Disallow: /

User-agent: EasouSpider
Disallow: /

User-agent: EtaoSpider
Disallow: /

User-agent: MJ12bot
Disallow: /
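These rules can be sanity-checked with Python's urllib.robotparser before deploying; a quick sketch, assuming the file is served as written:

```python
from urllib import robotparser

# The per-bot Disallow rules from above, fed to Python's stock parser.
ROBOTS_TXT = """\
User-agent: MJ12bot
Disallow: /

User-agent: YisouSpider
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

blocked = rp.can_fetch("MJ12bot", "/any/page")    # False: fully disallowed
allowed = rp.can_fetch("Googlebot", "/any/page")  # True: no rule applies
print(blocked, allowed)
```

Remember this only binds crawlers that choose to honor robots.txt.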

 

In addition, many bots hit these paths:

/wp-login.php
/wp-admin/
/trackback/
/?replytocom=
Many WordPress sites genuinely use these paths, so how do we tune robots.txt without breaking functionality?

robots.txt before:

User-agent: *
Disallow: /wp-admin
Disallow: /wp-content/plugins
Disallow: /wp-content/themes
Disallow: /wp-includes
Disallow: /?s=

robots.txt after:

User-agent: *
Disallow: /wp-admin
Disallow: /wp-*
Allow: /wp-content/uploads/
Disallow: /wp-content
Disallow: /wp-login.php
Disallow: /comments
Disallow: /wp-includes
Disallow: /*/trackback
Disallow: /*?replytocom*
Disallow: /?p=*&preview=true
Disallow: /?s=

That said, many crawlers simply ignore robots.txt. In this example, the bot started crawling without ever requesting robots.txt:

"10.70.8.30, 163.172.65.40" - - [25/Mar/2017:02:13:36 +0000] "GET / HTTP/1.1" 200 129989 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)"
"178.63.23.67, 163.172.65.40" - - [25/Mar/2017:02:13:42 +0000] "GET / HTTP/1.1" 200 129989 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)"
"178.63.23.67, 163.172.65.40" - - [25/Mar/2017:02:14:17 +0000] "GET /static/js/utils.js HTTP/1.1" 200 5345 "http://iatatravelcentre.com/" "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)"
"178.63.23.67, 163.172.65.40" - - [25/Mar/2017:02:14:17 +0000] "GET /static/css/home.css HTTP/1.1" 200 8511 "http://iatatravelcentre.com/" "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)"
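One way to spot such bots is to flag client IPs that requested pages without ever requesting /robots.txt. A heuristic sketch over log lines shaped like the excerpts above (real log formats may differ, so the regex is an assumption):

```python
import re

# Matches the quoted-IP variant of the combined log format seen above:
# capture the client IP list and the request path.
LOG_RE = re.compile(
    r'^"?([\d., ]+)"?\s+-\s+-\s+\[[^\]]+\]\s+"(?:GET|POST|HEAD)\s+(\S+)')

def clients_ignoring_robots(lines):
    """Return client IPs that fetched pages but never /robots.txt."""
    fetched_robots, fetched_pages = set(), set()
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue
        ip = m.group(1).split(",")[0].strip()  # first IP if proxied
        path = m.group(2)
        (fetched_robots if path == "/robots.txt" else fetched_pages).add(ip)
    return fetched_pages - fetched_robots

SAMPLE = [
    '"199.58.86.206" - - [25/Mar/2017:01:26:50 +0000] "GET /robots.txt HTTP/1.1" 404 341 "-" "MJ12bot"',
    '"199.58.86.206" - - [25/Mar/2017:01:26:54 +0000] "GET / HTTP/1.1" 200 129989 "-" "MJ12bot"',
    '"178.63.23.67, 163.172.65.40" - - [25/Mar/2017:02:13:42 +0000] "GET / HTTP/1.1" 200 129989 "-" "AhrefsBot"',
]
print(clients_ignoring_robots(SAMPLE))  # {'178.63.23.67'}
```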

When that happens, we have to try the other methods.

 

2. .htaccess

The idea is URL rewriting: as soon as a request is seen to come from one of these agents, deny it. The article by “~吉爾伽美什” covers many .htaccess techniques; the relevant parts are excerpted below.

5. Blocking users by IP
order allow,deny
deny from 123.45.6.7
deny from 12.34.5. (the whole class-C block)
allow from all
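The `deny from 12.34.5.` line denies the whole class-C block, i.e. 12.34.5.0/24. The same containment idea can be sketched with Python's stdlib ipaddress module (this only mirrors the logic for illustration; Apache's own matching is authoritative):

```python
import ipaddress

# Deny list mirroring the .htaccess snippet above: one single host
# and one class-C (/24) block.
DENIED = [
    ipaddress.ip_network("123.45.6.7/32"),
    ipaddress.ip_network("12.34.5.0/24"),
]

def ip_denied(addr: str) -> bool:
    """True if the address falls inside any denied network."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in DENIED)

print(ip_denied("12.34.5.99"))   # True: inside the /24 block
print(ip_denied("8.8.8.8"))      # False
```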

6. Blocking users/sites by referrer
Requires the mod_rewrite module.
Example 1. Block a single referrer: badsite.com
RewriteEngine on
# Options +FollowSymlinks
RewriteCond %{HTTP_REFERER} badsite\.com [NC]
RewriteRule .* - [F]
Example 2. Block multiple referrers: badsite1.com, badsite2.com
RewriteEngine on
# Options +FollowSymlinks
RewriteCond %{HTTP_REFERER} badsite1\.com [NC,OR]
RewriteCond %{HTTP_REFERER} badsite2\.com
RewriteRule .* - [F]
[NC] – case-insensitive
[F] – returns 403 Forbidden
Note that "Options +FollowSymlinks" is commented out above. If the server does not set FollowSymLinks in the relevant section of httpd.conf, you need to add that line, otherwise you will get a "500 Internal Server Error".
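The referrer conditions above can be emulated in Python to sanity-check which Referer headers would hit the [F] rule. This only mirrors the regex plus the [NC] flag; it is not Apache's actual matcher:

```python
import re

# Python approximation of the RewriteCond referrer checks above.
# [NC] in mod_rewrite corresponds to re.IGNORECASE here.
BLOCKED_REFERRERS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (r"badsite1\.com", r"badsite2\.com")
]

def referrer_blocked(referer: str) -> bool:
    """True if the request would be answered with 403 Forbidden."""
    return any(p.search(referer) for p in BLOCKED_REFERRERS)

print(referrer_blocked("http://BadSite1.com/page"))  # True (case-insensitive)
print(referrer_blocked("http://goodsite.com/"))      # False
```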

7. Blocking bad bots and site rippers (aka offline browsers)
Requires the mod_rewrite module.
Bad bots include crawlers that harvest email addresses and crawlers that ignore robots.txt (baidu?).
They can be identified by their HTTP_USER_AGENT.
(Some, like "中搜 zhongsou.com", go further and disguise their agent string as "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"; nothing can be done about those.)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:[email protected] [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]
[F] – returns 403 Forbidden
[L] – last rule: stop processing further rewrite rules
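The user-agent conditions above can be tested offline the same way. A Python sketch of a small subset of the list (the full .htaccess block is the reference; this only mirrors the anchoring and [NC] flags of a few entries):

```python
import re

# A few of the HTTP_USER_AGENT patterns from the list above, matched the
# way mod_rewrite would: anchored with ^ where the original is anchored,
# case-insensitive where the original carries [NC].
UA_CONDS = [
    re.compile(r"^Wget"),
    re.compile(r"^WebZIP"),
    re.compile(r"HTTrack", re.IGNORECASE),  # [NC], unanchored in the original
]

def ua_blocked(user_agent: str) -> bool:
    """True if the agent would match a RewriteCond and get 403."""
    return any(c.search(user_agent) for c in UA_CONDS)

print(ua_blocked("Wget/1.20.3 (linux-gnu)"))                      # True
print(ua_blocked("Mozilla/4.5 (compatible; HTTrack 3.0x; Win)"))  # True
print(ua_blocked("Mozilla/5.0 (Windows NT 10.0) Firefox/90.0"))   # False
```

Note that ^-anchored patterns such as ^Wget miss bots that embed the name mid-string, which is why entries like HTTrack are left unanchored.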

8. Change your default directory page
DirectoryIndex index.html index.php index.cgi index.pl

9. Redirects
A single file:
Redirect /old_dir/old_file.html http://yoursite.com/new_dir/new_file.html
An entire directory:
Redirect /old_dir http://yoursite.com/new_dir
Effect: as if the directory had been moved:
http://yoursite.com/old_dir -> http://yoursite.com/new_dir
http://yoursite.com/old_dir/dir1/test.html -> http://yoursite.com/new_dir/dir1/test.html
Tip: making Redirect work under Apache user directories
When you use Apache's default user directories, e.g. http://mysite.com/~windix, and you want to redirect http://mysite.com/~windix/jump, you will find that this Redirect does not work:
Redirect /jump http://www.google.com
The correct form is:
Redirect /~windix/jump http://www.google.com
(source: .htaccess Redirect in "Sites" not redirecting: why?)

10. Prevent viewing of the .htaccess file
order allow,deny
deny from all

 

3. Denying access by IP

You can also deny access from certain IPs directly in the Apache configuration file httpd.conf:

<Directory "/var/www/html">
    Order allow,deny
    Allow from all
    Deny from 5.9.26.210
    Deny from 162.243.213.131
</Directory>

However, the offending IPs often change, so this approach is inconvenient, and edits to httpd.conf only take effect after an Apache restart. Modifying .htaccess is therefore the recommended route.


Copyright notice: all articles and resources on this site are published under the CC BY-NC-SA 4.0 license. Reproductions should be licensed the same way and credit the source: “SeaOMC.COM -> Blocking Web Bots from Crawling Your Site”.