The Web Design Group

... Making the Web accessible to all.


> Petalbot
Brian Chandler
post Jan 13 2022, 11:47 PM
Post #1


Jocular coder
********

Group: Members
Posts: 2,460
Joined: 31-August 06
Member No.: 43



My error log is full of accesses to the nonexistent https://imaginatorium.com/addbskt.php from something identifying itself as Petalbot. Its user-agent string links to a page here:

https://webmaster.petalsearch.com/site/petalbot

This explains that Petalbot follows the robots.txt protocol, and describes how to block it with rules like:

CODE

User-agent: PetalBot
Disallow: /*.php


But https://imaginatorium.com/robots.txt already includes

CODE

User-agent: *
Allow: /*.html
Disallow: /*.php


Unless I misunderstand something, if Petalbot followed the robots.txt protocol it would not attempt to access this page. Or do I have to go around adding in the names of all the robots I want to exclude?
 
Replies
Christian J
post Jan 14 2022, 08:41 AM
Post #2


.
********

Group: WDG Moderators
Posts: 9,686
Joined: 10-August 06
Member No.: 7



QUOTE(Brian Chandler @ Jan 14 2022, 05:47 AM) *

My error log is full of accesses to the nonexistent https://imaginatorium.com/addbskt.php from something identifying itself as Petalbot.

How did it find addbskt.php? Has that page existed previously? If Petalbot is just guessing URLs, I would regard it as malicious and ban it.

QUOTE
CODE

User-agent: *
Allow: /*.html
Disallow: /*.php

According to http://www.robotstxt.org/robotstxt.html, patterns like "/*.php" seem invalid:

QUOTE
Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".

Also, the Allow field does not exist, except as an extension: https://en.wikipedia.org/wiki/Robots_exclus...Allow_directive (I have no idea whether Petalbot supports it).
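For what it's worth, you can see the difference with Python's standard-library robots.txt parser, which does roughly the original prefix-matching spec: it treats "/*.php" as a literal path, so the Disallow never matches addbskt.php, while an explicit path does. A quick sketch (not what Petalbot itself does, just a strict-spec parser):

```python
from urllib.robotparser import RobotFileParser

# The wildcard rules currently in robots.txt.
rp = RobotFileParser()
rp.parse("""\
User-agent: *
Allow: /*.html
Disallow: /*.php
""".splitlines())

# A strict prefix-matching parser treats "/*.php" as a literal path,
# so it never matches /addbskt.php and the fetch is allowed.
print(rp.can_fetch("PetalBot", "https://imaginatorium.com/addbskt.php"))  # True

# An explicit path works under the original spec.
rp2 = RobotFileParser()
rp2.parse("""\
User-agent: *
Disallow: /addbskt.php
""".splitlines())
print(rp2.can_fetch("PetalBot", "https://imaginatorium.com/addbskt.php"))  # False
```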

Personally I use a spider trap that auto-bans anyone that does not follow my robots.txt directives. Usually there are a few catches every month.
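The idea can be sketched like this (hypothetical paths and logic, not my actual implementation): robots.txt disallows a trap URL that no polite crawler should ever fetch, and any client that fetches it anyway gets its IP banned from then on.

```python
# Hypothetical spider-trap sketch. Assumes robots.txt contains:
#   User-agent: *
#   Disallow: /trap/
banned_ips = set()

def handle_request(ip, path):
    """Ban clients that fetch the trap URL robots.txt told them to avoid."""
    if path.startswith("/trap/"):
        banned_ips.add(ip)
        return 403
    if ip in banned_ips:
        return 403
    return 200

print(handle_request("203.0.113.9", "/index.html"))         # 200
print(handle_request("203.0.113.9", "/trap/honeypot.html")) # 403, IP now banned
print(handle_request("203.0.113.9", "/index.html"))         # 403
```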
Brian Chandler
post Jan 15 2022, 02:19 AM
Post #3


Jocular coder
********

Group: Members
Posts: 2,460
Joined: 31-August 06
Member No.: 43



Thanks Christian. I don't know where I got my version from; everywhere generally says something different...

QUOTE(Christian J @ Jan 14 2022, 10:41 PM) *

QUOTE(Brian Chandler @ Jan 14 2022, 05:47 AM) *

My error log is full of accesses to the nonexistent https://imaginatorium.com/addbskt.php from something identifying itself as Petalbot.

How did it find addbskt.php? Has that page existed previously? If Petalbot is just guessing URLs, I would regard it as malicious and ban it.


There is an addbskt.php on the old site, and I suppose at some stage during development I had references to it on the new site. I intended to have robots banned at that stage, but who knows.

It seemed neat just to block all php pages, which are 'operations' not appropriate for search targeting, and allow all html pages, which are 'content'. I can just make an explicit list of the *.php files instead. Petalbot does seem to be a bit slow, because it is trawling through pages from months back, but in the end that's not a problem.
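Under the original spec that explicit list would look something like this (addbskt.php is the real one from my log; the other entry is a placeholder for whatever scripts the site actually has):

CODE

User-agent: *
# one Disallow line per script; the original spec only does prefix matching
Disallow: /addbskt.php
Disallow: /checkout.php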
