Petalbot |
Brian Chandler |
Jan 13 2022, 11:47 PM
Post
#1
|
Jocular coder Group: Members Posts: 2,461 Joined: 31-August 06 Member No.: 43 |
My error log is full of accesses to the nonexistent https://imaginatorium.com/addbskt.php from something identifying itself as Petalbot. This links to a page here:
https://webmaster.petalsearch.com/site/petalbot

This explains that Petalbot follows the robots.txt protocol, and describes how to block it by (e.g.):

CODE
User-agent: PetalBot
Disallow: /*.php

But https://imaginatorium.com/robots.txt already includes:

CODE
User-agent: *
Allow: /*.html
Disallow: /*.php

Unless I misunderstand something, if Petalbot followed the robots.txt protocol it would not attempt to access this page. Or do I have to go around adding the names of all the robots I want to exclude? |
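For reference, the wildcard syntax that PetalBot's page documents is a Google-style extension to the original protocol: `*` matches any sequence of characters and a trailing `$` anchors the rule to the end of the path. A minimal sketch of those matching semantics (an illustration, not PetalBot's actual implementation):

```python
import re

def robots_wildcard_match(pattern: str, path: str) -> bool:
    """Match a robots.txt rule path against a URL path using the
    Google-style extension semantics: '*' means 'any sequence of
    characters', and a trailing '$' anchors to the end of the path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything except '*', which becomes '.*'
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None

# Under these semantics the rule from the post would indeed block the URL:
print(robots_wildcard_match("/*.php", "/addbskt.php"))  # True
print(robots_wildcard_match("/*.php", "/index.html"))   # False
```

So a bot that honours the extended syntax should never request /addbskt.php given the robots.txt above; a bot that only implements the original prefix rules will not see a match at all.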
Christian J |
Jan 14 2022, 08:41 AM
Post
#2
|
. Group: WDG Moderators Posts: 9,688 Joined: 10-August 06 Member No.: 7 |
QUOTE
My error log is full of accesses to the nonexistent https://imaginatorium.com/addbskt.php from something identifying itself as Petalbot.

How did it find addbskt.php? Has that page existed previously? If Petalbot is just guessing URLs I would regard it as malicious and ban it.

QUOTE
CODE
User-agent: *
Allow: /*.html
Disallow: /*.php

According to http://www.robotstxt.org/robotstxt.html, things like "/*.php" seem invalid:

QUOTE
Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".

Also, an Allow field does not exist, except as an extension (I have no idea if Petalbot supports it):
https://en.wikipedia.org/wiki/Robots_exclus...Allow_directive

Personally I use a spider trap that auto-bans anyone who does not follow my robots.txt directives. Usually there are a few catches every month. |
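The prefix-only behaviour of the original robotstxt.org rules can be demonstrated with Python's standard-library parser, which does not expand wildcards: a literal "/*.php" rule simply fails to match "/addbskt.php", so a strict parser would allow the fetch.

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Allow: /*.html
Disallow: /*.php
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A wildcard-unaware parser treats "/*.php" as a literal path prefix,
# which "/addbskt.php" does not start with, so the fetch is allowed:
print(rp.can_fetch("PetalBot", "https://imaginatorium.com/addbskt.php"))  # True

# A rule written as a plain prefix does block the file:
rp2 = RobotFileParser()
rp2.parse(["User-agent: *", "Disallow: /addbskt.php"])
print(rp2.can_fetch("PetalBot", "https://imaginatorium.com/addbskt.php"))  # False
```

In other words, whether the original robots.txt "works" depends entirely on whether the bot implements the wildcard extension.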
Brian Chandler |
Jan 15 2022, 02:19 AM
Post
#3
|
Jocular coder Group: Members Posts: 2,461 Joined: 31-August 06 Member No.: 43 |
Thanks Christian. I don't know where I got my version from; everywhere generally says something different...
QUOTE
How did it find addbskt.php? Has that page existed previously? If Petalbot is just guessing URLs I would regard it as malicious and ban it.

There is an addbskt.php on the old site, and I suppose at some stage during development I had references to it on the new site. I intended to have robots banned at that stage, but who knows. It seemed neat just to block all .php pages, which are 'operations' not appropriate for search targeting, and allow all .html pages, which are 'content'. I can just make an explicit list of *.php files.

Petalbot does seem to be a bit slow, because it is trawling through pages from months back, but in the end that's not a problem. |
Christian J |
Jan 15 2022, 08:38 AM
Post
#4
|
. Group: WDG Moderators Posts: 9,688 Joined: 10-August 06 Member No.: 7 |
According to this page, Petalbot does not respect robots.txt (though no examples are given):
https://www.hypernode.com/blog/hosting/huaw...-online-stores/

This page says it does respect robots.txt, but is too aggressive:

https://james-william-fletcher.medium.com/h...er-f17c30e061e7 |
Brian Chandler |
Jan 15 2022, 09:44 AM
Post
#5
|
Jocular coder Group: Members Posts: 2,461 Joined: 31-August 06 Member No.: 43 |
QUOTE
According to this page, Petalbot does not respect robots.txt (though no examples are given):
https://www.hypernode.com/blog/hosting/huaw...-online-stores/

Their robots.txt is curious: it just blocks /wp-admin/ (not apparently what they are complaining about) but tries to 'allow' /wp-admin/admin-ajax.php, which looks very odd for something you would want a bot poking at.

QUOTE
This page says it does respect robots.txt, but is too aggressive:
https://james-william-fletcher.medium.com/h...er-f17c30e061e7

I didn't read very carefully, but they seem to have committed a "by default, execute the command" error. Neither is very convincing, frankly... so the muddle continues. |
Christian J |
Jan 15 2022, 11:49 AM
Post
#6
|
. Group: WDG Moderators Posts: 9,688 Joined: 10-August 06 Member No.: 7 |
QUOTE
Their robots.txt is curious: it just blocks /wp-admin/ - not apparently what they are complaining about - but tries to 'allow' /wp-admin/admin-ajax.php which looks very odd for something you would want a bot poking at.

Odd indeed. Maybe they don't care about robots.txt entries for well-behaved bots, while bad bots need to be blocked in other ways (since they ignore robots.txt anyway). |
Brian Chandler |
Jan 27 2022, 12:08 PM
Post
#7
|
Jocular coder Group: Members Posts: 2,461 Joined: 31-August 06 Member No.: 43 |
Well, I am still seeing huge numbers of robot accesses to .php files. Not only Petalbot, also DuckDuckWhateveritis, and others. Here is my robots.txt file, as of about two weeks ago; does it look OK?
https://imaginatorium.com/robots.txt

And how long do you think I need to give bots to update their copy of robots.txt? Any ideas? |
Christian J |
Jan 27 2022, 05:13 PM
Post
#8
|
. Group: WDG Moderators Posts: 9,688 Joined: 10-August 06 Member No.: 7 |
QUOTE
Well, I am still seeing huge numbers of robot accesses to .php files. Not only Petalbot, also DuckDuckWhateveritis, and others. Here is my robots.txt file, as of about two weeks ago; does it look OK?
https://imaginatorium.com/robots.txt

I wouldn't use ending slashes, unless e.g. "ack.php" is a directory and not a PHP file...

QUOTE
And how long do you think I need to give bots to update their copy of robots.txt? Any ideas?

No idea what they actually do. But since the purpose of a returning bot is to update its database, surely that would include the robots.txt file as well (if they care about it)? |
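The trailing-slash point follows from the fact that, in the original protocol, a Disallow value is matched as a plain string prefix of the URL path. A minimal illustration (the file name ack.php is taken from the post above):

```python
# "Disallow: /ack.php/" matches by string prefix, so it does NOT
# cover the file /ack.php itself, only paths nested under it:
rule = "/ack.php/"
print("/ack.php".startswith(rule))        # False: the file is not blocked
print("/ack.php/extra".startswith(rule))  # True: only sub-paths match

# Without the trailing slash, the rule covers the file itself:
rule = "/ack.php"
print("/ack.php".startswith(rule))        # True
```

So a robots.txt entry like "Disallow: /ack.php/" silently fails to block the PHP file it was presumably meant for.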
pandy |
Jan 27 2022, 11:03 PM
Post
#9
|
🌟Computer says no🌟 Group: WDG Moderators Posts: 20,744 Joined: 9-August 06 Member No.: 6 |
This article is over a year old, but of interest anyway. Petalbot flooded European shopping sites with requests at that time.

https://www.hypernode.com/blog/huawei-aspie...-online-stores/

I find the requests for addbskt.php strange. If it's looking for a shopping cart, one would think it would try several possible URLs.

This was also interesting. Not about Petalbot per se, but about the same kind of trouble coming from an IP range called the Huawei Cloud. Looks like Huawei is fishy in more ways than one.

https://support-acquia.force.com/s/article/...he-Huawei-Cloud |
Brian Chandler |
Jan 28 2022, 01:57 AM
Post
#10
|
Jocular coder Group: Members Posts: 2,461 Joined: 31-August 06 Member No.: 43 |
QUOTE
I wouldn't use ending slashes, unless e.g. "ack.php" is a directory and not a PHP file...

Thanks Christian! My blunder somehow. Pandy's links are interesting, but rather evidence-free in their claims that Petalbot does not comply with robots.txt. I'll see what happens now. I don't think we can expect them to read the robots.txt file every day; something like once a week or month would seem quite reasonable, so I am happy to be patient. |
Brian Chandler |
Feb 4 2022, 04:24 AM
Post
#11
|
Jocular coder Group: Members Posts: 2,461 Joined: 31-August 06 Member No.: 43 |
Just to record: accesses by Petalbot to the .php files in my robots.txt "Disallow" list appear to have stopped. So I think we can say that Petalbot follows the robots protocol.
|