Apache problem, Weird addresses being served... |
Brian Chandler |
Jul 14 2024, 10:28 AM
Post
#1
|
Jocular coder Group: Members Posts: 2,480 Joined: 31-August 06 Member No.: 43 |
I suddenly got a warning from Pair Networks that my "bandwidth" figure was several hundred times normal, with a projected 4-digit (dollar) bill. I managed to track down the problem, and Pair agreed to waive any surcharge, so good for them. But the underlying problem is weird. The problem page was https://imaginatorium.org/sano/tanbo.htm - which I made 20+ years ago.
But to go back to the beginning, the weirdness is this. Try the page: https://imaginatorium.com/ensky.html - should work. Now try https://imaginatorium.com/ensky.html/ship.php - I expect this to fail, since there is no file called ensky.html/ship.php in the relevant directory. But Apache simply serves the same page: it appears to go through the URL until it finds an existing file, ignoring the rest of the string, BUT treating the last slash in the URL as marking the "current directory". So if you click any of the links on this page, for example "Shop front", it goes to https://imaginatorium.com/ensky.html/shop.html - and this of course returns the current page all over again.

My problem page included an iframe (remember that???) including an image; this is the scrolling panorama near the top, which I have just reimplemented using CSS. The old version is commented out; here it is:

QUOTE
<center>
<iframe width="80%" height=196 src="pics/b045pano.htm" marginheight=0 marginwidth=0><a href="pics/b045pano.jpg">Panorama</a></iframe>
<p class=caption>Bare paddy-fields - 250-degree panorama of the Kanto Plain in winter</p>
</center>

So what happened was a (genuine bot) access to

"GET /sano/tanbo.htm/art/art/books/art/books/books/pics/art/pics/web/web/guest/pics/guest/art/pics/pics/stuff.htm HTTP/1.1" 200 12311 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"

This served the page, then inside the iframe served the same page again with a different (ignored) bit on the end, and so on, recursing indefinitely.

So what is going on? I cannot believe that this was the intention of the HTTP designers. I tried the Apache documentation, but could not find anything resembling a specification, just a vague statement about the file tree, and lots of "Try this, it may work for you" type stuff. Is this standard Apache behaviour, or could it be some problem with the Pair.com implementation? Can someone try the same trick on their own server?
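For anyone who wants to see the link arithmetic without touching a server, here is a minimal Python sketch of the client side of this. It only illustrates standard relative-URL resolution with `urllib.parse.urljoin`; the starting URL for the iframe chain is a shortened, hypothetical one in the style of the logged request, and the sketch simply assumes the server returns the same page for every URL in the chain.

```python
from urllib.parse import urljoin

# A relative link is resolved by the client (RFC 3986): everything after the
# last slash of the current URL is dropped, then the reference is appended.
base = "https://imaginatorium.com/ensky.html/ship.php"
shop = urljoin(base, "shop.html")
print(shop)


def iframe_chain(start, src, depth):
    """URLs requested when each response embeds an iframe with a relative src.

    Assumes, as observed in this thread, that the server answers every URL
    in the chain with the same page, so the same iframe appears each time.
    """
    urls = [start]
    for _ in range(depth):
        urls.append(urljoin(urls[-1], src))
    return urls


# Hypothetical, shortened starting URL in the style of the logged request.
chain = iframe_chain(
    "https://imaginatorium.org/sano/tanbo.htm/art/stuff.htm",
    "pics/b045pano.htm",
    3,
)
for url in chain:
    print(url)
```

Each step adds one more pics/ segment, so no URL in the chain ever repeats, and a client that only guards against revisiting the same URL would have no reason to stop.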
(This test will not work on the original paddy-field page, because I put this in .htaccess ...)

QUOTE
DirectoryIndex sano.htm index.htm
AddType application/x-httpd-php .php .htm
#
RewriteEngine On
RewriteBase /
# Block bogus tanbo.htm/art/stuff/... and deliver 403 access denied
RewriteCond %{REQUEST_URI} \.htm/
RewriteRule \.htm/ - [F,L]

Grateful for suggestions: if I can work out whether it is specific to pair.com I can go either to them or to Apache support...
|
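A quick way to check which request paths the pattern in that rule catches is to try it outside Apache. This is only a sketch: mod_rewrite uses PCRE rather than Python's `re`, but for a pattern this simple the two should agree, and the third sample path is a made-up one added to show a gap.

```python
import re

# Same pattern as in the RewriteCond above; it is unanchored, so it matches
# ".htm/" anywhere in the request path.
BOGUS = re.compile(r"\.htm/")


def would_be_blocked(request_uri):
    """True if the pattern matches, i.e. the rule would answer 403."""
    return BOGUS.search(request_uri) is not None


samples = [
    "/sano/tanbo.htm",                     # the real page
    "/sano/tanbo.htm/art/pics/stuff.htm",  # bogus trailing path
    "/sano/tanbo.html/art/stuff.htm",      # hypothetical: ".html/" is not matched
]
for uri in samples:
    print(uri, would_be_blocked(uri))
```

The last sample suggests the rule as written only protects files ending in .htm; pages ending in .html or .php would need their own pattern.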
pandy |
Jul 14 2024, 10:43 AM
Post
#2
|
🌟Computer says no🌟 Group: WDG Moderators Posts: 20,766 Joined: 9-August 06 Member No.: 6 |
QUOTE
But to go back to the beginning, the weirdness is this. Try the page: https://imaginatorium.com/ensky.html - should work. Now try https://imaginatorium.com/ensky.html/ship.php - I expect this to fail, since there is no file called ensky.html/ship.php in the relevant directory. But Apache simply serves the same page: it appears to go through the url until it finds an existing file, ignoring the rest of the string, BUT treating the last slash in the url as marking the "current directory". So if you click any of the links on this page, for example "Shop front" it goes to https://imaginatorium.com/ensky.html/shop.html - and this of course returns the current page all over again.

I don't know. https://imaginatorium.com/ensky.html/ship.php isn't a valid URL (can't do the slash thing after a file name). Could it be the server is configured to just ignore a slash after a file name and any mumbo jumbo that comes after it and just reload the page? Don't know why it would be, but it's all I can think of.

https://imaginatorium.com/ensky.html/
https://imaginatorium.com/ensky.html/qwertyuio
https://imaginatorium.com/ensky.html/qwertyuio.html
|
Christian J |
Jul 14 2024, 08:11 PM
Post
#3
|
. Group: WDG Moderators Posts: 9,743 Joined: 10-August 06 Member No.: 7 |
QUOTE
Now try https://imaginatorium.com/ensky.html/ship.php - I expect this to fail, since there is no file called ensky.html/ship.php in the relevant directory. But Apache simply serves the same page: it appears to go through the url until it finds an existing file, ignoring the rest of the string, BUT treating the last slash in the url as marking the "current directory".

I have a host that suddenly enabled "content negotiation" (is that the term?) on my site without asking or telling, IIRC making URLs like example.com/foo return example.com/foo.html (instead of example.com/foo/ as usual). Could it be that your host is using something similar, but more advanced?

QUOTE
Can someone try the same trick on their own server?

I can check tomorrow if I remember, but there seem to be lots of different Apache configurations these days, and I have no idea where the documentation for them is. For example, the same host as above tends to make my htaccess directives stop working every time they change their Unix server software.
|
Christian J |
Jul 14 2024, 08:21 PM
Post
#4
|
. Group: WDG Moderators Posts: 9,743 Joined: 10-August 06 Member No.: 7 |
QUOTE
I don't know. https://imaginatorium.com/ensky.html/ship.php isn't a valid URL (can't do the slash thing after a file name).

In other words, you can't use periods in directory names (like you can with Windows folders, in this case the folder "ensky.html")?
|
pandy |
Jul 14 2024, 08:49 PM
Post
#5
|
🌟Computer says no🌟 Group: WDG Moderators Posts: 20,766 Joined: 9-August 06 Member No.: 6 |
Is ensky.html a directory and not an HTML document?
|
Dag |
Jul 15 2024, 02:32 AM
Post
#6
|
Advanced Member Group: Members Posts: 122 Joined: 24-October 06 Member No.: 549 |
QUOTE
... Could it be the server is configured to just ignore a slash after a file name and any mumbo jumbo that comes after it and just reload the page? Don't know why it would be, but it's all I can think of.
https://imaginatorium.com/ensky.html/qwertyuio.html

Interesting... it seems that the default Apache (or browser?) attitude is to ignore the trailing slash. The same happens on my server. Here too! You should try:
https://forums.htmlhelp.com/index.php?act=idx
idx is variable
https://forums.htmlhelp.com/index.php?act=idx/
but the above one also works (incredible!). The next one too:
https://forums.htmlhelp.com/index.php?act=idx/abracadabra

In your case, you can't see https://imaginatorium.com/ship.php in the URI https://imaginatorium.com/ensky.html/ship.php because 'ensky.html' is a real, valid file, and that is what is returned.

I am not sure that it has anything to do with htaccess content negotiation, which deals with various file types that have the same name. When these real files exist:
https://imaginatorium.com/ensky.html
https://imaginatorium.com/ensky.jpg
https://imaginatorium.com/ensky.php
then for a URI request of https://imaginatorium.com/ensky the server will negotiate and return whichever one 'he' decides is the proper solution.

This works (file is 'analize.html'):
http://www.laban.rs/r/a/analize
but this doesn't (file is 'ensky.html' - 404 returned):
https://imaginatorium.com/ensky
Your content negotiation is off.
|
Christian J |
Jul 15 2024, 07:32 AM
Post
#7
|
. Group: WDG Moderators Posts: 9,743 Joined: 10-August 06 Member No.: 7 |
|
pandy |
Jul 15 2024, 09:53 AM
Post
#8
|
🌟Computer says no🌟 Group: WDG Moderators Posts: 20,766 Joined: 9-August 06 Member No.: 6 |
QUOTE
Interesting... it seems that the default Apache (or browser?) attitude is to ignore the trailing slash. The same happens on my server. Here too! You should try:
https://forums.htmlhelp.com/index.php?act=idx
idx is variable
https://forums.htmlhelp.com/index.php?act=idx/
but the above one also works (incredible!). The next one too:
https://forums.htmlhelp.com/index.php?act=idx/abracadabra

But it's part of the query string in those examples. With the construction Brian uses we get a 404.
https://htmlhelp.com/reference/html40/entit.../qwertyuio.html
|
pandy |
Jul 15 2024, 10:08 AM
Post
#9
|
🌟Computer says no🌟 Group: WDG Moderators Posts: 20,766 Joined: 9-August 06 Member No.: 6 |
QUOTE
Is ensky.html a directory and not an HTML document?

In Brian's case it's obviously an HTML document. But when followed by a slash it looks like a directory. I don't know if it's a valid URL though. Surely periods in file and folder names are allowed on *nix too. The question is how to construct a URL to them so that they aren't seen as filename.ext. If you have a URL like this
http://example.com/valid.folder/
will the server pick the index file in valid.folder, or will it ignore the ending slash and try to find a file called valid with the extension folder?

But what I meant by not valid in this case is that AFAIK a slash can't be used in the way Brian uses it: https://imaginatorium.com/ensky.html/ship.php . What's that last slash even meant to mean?
|
Christian J |
Jul 15 2024, 10:30 AM
Post
#10
|
. Group: WDG Moderators Posts: 9,743 Joined: 10-August 06 Member No.: 7 |
QUOTE
Can someone try the same trick on their own server?

Both my webhost and my local XAMPP test server seem to do the same. This:
CODE
localhost/foo/existing-file.html/non-existing-file.html
returns the content of this HTML file:
CODE
localhost/foo/existing-file.html
|
Christian J |
Jul 15 2024, 10:35 AM
Post
#11
|
. Group: WDG Moderators Posts: 9,743 Joined: 10-August 06 Member No.: 7 |
QUOTE
Surely periods in file and folder names are allowed on *nix too. The question is how to construct a URL to them so that they aren't seen as filename.ext. If you have a URL like this http://example.com/valid.folder/ will the server pick the index file in valid.folder

That's what I would assume. Doesn't the ending slash indicate that "valid.folder" should be a folder? If no such folder exists, the server should return a 404.

QUOTE
or will it ignore the ending slash and try to find a file called valid with the extension folder?

That's what seems to happen in practice, but it doesn't make sense to me. Maybe it's another example of software trying to be "helpful".

QUOTE
But what I meant by not valid in this case is that AFAIK a slash can't be used in the way Brian uses it: https://imaginatorium.com/ensky.html/ship.php . What's that last slash even meant to mean?

I assume that's a buggy URL. If it was intentional, "ensky.html" would have to be a folder (obviously it would be a confusing name for a folder).
|
Brian Chandler |
Jul 15 2024, 11:40 AM
Post
#12
|
Jocular coder Group: Members Posts: 2,480 Joined: 31-August 06 Member No.: 43 |
Thanks for responses. Some points in no particular order...
In Unix a directory name can be almost anything, including fred.html; the only prohibited character is / (slash). I believe that https://any.domain.name followed by almost anything, separated by /, ? or perhaps other non-alphanumeric characters, is a "valid" URL - it is up to the server to interpret it as it wishes. And in Dag's example https://forums.htmlhelp.com/index.php?act=idx/abracadabra the argument string should result in a GET variable with a value of "idx/abracadabra", so again it is only a question of how the server parses this. If it does an initial string match for 'idx' then 'idx/abracadabra' will act as 'idx'; if it checks whether the GET argument is exactly 'idx', then it should fail.

But anyway, the problem is that the Apache behaviour I am getting is wildly different from what I expect, and I don't know if this is simply Apache "more-or-less normal", caused by some strange setting on the Pair server, caused by my .htaccess settings, or just a bug. And whether Apache has a policy of responding to support requests by sorting out bugs, or whether it's just the wild west.

Because it is a shared server, I have to put .htaccess files in each directory, which slows down the server; it has to look at each directory in turn, so logically it might be parsing the file path, rather than just throwing the whole filename at the file system. Even so, it should not be treating the name of the current directory as different from the actual directory from which it got the file.

The people at Pair did not say much, but something about having error handling directives for 403 and 404, and how "recently" they have seen more of these circling URLs including chains of bogus directories.

Anyway, I would be grateful if someone else can check a different server with the recursive iframe example. Don't leave it there for long, because if a bot finds it you risk a large surcharge.
|
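The query-string point can be checked with the Python standard library. This is a sketch only: the two `page_by_...` functions are hypothetical stand-ins for whatever the forum software really does with `act`, which nobody in this thread has seen.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://forums.htmlhelp.com/index.php?act=idx/abracadabra"
parts = urlsplit(url)
act = parse_qs(parts.query)["act"][0]
print(parts.path)  # the slash after "idx" is not part of the path
print(act)         # it is simply part of the value of "act"


def page_by_prefix(value):
    """Hypothetical dispatcher that only looks at the start of the value."""
    return "board index" if value.startswith("idx") else "error"


def page_by_exact_match(value):
    """Hypothetical dispatcher that insists on the exact value."""
    return "board index" if value == "idx" else "error"


print(page_by_prefix(act))
print(page_by_exact_match(act))
```

So the forum examples say nothing about how Apache maps paths to files: the slash never reaches the path at all, and whether the page "works" depends on which style of comparison the script uses.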
Christian J |
Jul 15 2024, 03:34 PM
Post
#13
|
. Group: WDG Moderators Posts: 9,743 Joined: 10-August 06 Member No.: 7 |
QUOTE
I don't know if this is simply Apache "more-or-less normal", caused by some strange setting on the Pair server, caused by my .htaccess settings, or just a bug.

I suppose neither the webhost nor your .htaccess file is responsible, since I got the same result in XAMPP. Can't say if it's an Apache bug without knowing what the spec says (which I don't). Is it this one? https://www.rfc-editor.org/rfc/rfc3986

Maybe someone with a non-Apache server could test as well...

QUOTE
And whether Apache has a policy of responding to support requests by sorting out bugs, or whether it's just the wild west.

Seems there are a few Apache forums, someone there might know.

QUOTE
"recently" they have seen more of these circling URLs including chains of bogus directories.

Something must have caused such a recent change. Maybe MJ12bot has become buggy (or more intrusive) lately? I've also read that scraper bots are disguising themselves with that "MJ12bot" name.
|
pandy |
Jul 15 2024, 06:37 PM
Post
#14
|
🌟Computer says no🌟 Group: WDG Moderators Posts: 20,766 Joined: 9-August 06 Member No.: 6 |
The behavior is not the same on this domain.
|
Christian J |
Jul 15 2024, 08:13 PM
Post
#15
|
. Group: WDG Moderators Posts: 9,743 Joined: 10-August 06 Member No.: 7 |
You're right, this gives a 404:
CODE
https://htmlhelp.com/reference/html40/structure.html/foo.html
Good idea to check with any server, no need to just use our own.
|
Brian Chandler |
Jul 15 2024, 11:20 PM
Post
#16
|
Jocular coder Group: Members Posts: 2,480 Joined: 31-August 06 Member No.: 43 |
QUOTE
I don't know if this is simply Apache "more-or-less normal", caused by some strange setting on the Pair server, caused by my .htaccess settings, or just a bug.

QUOTE
I suppose neither the webhost nor your .htaccess file is responsible, since I got the same result in XAMPP. Can't say if it's an Apache bug without knowing what the spec says (which I don't). Is it this one? https://www.rfc-editor.org/rfc/rfc3986 ...

Quote: "This specification does not mandate a particular registered name lookup technology and therefore does not restrict the syntax of reg-name beyond what is necessary for interoperability."

In other words, all of what we are talking about are valid URLs; the question is what Apache should do with them.

I have a fix: add a .htaccess rewrite rule to replace any of .html/ .php/ etc etc by a 503 response. But this seems crazy, me having to patch against non-standard behaviour.
|
Brian Chandler |
Jul 20 2024, 11:00 AM
Post
#17
|
Jocular coder Group: Members Posts: 2,480 Joined: 31-August 06 Member No.: 43 |
I am still trying to get to the bottom of this. This page simply echoes what Apache thinks it is doing:
https://imaginatorium.org/stuff/echo.php
Add anything on to the end, and see what happens - just a slash is enough to break everything:
https://imaginatorium.org/stuff/echo.php/
Then if you add more, e.g.
https://imaginatorium.org/stuff/echo.php/fish/chips.html
...in particular the first two $_SERVER values are seriously weird:
PATH_TRANSLATED /usr/www/users/horigome/fish/chips.html
SCRIPT_FILENAME /usr/www/users/horigome/stuff/echo.php
The link to the top page (index.htm) is written as ../index.htm, but gets resolved to /stuff/echo.php/index.htm

I would be grateful if anyone has any ideas about this, or can suggest the best forum to ask for Apache expertise.
|
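The two values quoted above are what one would expect if the server splits the URL path into "script" plus "path info" in the way the CGI specification (RFC 3875) describes, with PATH_TRANSLATED being the leftover path mapped onto the document root. Here is a rough Python model of that split. It is a sketch, not Apache's code: the `EXISTING_FILES` set is a hypothetical stand-in for the real file system, and the document root is simply read off the values in the post.

```python
from urllib.parse import urljoin

DOCROOT = "/usr/www/users/horigome"    # read off the values quoted above
EXISTING_FILES = {"/stuff/echo.php"}   # hypothetical stand-in for the file system


def split_path_info(url_path):
    """Walk the URL path left to right; the first prefix naming a real file
    is the script, and whatever is left over is the path info."""
    segments = [s for s in url_path.split("/") if s]
    for i in range(1, len(segments) + 1):
        candidate = "/" + "/".join(segments[:i])
        if candidate in EXISTING_FILES:
            return candidate, url_path[len(candidate):]
    return None, ""


def cgi_vars(url_path):
    """Model of the three CGI variables for one request path."""
    script, path_info = split_path_info(url_path)
    if script is None:
        return None  # no prefix names a file: an ordinary 404
    return {
        "SCRIPT_FILENAME": DOCROOT + script,
        "PATH_INFO": path_info,
        # The leftover path is translated as if it were a URL path of its own.
        "PATH_TRANSLATED": DOCROOT + path_info if path_info else None,
    }


info = cgi_vars("/stuff/echo.php/fish/chips.html")
for name, value in info.items():
    print(name, value)

# The browser never learns where the script "really" lives, so it resolves
# ../index.htm against the full requested URL.
resolved = urljoin(
    "https://imaginatorium.org/stuff/echo.php/fish/chips.html", "../index.htm"
)
print(resolved)
```

On this reading PATH_TRANSLATED pointing at a nonexistent /fish/chips.html is not a bug in itself; it is the intended meaning of the variable once path info is accepted at all.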
Brian Chandler |
Aug 4 2024, 01:33 AM
Post
#18
|
Jocular coder Group: Members Posts: 2,480 Joined: 31-August 06 Member No.: 43 |
QUOTE
I am still trying to get to the bottom of this. This page simply echoes what Apache thinks it is doing:
https://imaginatorium.org/stuff/echo.php
Add anything on to the end, and see what happens - just a slash is enough to break everything:
https://imaginatorium.org/stuff/echo.php/
Then if you add more, e.g.
https://imaginatorium.org/stuff/echo.php/fish/chips.html
...in particular the first two $_SERVER values are seriously weird:
PATH_TRANSLATED /usr/www/users/horigome/fish/chips.html
SCRIPT_FILENAME /usr/www/users/horigome/stuff/echo.php
The link to the top page (index.htm) is written as ../index.htm, but gets resolved to /stuff/echo.php/index.htm
I would be grateful if anyone has any ideas about this, or can suggest the best forum to ask for Apache expertise.

I think I have resolved this. The key is the Apache configuration setting: AcceptPathInfo. This is all about handling requests with a (cgi - still can't remember exactly what this means) program, rather than directly accessing the file tree on the server. For example, you use some "content management" or similar system called 'fiddle', which you put in the top-level directory; then all accesses are of the form https://mysite.com/fiddle/fiddle-page-identifier. The server loads the program fiddle, and passes the rest of the URL as "path info".

The Apache documentation is not terribly helpful, not terribly comprehensive, and not written in terribly grammatical English either, so it does not appear to explain that this also changes how relative links end up being interpreted. A relative link to "index.html" on a page served as https://mysite.com/fiddle/fiddle-page-identifier no longer means "go to index.html in the directory the file actually came from"; the browser resolves it to /fiddle/index.html, which means "carry on using fiddle, and the path info will be /index.html".

Here are some references:

Apache spec here: https://httpd.apache.org/docs/2.4/mod/core....#AcceptPathInfo
...basically says that, for ordinary files, not accepting "path info" is the default.

https://www.a2hosting.com/kb/developer-corn...info-directive/
...says: "By default, URLs cannot contain trailing pathname information."

Need Apache directive:
AcceptPathInfo Off

The a2hosting page above also says: "However, some third-party software packages, such as the Moodle course management system, use URLs with pathname information, and will not function correctly." I guess that some user was trying to use Moodle, and it "didn't work"; they discovered that changing AcceptPathInfo to "On" made it work, and didn't bother to understand how this would mess up other users.

*** Forgot: I added a directory nopath, and put the AcceptPathInfo Off in .htaccess. So if you go to https://imaginatorium.org/stuff/nopath/echo...fish/chips.html everything is as expected, and you get 404s for all the strange URLs.

This post has been edited by Brian Chandler: Aug 4 2024, 01:38 AM
|
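As far as I can tell from the Apache documentation, AcceptPathInfo takes three values (On, Off and Default), and with Default the decision is left to whichever handler serves the file: the core handler for plain files rejects trailing path info, while script handlers generally accept it. The toy Python model below only restates that rule; the function name and its `handler_accepts` flag are made up for illustration and say nothing about how any particular host is actually configured.

```python
def status_for(path_info, accept_path_info="Default", handler_accepts=False):
    """Toy model of the documented AcceptPathInfo values.

    path_info        -- whatever follows the real file name in the URL
    accept_path_info -- "On", "Off" or "Default"
    handler_accepts  -- hypothetical flag, only consulted for "Default":
                        True for a script handler, False for a plain file
    """
    if not path_info:
        return 200  # the URL names the file exactly
    if accept_path_info == "On":
        return 200
    if accept_path_info == "Off":
        return 404
    if accept_path_info == "Default":
        return 200 if handler_accepts else 404
    raise ValueError("unknown AcceptPathInfo value: " + accept_path_info)


print(status_for("/fish/chips.html", "Off"))
print(status_for("/fish/chips.html", "Default", handler_accepts=True))
print(status_for("/fish/chips.html", "Default", handler_accepts=False))
```

If that reading of the documentation is right, a file served through a script handler could accept path info under Default without anyone having switched the setting to On, so an explicit AcceptPathInfo Off would still be the dependable fix.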
Brian Chandler |
Aug 7 2024, 08:30 AM
Post
#19
|
Jocular coder Group: Members Posts: 2,480 Joined: 31-August 06 Member No.: 43 |
QUOTE
... Could it be the server is configured to just ignore a slash after a file name and any mumbo jumbo that comes after it and just reload the page? Don't know why it would be, but it's all I can think of.
https://imaginatorium.com/ensky.html/qwertyuio.html

QUOTE
Interesting... it seems that the default Apache (or browser?) attitude is to ignore the trailing slash. The same happens on my server. Here too! You should try:
https://forums.htmlhelp.com/index.php?act=idx
idx is variable
https://forums.htmlhelp.com/index.php?act=idx/
but the above one also works (incredible!). The next one too:
https://forums.htmlhelp.com/index.php?act=idx/abracadabra

In your case, you can't see https://imaginatorium.com/ship.php in the URI https://imaginatorium.com/ensky.html/ship.php because 'ensky.html' is a real, valid file, and that is what is returned.

I am not sure that it has anything to do with htaccess content negotiation, which deals with various file types that have the same name. When these real files exist:
https://imaginatorium.com/ensky.html
https://imaginatorium.com/ensky.jpg
https://imaginatorium.com/ensky.php
then for a URI request of https://imaginatorium.com/ensky the server will negotiate and return whichever one 'he' decides is the proper solution.

This works (file is 'analize.html'):
http://www.laban.rs/r/a/analize
but this doesn't (file is 'ensky.html' - 404 returned):
https://imaginatorium.com/ensky
Your content negotiation is off.

What does "Your content negotiation is off." mean?
|