The Web Design Group

... Making the Web accessible to all.

> Apache problem, Weird addresses being served...
Brian Chandler
post Jul 14 2024, 10:28 AM
Post #1


Jocular coder
********

Group: Members
Posts: 2,476
Joined: 31-August 06
Member No.: 43



I suddenly got a warning from Pair Networks that my "bandwidth" figure was several hundred times normal, with a projected 4-digit (dollar) bill. I managed to track down the problem, and Pair agreed to waive any surcharge, so good for them. But the underlying problem is weird. The problem page was https://imaginatorium.org/sano/tanbo.htm - which I made 20+ years ago.

But to go back to the beginning, the weirdness is this. Try the page: https://imaginatorium.com/ensky.html - should work. Now try https://imaginatorium.com/ensky.html/ship.php - I expect this to fail, since there is no file called ensky.html/ship.php in the relevant directory. But Apache simply serves the same page: it appears to go through the url until it finds an existing file, ignoring the rest of the string, BUT treating the last slash in the url as marking the "current directory". So if you click any of the links on this page, for example "Shop front" it goes to https://imaginatorium.com/ensky.html/shop.html - and this of course returns the current page all over again.
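
The "current directory" effect comes from ordinary relative-URL resolution in the browser, and can be reproduced outside it. This is a minimal sketch using Python's standard library; it assumes the "Shop front" link is written as the relative reference shop.html, which the post implies but does not show.

```python
from urllib.parse import urljoin

# The URL the browser believes it is on (bogus, but the server still answers it).
base = "https://imaginatorium.com/ensky.html/ship.php"

# Assumed markup: <a href="shop.html">Shop front</a>
# A relative reference replaces everything after the last slash of the base URL.
resolved = urljoin(base, "shop.html")
print(resolved)
```

The client never learns that ensky.html is a file rather than a directory; it only sees the slashes in the URL.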

My problem page included an iframe (remember that???) including an image; this is the scrolling panorama near the top, which I have just reimplemented using css. The old version is commented out; here it is:

QUOTE
<center>
<iframe width="80%" height=196 src="pics/b045pano.htm" marginheight=0 marginwidth=0><a href="pics/b045pano.jpg">Panorama</a></iframe>
<p class=caption>Bare paddy-fields - 250-degree panorama of the Kanto Plain in winter</p>
</center>


So what happened was a (genuine bot) access to "GET /sano/tanbo.htm/art/art/books/art/books/books/pics/art/pics/web/web/guest/pics/guest/art/pics/pics/stuff.htm HTTP/1.1" 200 12311 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)". This served the page, then inside the iframe served the same page again with a different (ignored) bit on the end, and so on, recursing indefinitely.
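
To illustrate why this never terminates, here is a rough sketch of the chain of requests a crawler would generate if every one of these URLs returns the same page. The starting URL is a shortened, hypothetical version of the logged one, and crawl_chain is a made-up helper name; only the iframe src comes from the markup above.

```python
from urllib.parse import urljoin

def crawl_chain(start_url, iframe_src, depth):
    """Return the URLs requested when every response embeds the same relative iframe."""
    urls = [start_url]
    for _ in range(depth):
        # Each response is the same page, so its iframe src is resolved
        # against the URL that was just fetched.
        urls.append(urljoin(urls[-1], iframe_src))
    return urls

chain = crawl_chain(
    "https://imaginatorium.org/sano/tanbo.htm/art/stuff.htm",  # hypothetical bogus start
    "pics/b045pano.htm",
    3,
)
for url in chain:
    print(url)
```

Each step adds one more pics/ segment, so no URL ever repeats and a crawler that deduplicates by URL has no reason to stop.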

So what is going on? I cannot believe that this was the intention of the HTTP designers. I tried the Apache documentation, but could not find anything resembling a specification, just a vague statement about the file tree, and lots of "Try this, it may work for you" type stuff. Is this standard Apache behaviour, or could it be some problem with the Pair.com implementation? Can someone try the same trick on their own server? (This test will not work on the original paddy-field page, because I put this in .htaccess ...)

QUOTE
DirectoryIndex sano.htm index.htm
AddType application/x-httpd-php .php .htm
#
RewriteEngine On
RewriteBase /
# Block bogus tanbo.htm/art/stuff/... and deliver 403 access denied
RewriteCond %{REQUEST_URI} \.htm/
RewriteRule \.htm/ - [F,L]
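
As a quick check of which request paths the condition above would catch, the pattern can be tried in Python. This is only a sketch: it assumes Python's re module and Apache's regex engine agree on this simple pattern, it models the RewriteCond test against REQUEST_URI only, and is_blocked is a hypothetical helper name.

```python
import re

# Same pattern as in the RewriteCond: a literal ".htm" followed directly by a slash.
BLOCK = re.compile(r"\.htm/")

def is_blocked(request_path):
    """True if the path contains '.htm/' anywhere, i.e. the rule would answer 403."""
    return BLOCK.search(request_path) is not None

for path in ("/sano/tanbo.htm",
             "/sano/tanbo.htm/art/stuff.htm",
             "/ensky.html/ship.php"):
    print(path, is_blocked(path))
```

Note that the last path is not caught: ".html/" does not contain ".htm/", so a rule like this covers .htm files only.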


Grateful for suggestions: if I can work out whether it is specific to pair.com I can go either to them or to Apache support...
pandy
post Jul 14 2024, 10:43 AM
Post #2


🌟Computer says no🌟
********

Group: WDG Moderators
Posts: 20,758
Joined: 9-August 06
Member No.: 6



QUOTE(Brian Chandler @ Jul 14 2024, 05:28 PM) *



But to go back to the beginning, the weirdness is this. Try the page: https://imaginatorium.com/ensky.html - should work. Now try https://imaginatorium.com/ensky.html/ship.php - I expect this to fail, since there is no file called ensky.html/ship.php in the relevant directory. But Apache simply serves the same page: it appears to go through the url until it finds an existing file, ignoring the rest of the string, BUT treating the last slash in the url as marking the "current directory". So if you click any of the links on this page, for example "Shop front" it goes to https://imaginatorium.com/ensky.html/shop.html - and this of course returns the current page all over again.


I don't know. https://imaginatorium.com/ensky.html/ship.php isn't a valid URL (can't do the slash thing after a file name). Could it be the server is configured to just ignore a slash after a file name and any mumbo jumbo that comes after it and just reload the page? Don't know why it would be, but it's all I can think of.
https://imaginatorium.com/ensky.html/
https://imaginatorium.com/ensky.html/qwertyuio
https://imaginatorium.com/ensky.html/qwertyuio.html
Christian J
post Jul 14 2024, 08:11 PM
Post #3


.
********

Group: WDG Moderators
Posts: 9,725
Joined: 10-August 06
Member No.: 7



QUOTE(Brian Chandler @ Jul 14 2024, 05:28 PM) *

Now try https://imaginatorium.com/ensky.html/ship.php - I expect this to fail, since there is no file called ensky.html/ship.php in the relevant directory. But Apache simply serves the same page: it appears to go through the url until it finds an existing file, ignoring the rest of the string, BUT treating the last slash in the url as marking the "current directory".

I have a host that suddenly enabled "content negotiation" (is that the term?) on my site without asking or telling, IIRC making URLs like example.com/foo return example.com/foo.html (instead of example.com/foo/ as usual). Could it be that your host is using something similar, but more advanced?

QUOTE
Can someone try the same trick on their own server?

I can check tomorrow if I remember, but there seem to be lots of different Apache configurations these days, and I have no idea where the documentation for them is. For example, the same host as above tends to make my htaccess directives stop working every time they change their Unix server software.
Christian J
post Jul 14 2024, 08:21 PM
Post #4


.
********

Group: WDG Moderators
Posts: 9,725
Joined: 10-August 06
Member No.: 7



QUOTE(pandy @ Jul 14 2024, 05:43 PM) *

I don't know. https://imaginatorium.com/ensky.html/ship.php isn't a valid URL (can't do the slash thing after a file name).

In other words you can't use period signs in directory names (like you can with Windows folders, in this case the folder "ensky.html")? unsure.gif
pandy
post Jul 14 2024, 08:49 PM
Post #5


🌟Computer says no🌟
********

Group: WDG Moderators
Posts: 20,758
Joined: 9-August 06
Member No.: 6



Is ensky.html a directory and not a HTML document? unsure.gif
Dag
post Jul 15 2024, 02:32 AM
Post #6


Advanced Member
****

Group: Members
Posts: 122
Joined: 24-October 06
Member No.: 549



QUOTE(pandy @ Jul 14 2024, 07:43 PM) *


... Could it be the server is configured to just ignore a slash after a file name and any mumbo jumbo that comes after it and just reload the page? Don't know why it would be, but it's all I can think of.
https://imaginatorium.com/ensky.html/qwertyuio.html


Interesting... it seems that the default Apache (or browser?) behaviour is to ignore a trailing slash. Same on my server. And here too!

You should try:
https://forums.htmlhelp.com/index.php?act=idx
idx is variable
https://forums.htmlhelp.com/index.php?act=idx/
but the above one also works (incredible!). The next one too:
https://forums.htmlhelp.com/index.php?act=idx/abracadabra

In your case, you can't see
https://imaginatorium.com/ship.php
in the URI
https://imaginatorium.com/ensky.html/ship.php
because 'ensky.html' is a real, valid file, and that is what gets returned.

I am not sure it has anything to do with content negotiation (configured via htaccess), which deals with different file types that share the same name.

In the case of existing real files:
https://imaginatorium.com/ensky.html
https://imaginatorium.com/ensky.jpg
https://imaginatorium.com/ensky.php

for a URI request of
https://imaginatorium.com/ensky
the server will negotiate and return whichever one 'he' decides is the proper solution.

This works (the file is 'analize.html'):
http://www.laban.rs/r/a/analize
but this doesn't (the file is 'ensky.html' - 404 returned):
https://imaginatorium.com/ensky

Your content negotiation is off.
Christian J
post Jul 15 2024, 07:32 AM
Post #7


.
********

Group: WDG Moderators
Posts: 9,725
Joined: 10-August 06
Member No.: 7



QUOTE(pandy @ Jul 15 2024, 03:49 AM) *

Is ensky.html a directory and not a HTML document? unsure.gif

In Brian's case it's obviously an HTML document. But when followed by a slash it looks like a directory. I don't know if it's valid URL though. unsure.gif

pandy
post Jul 15 2024, 09:53 AM
Post #8


🌟Computer says no🌟
********

Group: WDG Moderators
Posts: 20,758
Joined: 9-August 06
Member No.: 6



QUOTE(Dag @ Jul 15 2024, 09:32 AM) *



Interesting... it seems that the default Apache (or browser?) behaviour is to ignore a trailing slash. Same on my server. And here too!

You should try:
https://forums.htmlhelp.com/index.php?act=idx
idx is variable
https://forums.htmlhelp.com/index.php?act=idx/
but the above one also works (incredible!). The next one too:
https://forums.htmlhelp.com/index.php?act=idx/abracadabra



But it's part of the query string in those examples. With the construction Brian uses we get a 404.
https://htmlhelp.com/reference/html40/entit.../qwertyuio.html

pandy
post Jul 15 2024, 10:08 AM
Post #9


🌟Computer says no🌟
********

Group: WDG Moderators
Posts: 20,758
Joined: 9-August 06
Member No.: 6



QUOTE(Christian J @ Jul 15 2024, 02:32 PM) *

QUOTE(pandy @ Jul 15 2024, 03:49 AM) *

Is ensky.html a directory and not a HTML document? unsure.gif

In Brian's case it's obviously an HTML document. But when followed by a slash it looks like a directory. I don't know if it's valid URL though. unsure.gif


Surely periods in file and folder names are allowed on *nix too. Question is how to construct a URL to them so that they are not seen as filename.ext. If you have a URL like this

http://example.com/valid.folder/


Will the server pick the index file in valid.folder or will it ignore the ending slash and try to find a file called valid with the extension folder?

But what I meant with not valid in this case is that AFAIK a slash can't be used in the way Brian uses it https://imaginatorium.com/ensky.html/ship.php . What's that last slash even meant to mean?
Christian J
post Jul 15 2024, 10:30 AM
Post #10


.
********

Group: WDG Moderators
Posts: 9,725
Joined: 10-August 06
Member No.: 7



QUOTE(Brian Chandler @ Jul 14 2024, 05:28 PM) *

Can someone try the same trick on their own server?

Both my webhost and my local XAMPP test server seem to do the same. This:

CODE
localhost/foo/existing-file.html/non-existing-file.html

returns the content of this HTML file

CODE
localhost/foo/existing-file.html



Christian J
post Jul 15 2024, 10:35 AM
Post #11


.
********

Group: WDG Moderators
Posts: 9,725
Joined: 10-August 06
Member No.: 7



QUOTE(pandy @ Jul 15 2024, 05:08 PM) *

Surely periods in file and folder names are allowed on *nix too. Question is how to construct a URL to them so that they are not seen as filename.ext. If you have a URL like this

http://example.com/valid.folder/


Will the server pick the index file in valid.folder

That's what I would assume. Doesn't the ending slash indicate that "valid.folder" should be a folder? If no such folder exists, the server should return a 404.

QUOTE
or will it ignore the ending slash and try to find a file called valid with the extension folder?

That's what seems to happen in practice, but it doesn't make sense to me. Maybe it's another example of software trying to be "helpful". wacko.gif

QUOTE
But what I meant with not valid in this case is that AFAIK a slash can't be used in the way Brian uses it https://imaginatorium.com/ensky.html/ship.php . What's that last slash even meant to mean?

I assume that's a buggy URL. If it was intentional, "ensky.html" would have to be a folder (obviously it would be a confusing name for a folder).
Brian Chandler
post Jul 15 2024, 11:40 AM
Post #12


Jocular coder
********

Group: Members
Posts: 2,476
Joined: 31-August 06
Member No.: 43



Thanks for responses. Some points in no particular order...

In Unix a directory name can be almost anything, including fred.html; the only prohibited character is / (slash).

I believe that https://any.domain.name followed by almost anything, separated by / ? or perhaps other non-alphanumeric characters is a "valid" URL - it is up to the server to interpret it as it wishes. And in Dag's example https://forums.htmlhelp.com/index.php?act=idx/abracadabra the argument string should result in a GET variable with a value of "idx/abracadabra", so again it is only a question of how the server parses this. If it does an initial string match for 'idx' then 'idx/abracadabra' will act as 'idx'; if you check whether the GET argument is exactly 'idx', then it should fail.
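
The difference between the two kinds of check can be sketched like this. It is only an illustration in Python of how a query string is split; how the forum software actually tests the act parameter is not known, and both comparisons are assumptions.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://forums.htmlhelp.com/index.php?act=idx/abracadabra"

# Everything after '?' is the query string; the slash is just part of the value.
params = parse_qs(urlsplit(url).query)
act = params["act"][0]

# Two hypothetical ways an application might test the value:
prefix_match = act.startswith("idx")   # lenient: treats 'idx/abracadabra' as 'idx'
exact_match = (act == "idx")           # strict: rejects anything but 'idx'

print(act, prefix_match, exact_match)
```

Either way the web server is not involved: the path part of the URL is still just /index.php.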

But anyway, the problem is that the Apache behaviour I am getting is wildly different from what I expect, and I don't know if this is simply Apache "more-or-less normal", caused by some strange setting on the Pair server, caused by my .htaccess settings, or just a bug. And whether Apache has a policy of responding to support requests by sorting out bugs, or whether it's just the wild west. Because it is a shared server, I have to put .htaccess files in each directory, which slows down the server; it has to look at each directory in turn, so logically it might be parsing the file path, rather than just throwing the whole filename at the file system. Even so, it should not be treating the name of the current directory as different from the actual directory from which it got the file.

The people at Pair did not say much, but something about having error handling directives for 403 and 404, and how "recently" they have seen more of these circling URLs including chains of bogus directories.

Anyway, I would be grateful if someone else can check a different server with the recursive iframe example. Don't leave it there for long, because if a bot finds it you risk a large surcharge.

Christian J
post Jul 15 2024, 03:34 PM
Post #13


.
********

Group: WDG Moderators
Posts: 9,725
Joined: 10-August 06
Member No.: 7



QUOTE(Brian Chandler @ Jul 15 2024, 06:40 PM) *

I don't know if this is simply Apache "more-or-less normal", caused by some strange setting on the Pair server, caused by my .htaccess settings, or just a bug.

I suppose neither the webhost nor your .htaccess file are responsible, since I got the same result in XAMPP.

Can't say if it's an Apache bug without knowing what the spec says (which I don't). Is it this one? unsure.gif https://www.rfc-editor.org/rfc/rfc3986

Maybe someone with a non-Apache server could test as well...

QUOTE
And whether Apache has a policy of responding to support requests by sorting out bugs, or whether it's just the wild west.

Seems there are a few Apache forums, someone there might know.

QUOTE
"recently" they have seen more of these circling URLs including chains of bogus directories.

Something must have caused such a recent change. Maybe MJ12bot has become buggy (or more intrusive) lately? I've also read that scraper bots are disguising themselves with that "MJ12bot" name.
pandy
post Jul 15 2024, 06:37 PM
Post #14


🌟Computer says no🌟
********

Group: WDG Moderators
Posts: 20,758
Joined: 9-August 06
Member No.: 6



The behavior is not the same on this domain.
Christian J
post Jul 15 2024, 08:13 PM
Post #15


.
********

Group: WDG Moderators
Posts: 9,725
Joined: 10-August 06
Member No.: 7



You're right, this gives a 404:

CODE
https://htmlhelp.com/reference/html40/structure.html/foo.html

Good idea to check with any server, no need to just use our own.
Brian Chandler
post Jul 15 2024, 11:20 PM
Post #16


Jocular coder
********

Group: Members
Posts: 2,476
Joined: 31-August 06
Member No.: 43



QUOTE(Christian J @ Jul 16 2024, 05:34 AM) *

QUOTE(Brian Chandler @ Jul 15 2024, 06:40 PM) *

I don't know if this is simply Apache "more-or-less normal", caused by some strange setting on the Pair server, caused by my .htaccess settings, or just a bug.

I suppose neither the webhost nor your .htaccess file are responsible, since I got the same result in XAMPP.

Can't say if it's an Apache bug without knowing what the spec says (which I don't). Is it this one? unsure.gif https://www.rfc-editor.org/rfc/rfc3986

...


Quote: "This specification does not mandate a particular registered name lookup technology and therefore does not restrict the syntax of reg-name beyond what is necessary for interoperability."

In other words all of what we are talking about are valid URLs; the question is about what Apache should do with them. I have a fix: add a .htaccess rewrite rule to replace any of .html/ .php/ etc etc by a 403 response. But this seems crazy, me having to patch against non-standard behaviour.
Brian Chandler
post Jul 20 2024, 11:00 AM
Post #17


Jocular coder
********

Group: Members
Posts: 2,476
Joined: 31-August 06
Member No.: 43



I am still trying to get to the bottom of this. This page simply echoes what Apache thinks it is doing:
https://imaginatorium.org/stuff/echo.php

Add anything on to the end, and see what happens - just a slash is enough to break everything:
https://imaginatorium.org/stuff/echo.php/

Then if you add more, e.g.
https://imaginatorium.org/stuff/echo.php/fish/chips.html

...in particular the first two $_SERVER values are seriously weird:

PATH_TRANSLATED /usr/www/users/horigome/fish/chips.html
SCRIPT_FILENAME /usr/www/users/horigome/stuff/echo.php

The link to the top page (index.htm) is written as ../index.htm, but gets resolved to /stuff/echo.php/index.htm
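
Both observations can be reproduced with a simplified model. This is a sketch, not Apache's actual code: it assumes the server walks the path until a prefix names an existing file, treats the remainder as path info, and maps that remainder onto the document root. The document root is inferred from the SCRIPT_FILENAME value above, and split_path_info is a hypothetical helper.

```python
from urllib.parse import urljoin

DOCROOT = "/usr/www/users/horigome"   # inferred from SCRIPT_FILENAME, an assumption
EXISTING_FILES = {"/stuff/echo.php"}  # stand-in for the real file system

def split_path_info(request_path, existing_files):
    """Return (script path, path info): stop at the first prefix that is a real file."""
    segments = request_path.lstrip("/").split("/")
    for i in range(1, len(segments) + 1):
        candidate = "/" + "/".join(segments[:i])
        if candidate in existing_files:
            rest = segments[i:]
            return candidate, ("/" + "/".join(rest) if rest else "")
    return None, ""

script, path_info = split_path_info("/stuff/echo.php/fish/chips.html", EXISTING_FILES)
script_filename = DOCROOT + script      # where the file that runs really lives
path_translated = DOCROOT + path_info   # the leftover path mapped onto the document root

# The browser, meanwhile, resolves ../index.htm against the full request URL.
link = urljoin("https://imaginatorium.org/stuff/echo.php/fish/chips.html", "../index.htm")

print(script_filename)
print(path_translated)
print(link)
```

Under these assumptions the model gives the same PATH_TRANSLATED, SCRIPT_FILENAME and resolved link as reported above, which suggests the values are consistent with path-info handling rather than random breakage.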

I would be grateful if anyone has any ideas about this, or can suggest the best forum to ask for Apache expertise.
Brian Chandler
post Aug 4 2024, 01:33 AM
Post #18


Jocular coder
********

Group: Members
Posts: 2,476
Joined: 31-August 06
Member No.: 43



QUOTE(Brian Chandler @ Jul 21 2024, 01:00 AM) *

I am still trying to get to the bottom of this. This page simply echoes what Apache thinks it is doing:
https://imaginatorium.org/stuff/echo.php

Add anything on to the end, and see what happens - just a slash is enough to break everything:
https://imaginatorium.org/stuff/echo.php/

Then if you add more, e.g.
https://imaginatorium.org/stuff/echo.php/fish/chips.html

...in particular the first two $_SERVER values are seriously weird:

PATH_TRANSLATED /usr/www/users/horigome/fish/chips.html
SCRIPT_FILENAME /usr/www/users/horigome/stuff/echo.php

The link to the top page (index.htm) is written as ../index.htm, but gets resolved to /stuff/echo.php/index.htm

I would be grateful if anyone has any ideas about this, or can suggest the best forum to ask for Apache expertise.


I think I have resolved this. The key is the Apache configuration setting AcceptPathInfo. This is all about handling requests with a program (CGI, the Common Gateway Interface), rather than directly accessing the file tree on the server. For example, you use some "content management" or similar system called 'fiddle', which you put in the top-level directory; then all accesses are of the form https://mysite.com/fiddle/fiddle-page-identifier. The server loads the program fiddle, and passes the rest of the URL as "path info". The Apache documentation is not terribly helpful, not terribly comprehensive, and not written in terribly grammatical English either, so it does not appear to explain that this also changes how relative links get resolved: the browser resolves them against the full request URL, path info included. Instead of a relative link to "index.html" meaning "go to index.html in the directory the file came from", it means "carry on using fiddle, with the path info now ending in index.html".

Here are some references:

Apache spec here: https://httpd.apache.org/docs/2.4/mod/core....#AcceptPathInfo
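...basically says that the default setting is "Default", which leaves the decision to the handler: the core handler for ordinary files rejects "path info", while script handlers such as cgi-script generally accept it.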

https://www.a2hosting.com/kb/developer-corn...info-directive/
...says: "By default, URLs cannot contain trailing pathname information."

Need Apache directive:
AcceptPathInfo Off

The a2hosting page above also says: "However, some third-party software packages, such as the Moodle course management system, use URLs with pathname information, and will not function correctly."

I guess that some user was trying to use Moodle, and it "didn't work"; they discovered that changing AcceptPathInfo to "On" made it work, and didn't bother to understand how this would mess up other users.

*** Forgot: I added a directory nopath, and put the AcceptPathInfo Off in .htaccess. So if you go to https://imaginatorium.org/stuff/nopath/echo...fish/chips.html everything is as expected, and you get 404s for all the strange URLs.

This post has been edited by Brian Chandler: Aug 4 2024, 01:38 AM
Brian Chandler
post Aug 7 2024, 08:30 AM
Post #19


Jocular coder
********

Group: Members
Posts: 2,476
Joined: 31-August 06
Member No.: 43



QUOTE(Dag @ Jul 15 2024, 04:32 PM) *

QUOTE(pandy @ Jul 14 2024, 07:43 PM) *


... Could it be the server is configured to just ignore a slash after a file name and any mumbo jumbo that comes after it and just reload the page? Don't know why it would be, but it's all I can think of.
https://imaginatorium.com/ensky.html/qwertyuio.html


Interesting... it seems that the default Apache (or browser?) behaviour is to ignore a trailing slash. Same on my server. And here too!

You should try:
https://forums.htmlhelp.com/index.php?act=idx
idx is variable
https://forums.htmlhelp.com/index.php?act=idx/
but the above one also works (incredible!). The next one too:
https://forums.htmlhelp.com/index.php?act=idx/abracadabra

In your case, you can't see
https://imaginatorium.com/ship.php
in the URI
https://imaginatorium.com/ensky.html/ship.php
because 'ensky.html' is a real, valid file, and that is what gets returned.

I am not sure it has anything to do with content negotiation (configured via htaccess), which deals with different file types that share the same name.

In the case of existing real files:
https://imaginatorium.com/ensky.html
https://imaginatorium.com/ensky.jpg
https://imaginatorium.com/ensky.php

for a URI request of
https://imaginatorium.com/ensky
the server will negotiate and return whichever one 'he' decides is the proper solution.

This works (the file is 'analize.html'):
http://www.laban.rs/r/a/analize
but this doesn't (the file is 'ensky.html' - 404 returned):
https://imaginatorium.com/ensky

Your content negotiation is off.


What does "Your content negotiation is off." mean?