HTMLHelp Forums _ Client-side Scripting _ Incorrect sorting of the Scandinavian alphabets

Posted by: Christian J Oct 23 2006, 06:11 PM

PHP is not alone in sorting the Swedish letters å, ä and ö incorrectly: I've now noticed that JavaScript does the same, and it also gets Danish and Norwegian wrong. The arrays below list the letters in the correct order for each language:

CODE
window.onload=function()
{    
    var se=['å','ä','ö']; // Swedish
    var dk=['æ','ø','å']; // Danish, apparently same as Norwegian
    
    alert(se.sort());
    alert(dk.sort());    
}


Note that Danish and Norwegian use a different order than Swedish. But in the sorted JavaScript alerts the Swedish letters are incorrectly sorted as "ä,å,ö", while Danish and Norwegian are (again incorrectly) sorted as "å,æ,ø". The same error appears in IE, Opera and Firefox. At least Opera's Norwegian creators should know their own alphabet, so am I correct in assuming that all three browser vendors deliberately follow some flawed convention?

Posted by: Darin McGrew Oct 23 2006, 06:55 PM

Does PHP allow you to specify the locale? The default locale is often "C", which sorts characters according to their numeric encoding. Other locales should sort characters as appropriate for that locale.

Posted by: Liam Quinn Oct 23 2006, 08:03 PM

The default sort algorithm in JavaScript is based purely on the Unicode code point. If you want a locale-sensitive sort order, you can use this:

CODE

function localeSort(string1, string2) {
  return string1.toString().localeCompare(string2.toString());
}

var se=['å','ä','ö']; // Swedish
var dk=['æ','ø','å']; // Danish, apparently same as Norwegian

alert(se.sort(localeSort));
alert(dk.sort(localeSort));


That should use the locale configured on the user's system. If you want to use a specific locale regardless of the user's locale, I think you're stuck with writing the code for the locale-specific rules yourself in the function you pass to sort().
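
A minimal sketch of what such a hand-written comparison function could look like, assuming the alphabet orders given in the first post. The helper name `makeAlphabetComparator` is hypothetical, and the sketch only ranks single letters; real Swedish and Danish collation has further rules that it ignores.

```javascript
// Hypothetical helper (not a built-in): builds a comparison function
// for sort() from an explicit alphabet string.
function makeAlphabetComparator(alphabet) {
  var rank = {};
  for (var i = 0; i < alphabet.length; i++) {
    rank[alphabet.charAt(i)] = i;
  }

  // Characters missing from the alphabet sort after all known letters,
  // ordered among themselves by their code unit value.
  function rankOf(ch) {
    return Object.prototype.hasOwnProperty.call(rank, ch)
      ? rank[ch]
      : alphabet.length + ch.charCodeAt(0);
  }

  return function (a, b) {
    a = String(a).toLowerCase();
    b = String(b).toLowerCase();
    var n = Math.min(a.length, b.length);
    for (var i = 0; i < n; i++) {
      var d = rankOf(a.charAt(i)) - rankOf(b.charAt(i));
      if (d !== 0) return d;
    }
    return a.length - b.length; // a shorter prefix sorts first
  };
}

var swedish = makeAlphabetComparator('abcdefghijklmnopqrstuvwxyzåäö');
var danish = makeAlphabetComparator('abcdefghijklmnopqrstuvwxyzæøå');

var seSorted = ['ö', 'a', 'ä', 'z', 'å'].sort(swedish); // a, z, å, ä, ö
var dkSorted = ['å', 'a', 'ø', 'z', 'æ'].sort(danish);  // a, z, æ, ø, å
```

Because the order comes from the alphabet string rather than from the user's system, the result is the same on every machine, at the cost of maintaining the rules yourself.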

Posted by: Christian J Oct 24 2006, 05:48 AM

QUOTE(Darin McGrew @ Oct 24 2006, 01:55 AM) *

Does PHP allow you to specify the locale?

It does, but it seems to be buggy. The entry on http://bugs.php.net/bug.php?id=9671 (10 Mar 2001 1:36pm) suggests something like this, which still sorts in the wrong order (PHP 4.3.3):

CODE
<?php
// Danish letters
$dk = array('ø', 'æ', 'å');
setlocale(LC_COLLATE, "dk_DK");
usort($dk, "strcoll");
print_r($dk); // returns "Array ( [0] => å [1] => æ [2] => ø )"

echo '<br>';

// Norwegian letters
$no = array('ø', 'æ', 'å');
setlocale(LC_COLLATE, "no_NO");
usort($no, "strcoll");
print_r($no); // returns "Array ( [0] => å [1] => æ [2] => ø )"

echo '<br>';

// Swedish letters
$se = array('å', 'ä', 'ö');
setlocale(LC_COLLATE, "sv_SV");
usort($se, "strcoll");
print_r($se); // returns "Array ( [0] => ä [1] => å [2] => ö )"
?>

Posted by: Christian J Oct 24 2006, 07:46 AM

QUOTE(Liam Quinn @ Oct 24 2006, 03:03 AM) *

The default sort algorithm in JavaScript is based purely on the Unicode code point.

According to http://en.wikipedia.org/wiki/Unicode#Origin_and_development the first 256 code points are identical to http://en.wikipedia.org/wiki/ISO_8859-1#Code_table, and there you can indeed find "ä" before "å".

QUOTE
If you want a locale-sensitive sort order, you can use this:
CODE

function localeSort(string1, string2) {
  return string1.toString().localeCompare(string2.toString());
}

var se=['å','ä','ö']; // Swedish
var dk=['æ','ø','å']; // Danish, apparently same as Norwegian

alert(se.sort(localeSort));
alert(dk.sort(localeSort));


That should use the locale configured on the user's system.

Do you mean the user's OS or browser language settings? On my Swedish Win XP it seems to work in IE6 and Firefox, but Opera still sorts as before (despite claiming to support the localeCompare() method since Opera 7).

QUOTE
If you want to use a specific locale regardless of the user's locale...

Regarding usability: what if a non-Swedish user reads a Swedish web page, wouldn't they (as I believe) expect letters to be sorted according to their own habit? E.g., wouldn't a typical English-speaking user expect "å" and "ä" to be treated as "a", and "ö" to be treated as "o"?

Posted by: Darin McGrew Oct 24 2006, 01:49 PM

QUOTE
wouldn't a typical English-speaking user expect "å" and "ä" to be treated as "a", and "ö" to be treated as "o"?
I can't say whether I'm "a typical English-speaking user", but I would expect a Swedish page to sort Swedish names according to the normal Swedish rules for alphabetizing names. I would expect Danish and Norwegian pages to use the Danish and Norwegian alphabetizing rules (respectively). And so on.

Posted by: Christian J Oct 24 2006, 02:21 PM

QUOTE(Darin McGrew @ Oct 24 2006, 08:49 PM) *

QUOTE
wouldn't a typical English-speaking user expect "å" and "ä" to be treated as "a", and "ö" to be treated as "o"?
I can't say whether I'm "a typical English-speaking user", but I would expect a Swedish page to sort Swedish names according to the normal Swedish rules for alphabetizing names. I would expect Danish and Norwegian pages to use the Danish and Norwegian alphabetizing rules (respectively). And so on.


But what if you (the English-speaking user) don't know the Swedish rules? Suppose you're looking for a name like "Åsa" or "Örjan" in a very long alphabetically sorted list, where in the list would you begin to look?
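
The "treat å and ä as a, and ö as o" behaviour could be sketched roughly like this, assuming a JavaScript engine that provides String.prototype.normalize() (which postdates the browsers discussed here). The function names and the name list are made up for illustration. Note that this only works for letters that decompose into a base letter plus a combining mark; the Danish/Norwegian æ and ø have no such decomposition and are left unchanged.

```javascript
// Strip combining diacritical marks: 'Å' becomes 'A', 'Ö' becomes 'O'.
function foldDiacritics(s) {
  return s.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Compare on the folded, lowercased form; fall back to the raw strings
// so that names differing only in diacritics still get a stable order.
function foldedCompare(a, b) {
  var fa = foldDiacritics(a).toLowerCase();
  var fb = foldDiacritics(b).toLowerCase();
  if (fa < fb) return -1;
  if (fa > fb) return 1;
  return a < b ? -1 : (a > b ? 1 : 0);
}

// Illustrative list only
var names = ['Örjan', 'Olof', 'Åsa', 'Anna', 'Bo'];
names.sort(foldedCompare); // Anna, Åsa, Bo, Olof, Örjan
```

With this ordering "Åsa" ends up among the A names and "Örjan" among the O names, which is where a reader unfamiliar with the Swedish alphabet would probably start looking.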

Posted by: Darin McGrew Oct 24 2006, 02:36 PM

QUOTE(Christian J @ Oct 24 2006, 12:21 PM) *
But what if you (the English-speaking user) don't know the Swedish rules? Suppose you're looking for a name like "Åsa" or "Örjan" in a very long alphabetically sorted list, where in the list would you begin to look?
Here's where I know I'm atypical: I'd look for an index of some sort. If the index listed ABCDEFGHIJKLMNOPQRSTUVWXYZ, then I'd look under A or O. But if the index listed ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ, then I'd look under Å or Ö.

Posted by: Liam Quinn Oct 24 2006, 07:48 PM

QUOTE(Christian J @ Oct 24 2006, 06:48 AM) *

QUOTE(Darin McGrew @ Oct 24 2006, 01:55 AM) *

Does PHP allow you to specify the locale?

It does, but it seems to be buggy. The entry on http://bugs.php.net/bug.php?id=9671 (10 Mar 2001 1:36pm) suggests something like this, which still sorts in the wrong order (PHP 4.3.3):

CODE
<?php
// Danish letters
$dk = array('ø', 'æ', 'å');
setlocale(LC_COLLATE, "dk_DK");
usort($dk, "strcoll");
print_r($dk); // returns "Array ( [0] => å [1] => æ [2] => ø )"

echo '<br>';

// Norwegian letters
$no = array('ø', 'æ', 'å');
setlocale(LC_COLLATE, "no_NO");
usort($no, "strcoll");
print_r($no); // returns "Array ( [0] => å [1] => æ [2] => ø )"

echo '<br>';

// Swedish letters
$se = array('å', 'ä', 'ö');
setlocale(LC_COLLATE, "sv_SV");
usort($se, "strcoll");
print_r($se); // returns "Array ( [0] => ä [1] => å [2] => ö )"
?>



The user comments in http://ca3.php.net/setlocale may help you determine whether your system has the locales installed. One problem is that you have the Danish and Swedish locale codes wrong: They should be "da_DK" and "sv_SE" (language_COUNTRY).

Posted by: Liam Quinn Oct 24 2006, 08:05 PM

QUOTE(Christian J @ Oct 24 2006, 08:46 AM) *

QUOTE
If you want a locale-sensitive sort order, you can use this:
CODE

function localeSort(string1, string2) {
  return string1.toString().localeCompare(string2.toString());
}

var se=['å','ä','ö']; // Swedish
var dk=['æ','ø','å']; // Danish, apparently same as Norwegian

alert(se.sort(localeSort));
alert(dk.sort(localeSort));


That should use the locale configured on the user's system.

Do you mean the user's OS or browser language settings?


I think that's up to the browser implementation.

QUOTE

Regarding usability: what if a non-Swedish user reads a Swedish web page, wouldn't they (as I believe) expect letters to be sorted according to their own habit? E.g., wouldn't a typical English-speaking user expect "å" and "ä" to be treated as "a", and "ö" to be treated as "o"?


If the page is in Swedish, I think you should assume that the reader knows Swedish and that Swedish sorting rules are appropriate.
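
For readers coming to this thread later: one way to apply the page's sorting rules regardless of the reader's system is to name the locale explicitly. The sketch below assumes an engine that implements the ECMAScript Internationalization API (Intl.Collator), which did not exist when this thread was written, and it assumes the engine ships collation data for "sv" and "da"; if it does not, it silently falls back to its default locale.

```javascript
// Collators pinned to a specific language instead of the user's locale
var svCollator = new Intl.Collator('sv'); // Swedish
var daCollator = new Intl.Collator('da'); // Danish

// Which of the requested locales does this engine actually support?
var supported = Intl.Collator.supportedLocalesOf(['sv', 'da']);

// collator.compare is already bound, so it can be passed straight to sort()
var se = ['ö', 'ä', 'å'].sort(svCollator.compare);
var dk = ['å', 'ø', 'æ'].sort(daCollator.compare);

// With Swedish and Danish data available, the orders should match the
// first post: å, ä, ö for Swedish and æ, ø, å for Danish.
```

Checking `supported` before trusting the result is worthwhile, since an unsupported locale does not raise an error.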

Posted by: Christian J Oct 25 2006, 07:04 AM

QUOTE(Liam Quinn @ Oct 25 2006, 02:48 AM) *

One problem is that you have the Danish and Swedish locale codes wrong: They should be "da_DK" and "sv_SE" (language_COUNTRY).

The locale codes indeed seem to be the problem. Like http://ca3.php.net/setlocale says, different systems have different naming schemes for locales, but you can apparently use an array of codes. The following works both on my Apache/Windows test server and on my web host's FreeBSD:

CODE
setlocale(LC_COLLATE, "sve", "sv_SE.ISO8859-1");

But while "nor" and "dan" work for Norwegian and Danish on my Apache/Windows, I haven't been able to make any code work for them on my web host yet. E.g., even though the following echoes "da_DK.ISO8859-1" as the preferred locale, it doesn't sort properly:

CODE
<?php
// Danish letters
$dk = array('a', 'b', 'o', 'æ', 'ø', 'å');
setlocale(LC_COLLATE, "dan", "da_DK.ISO8859-1");
echo setlocale(LC_COLLATE, "dan", "da_DK.ISO8859-1").'<br>';  // "da_DK.ISO8859-1"
usort($dk, "strcoll");  
print_r($dk); // "Array ( [0] => a [1] => å [2] => æ [3] => b [4] => o [5] => ø )"
?>


I should add that Norwegian and Danish are not an urgent problem, I'm mostly curious.

Posted by: pandy Oct 25 2006, 06:02 PM

QUOTE(Christian J @ Oct 24 2006, 02:46 PM) *

According to http://en.wikipedia.org/wiki/Unicode#Origin_and_development the first 256 code points are identical to http://en.wikipedia.org/wiki/ISO_8859-1#Code_table, and there you can indeed find "ä" before "å".

It's only ASCII characters that are encoded the same in Unicode, isn't it? ÅÄÖ are 0197, 0196 and 0214 in Unicode so indeed Ä comes first.
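
The numeric values are easy to check in the browser itself. A small sketch, assuming the letters are stored in their precomposed (single-character) form; it also shows that the default sort() simply follows those numbers, which reproduces the "wrong" orders from the first post:

```javascript
var letters = ['Å', 'Ä', 'Ö', 'å', 'ä', 'ö', 'æ', 'ø'];
var codes = {};
for (var i = 0; i < letters.length; i++) {
  codes[letters[i]] = letters[i].charCodeAt(0);
}
// Upper case: Å = 197, Ä = 196, Ö = 214
// Lower case: å = 229, ä = 228, ö = 246, æ = 230, ø = 248

// The default sort() compares these numeric values, nothing else
var seDefault = ['å', 'ä', 'ö'].sort(); // ä, å, ö  (228, 229, 246)
var dkDefault = ['æ', 'ø', 'å'].sort(); // å, æ, ø  (229, 230, 248)
```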

Posted by: Brian Chandler Oct 26 2006, 12:32 AM

QUOTE(Christian J @ Oct 25 2006, 09:04 PM) *

QUOTE(Liam Quinn @ Oct 25 2006, 02:48 AM) *

One problem is that you have the Danish and Swedish locale codes wrong: They should be "da_DK" and "sv_SE" (language_COUNTRY).

The locale codes indeed seem to be the problem. Like http://ca3.php.net/setlocale says, different systems have different naming schemes for locales, but you can apparently use an array of codes. The following works both on my Apache/Windows test server and on my web host's FreeBSD:

CODE
setlocale(LC_COLLATE, "sve", "sv_SE.ISO8859-1");

But while "nor" and "dan" work for Norwegian and Danish on my Apache/Windows, I haven't been able to make any code work for them on my web host yet. E.g., even though the following echoes "da_DK.ISO8859-1" as the preferred locale, it doesn't sort properly:

CODE
<?php
// Danish letters
$dk = array('a', 'b', 'o', 'æ', 'ø', 'å');
setlocale(LC_COLLATE, "dan", "da_DK.ISO8859-1");
echo setlocale(LC_COLLATE, "dan", "da_DK.ISO8859-1").'<br>';  // "da_DK.ISO8859-1"
usort($dk, "strcoll");  
print_r($dk); // "Array ( [0] => a [1] => å [2] => æ [3] => b [4] => o [5] => ø )"
?>


I should add that Norwegian and Danish are not an urgent problem, I'm mostly curious.


I'm still a bit puzzled about this. At what point does a web server actually _sort_ anything?
