The Web Design Group

... Making the Web accessible to all.

> Practical advice on how to work with UTF-8 files
Christian J
post Dec 15 2015, 05:16 PM
Post #1


.
********

Group: WDG Moderators
Posts: 7,637
Joined: 10-August 06
Member No.: 7



Those of you who use UTF-8, what does your workflow look like when creating new HTML files?

I suspect I'll keep forgetting to save new files as UTF-8 in my text editor. When that happens, how can you easily tell whether a file was saved as UTF-8 or ANSI (especially if it was saved as UTF-8 without a BOM)? Do you have to look in your text editor's document properties? What if you forget to check, and the web page doesn't contain enough exotic characters to alert you to your mistake?

(I'm thinking of switching to UTF-8 just to please the W3C validator: http://forums.htmlhelp.com/index.php?s=&am...st&p=114090 but otherwise I don't really need it, and fear it will just cause problems.)
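
One rough way to answer the question above without opening the editor's document properties is to look at the raw bytes. The sketch below is only a heuristic, and the names `guess_encoding` and `guess_file_encoding` and the returned labels are made up for illustration. A UTF-8 BOM settles it, pure ASCII bytes are identical in both encodings, bytes that decode cleanly as UTF-8 are very likely UTF-8, and anything else is probably an 8-bit encoding such as ANSI (Windows-1252).

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Heuristically classify raw file bytes (the labels are illustrative)."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8 with BOM"
    try:
        data.decode("ascii")
        # Only ASCII bytes: identical in ANSI and UTF-8, so it is ambiguous.
        return "ascii (same bytes in ANSI and UTF-8)"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8 without BOM"
    except UnicodeDecodeError:
        # Not valid UTF-8, so probably an 8-bit encoding such as Windows-1252.
        return "probably ANSI"


def guess_file_encoding(path: str) -> str:
    """Read a file in binary mode and classify its bytes."""
    with open(path, "rb") as f:
        return guess_encoding(f.read())
```

Calling `guess_file_encoding("index.html")` (a hypothetical file name) returns one of the four labels. A short ANSI text could in theory happen to be valid UTF-8, so treat the answer as a hint rather than proof.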
pandy
post Dec 15 2015, 07:14 PM
Post #2


Don't like donuts. Don't do MySpace.
********

Group: WDG Moderators
Posts: 17,671
Joined: 9-August 06
Member No.: 6



I check what my text editor says it is. It isn't Unicode enabled, so it tells me rather loudly. Among other things it doesn't let me edit the document unless I actively OK that. It also says in the status bar, so I don't need to click anything. But many editors have that feature, I think.

Do you write in Swedish? If you write strictly in English it doesn't matter. You can save as ANSI and serve as UTF-8, but you already know that I guess.

I don't know if there is any real advantage to UTF-8 if you seldom use characters outside the ISO Latin charset. Can't think of any. I guess it would be that it's extensible: if in the future you want to add some larger quotes in a language outside ISO Latin, you don't have to change anything. Just remap your keyboard - like anyone would do that.

You'll probably get a better answer from Darin.
Christian J
post Dec 15 2015, 09:04 PM
Post #3


QUOTE(pandy @ Dec 16 2015, 01:14 AM)

I check what my text editor says it is. It isn't Unicode enabled, so it tells me rather loudly. Among other things it doesn't let me edit the document unless I actively OK that.

I think that only happens in mine if the editor font doesn't contain characters found in the file.

QUOTE
It also says in the status bar, so I don't need to click anything. But many editors have that feature, I think.

Seems mine does too, but as soon as you start editing the info goes away. Maybe that's why I've never noticed it before.

QUOTE
Do you write in Swedish?

That and English mostly, but you never know...

QUOTE
If you write strictly in English it doesn't matter. You can save as ANSI and serve as UTF-8, but you already know that I guess.

No I didn't know that, thank you! Had to test:

- First I saved a file with Swedish ÅÄÖ characters as UTF-8 (without BOM). When I opened it again, my editor said it was UTF-8.

- Next I removed the ÅÄÖ from the same file and saved it again as UTF-8. When I opened it again, my editor now said it was ANSI.

This might trick you if you remove ÅÄÖ characters from a UTF-8 file, save it, and then add ÅÄÖ back later. When it's time to save it again, my text editor then chooses ANSI by default.
pandy
post Dec 15 2015, 09:39 PM
Post #4


My editor can't save as Unicode, so not a problem. I use another editor if I need to use Unicode.

Yeah, the ASCII range of characters is encoded the same in ANSI and UTF-8. Very handy, that. Especially with the editor I use. If I need to edit UTF-8 files I'm alright as long as they are in English.

But regarding your problem, can't you set your editor to default to UTF-8 and use that all the time?
Christian J
post Dec 16 2015, 10:46 AM
Post #5


QUOTE(pandy @ Dec 16 2015, 03:39 AM)

But regarding your problem, can't you set your editor to default to UTF-8 and use that all the time?

Seems I can. Will there be any problems doing this for CSS, JS and PHP files too, as long as I stick to ISO Latin characters? Some sites suggest serving these file formats with UTF-8 headers, which sounds like a complication.
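
For reference, "serving with UTF-8 headers" means the HTTP response carries a `Content-Type` header with a `charset` parameter. A minimal sketch of building that header value, assuming a small hand-written extension table (the helper `content_type_for` and the table `TEXT_TYPES` are hypothetical, not any server's API):

```python
import os

# Illustrative subset of text formats; a real server has a much longer table.
TEXT_TYPES = {
    ".html": "text/html",
    ".css": "text/css",
    ".js": "text/javascript",
}


def content_type_for(path: str, charset: str = "utf-8") -> str:
    """Return a Content-Type header value, adding a charset for text formats."""
    ext = os.path.splitext(path)[1].lower()
    mime = TEXT_TYPES.get(ext)
    if mime is None:
        # Unknown or binary format: a charset parameter would be meaningless.
        return "application/octet-stream"
    return f"{mime}; charset={charset}"
```

With this table, `content_type_for("style.css")` gives `text/css; charset=utf-8`. How such a header is actually configured depends on the web server in use.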
pandy
post Dec 16 2015, 11:59 AM
Post #6


I don't know. I remember reading something about that but it was looong ago. Since I've stuck with iso-latin for files I create myself I haven't given it much thought.
Christian J
post Dec 16 2015, 05:36 PM
Post #7


If ISO Latin characters are encoded the same in ANSI and UTF-8, like you wrote, I guess there is literally no difference -- it's only if you want to use non-ISO Latin characters in CSS/JS/PHP (e.g. Hieroglyphs for variable names, or scripts that print Swedish text) that you may need to send UTF-8 headers.


pandy
post Dec 16 2015, 06:12 PM
Post #8


No, not the whole ANSI range. Just the ASCII range, i.e. a-z, A-Z, 0-9 and common punctuation marks. You can't use ÅÄÖ, ê, ñ, ø, € and so on and save as either ANSI or UTF-8 and serve as whichever you choose. That is, English only!
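
The point can be checked by comparing the bytes each encoding produces. A small sketch, assuming "ANSI" means Windows-1252 (Python's `cp1252` codec), which is the usual meaning on Western European Windows systems. The helpers `same_bytes` and `byte_view` are illustrative names:

```python
from typing import Tuple


def same_bytes(text: str) -> bool:
    """True if text encodes to identical bytes in Windows-1252 and UTF-8."""
    return text.encode("cp1252") == text.encode("utf-8")


def byte_view(text: str) -> Tuple[str, str]:
    """Return the (ANSI, UTF-8) byte sequences of text as hex strings."""
    return text.encode("cp1252").hex(" "), text.encode("utf-8").hex(" ")
```

`same_bytes("Hello, world!")` is true, while `same_bytes("ÅÄÖ")` is false. For a single "å", `byte_view` returns `("e5", "c3 a5")`: one byte in ANSI, two in UTF-8. Text containing characters that Windows-1252 cannot represent at all makes `encode("cp1252")` raise an error.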
Christian J
post Dec 16 2015, 07:50 PM
Post #9


Oops, I meant ASCII, not ANSI.
Christian J
post May 12 2017, 07:14 PM
Post #10


Now I tried converting a web site from iso-8859-1 to UTF-8, using TextPad. At first pages with Swedish text displayed correctly in the browser, but when I viewed source the Swedish letters å, ä and ö were changed into Ã¥, Ã¤ and Ã¶, and the file encoding was back to ANSI. How did that happen?

When I saved the above garbled files as UTF-8 a second time, the browser displayed the garbled characters instead of Swedish text. When I changed the garbled characters back to Swedish text and saved as UTF-8 a third time it suddenly worked, so I guess I must have made some mistake, but I'm not sure what.

After some more testing I wonder if it's the pages' META charset (which I forgot to change from iso-8859-1 to UTF-8) that made TextPad act strangely, but should a text editor pay attention to HTML tags? After batch-correcting the META charsets I batch-saved the files, which now made them become UTF-8 by default, but this time åäö were changed into empty "[]" boxes in the source, and � ("?") characters in my browser. Seems things only work correctly if I save each file individually as UTF-8.

TextPad (v8.1.2) isn't very helpful, BTW. Changing its default encoding for HTML files in the preferences doesn't seem to work; files still get saved as ANSI by default. Batch conversion doesn't seem possible either; instead you must(?) resave each file separately to change its encoding.
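
Outside the editor, a batch conversion can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the existing files are Windows-1252 ("ANSI"), the function name `convert_to_utf8` and its parameters are made up for illustration, and files are rewritten in place, so work on a copy. Files that already decode as UTF-8 are skipped to avoid encoding them twice. It does not touch the META charset inside the files.

```python
from pathlib import Path
from typing import List


def convert_to_utf8(root: str, pattern: str = "*.html",
                    source_encoding: str = "cp1252") -> List[str]:
    """Re-encode matching files under root to UTF-8; return the changed paths."""
    changed = []
    for path in sorted(Path(root).rglob(pattern)):
        data = path.read_bytes()
        try:
            data.decode("utf-8")
            # Already valid UTF-8 (or plain ASCII): skip to avoid double encoding.
            continue
        except UnicodeDecodeError:
            pass
        # Raises UnicodeDecodeError if a byte is undefined in source_encoding.
        text = data.decode(source_encoding)
        path.write_bytes(text.encode("utf-8"))
        changed.append(str(path))
    return changed
```

Working on bytes rather than text mode keeps the original line endings untouched. Running the function a second time on the same folder changes nothing, because every file is valid UTF-8 by then.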
pandy
post May 12 2017, 09:49 PM
Post #11


I have a growing list of those. Can't be bothered to edit the not relevant bits out, you get it. Well, the last isn't Unicode but it's there for a reason.

CODE
^!Replace "Ã¥" >> "å" TWAS
^!Replace "Ã¤" >> "ä" TWAS
^!Replace "Ã¶" >> "ö" TWAS
^!Replace "Ã…" >> "Å" TWAS
^!Replace "Ã„" >> "Ä" TWAS
^!Replace "Ã–" >> "Ö" TWAS
^!Replace "Ã©" >> "é" TWAS
^!Replace "Ã¡" >> "á" TWAS
^!Replace "Ã¨" >> "è" TWAS
^!Replace "Ã¼" >> "ü" TWAS
^!Replace "Ã¸" >> "ø" TWAS
^!Replace "Ã¦" >> "æ" TWAS
^!Replace "â€“" >> "-" TWAS
^!Replace "&amp;" >> "&" TWAS


I know it's a somewhat clumsy way to do it, but since my Unicode-illiterate editor is handy in so many other ways, I open Unicode files in it and replace the garbled characters with their ANSI counterparts. It of course only takes one click to run the above. The movie player I most often use doesn't do Unicode either, so it's mostly for subtitles that happen to be in Unicode.
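
The replacement list above undoes one specific kind of damage: UTF-8 bytes that were read as ANSI. Assuming ANSI means Windows-1252, the same repair can be sketched generically by reversing the wrong decode. `fix_mojibake` is a hypothetical helper, not a feature of any editor mentioned here:

```python
def fix_mojibake(text: str) -> str:
    """Undo 'UTF-8 bytes read as Windows-1252' garbling where possible."""
    try:
        # Turn the garbled characters back into the original bytes,
        # then decode those bytes the way they were meant to be read.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not garbled in this particular way: leave it alone.
        return text
```

Garbled "Ã¥Ã¤Ã¶" comes back as "åäö", while text that is already correct fails the round trip and is returned unchanged. A few characters cannot be repaired this way because Windows-1252 leaves five byte values undefined, so the sketch is all-or-nothing per string.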
Christian J
post May 13 2017, 07:04 AM
Post #12


It does appear TextPad analyzes the META Charset tag. Here are some experiments:

1. I created an HTML file containing a <meta charset="UTF-8"> element and Swedish åäö characters. When saved, the file became UTF-8 by default, and everything worked. Correction: now that I tested again, it did not work. UTF-8 is not the default encoding, even though I set it as the default in the preferences.

2. Created a second identical HTML file, but with a <meta charset="iso-8859-1"> element. When saved the file became ANSI by default. When I changed the META charset to UTF-8 and resaved, the file remained ANSI but Swedish letters were garbled. When I resaved explicitly as UTF-8 everything worked.

3. Created a third identical HTML file, but without any META charset. When saved it became ANSI by default. When I resaved as UTF-8, Swedish letters were garbled. When I added a <meta charset="UTF-8"> element and resaved, the file became UTF-8 and everything worked.
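
For comparison, here is roughly how a tool could read the META charset and check it against the actual bytes. This is a sketch of the general idea only, not a description of what TextPad does. The names `declared_charset` and `matches_declaration` are invented, and the regular expression is a simplification of real HTML parsing.

```python
import re
from typing import Optional

# Matches both <meta charset="..."> and the older http-equiv/content form.
META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def declared_charset(data: bytes) -> Optional[str]:
    """Return the charset named in a META element near the top, if any."""
    match = META_CHARSET.search(data[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def matches_declaration(data: bytes) -> bool:
    """True if the bytes decode without error under the declared charset."""
    name = declared_charset(data)
    if name is None:
        return True  # nothing declared, nothing to contradict
    try:
        data.decode(name)
        return True
    except (UnicodeDecodeError, LookupError):
        return False
```

The check is one-sided. It catches a file that declares UTF-8 but was saved as ANSI, but iso-8859-1 accepts every possible byte, so a UTF-8 file with an iso-8859-1 declaration passes even though its åäö will display garbled.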

This post has been edited by Christian J: May 14 2017, 08:51 AM
pandy
post May 13 2017, 07:21 PM
Post #13


I got lost there at some point...
Christian J
post May 14 2017, 08:50 AM
Post #14


Yes, I think I confused myself too. Maybe I should delete that post to protect the innocent...
Christian J
post May 14 2017, 12:07 PM
Post #15


Did some more tests. This makes my head spin, so maybe I got it wrong again.

TextPad's Preferences let you specify different default encodings for various document classes, but it seems the default for text documents affects other document classes too. In other words, UTF-8 as default for text documents will also apply to new documents saved as HTML --TextPad's default for HTML documents has no effect.

Furthermore, if you explicitly specify a non-default encoding when saving an HTML document, TextPad obeys you even if you use the wrong META charset (and characters like "å ä ö" are consequently garbled). Apparently TextPad encodes text differently depending on the META charset. When you open such a document in TextPad, it again seems TextPad lets the META charset decide the encoding:

- A document with <meta charset="UTF-8"> can be explicitly saved as ANSI (which turns "å ä ö" into "? ? ?"), but when you open it again TextPad considers it UTF-8.

- A document with <meta charset="iso-8859-1"> can be explicitly saved as UTF-8 (which turns "å ä ö" into "Ã¥ Ã¤ Ã¶"), but when you open it again TextPad considers it ANSI.
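
The garbled forms seen in these tests are what you get whenever bytes written in one encoding are read back in the other. A short sketch of both directions, again assuming ANSI means Windows-1252. The function names are illustrative, and this is not a claim about TextPad's internals:

```python
def read_utf8_as_ansi(text: str) -> str:
    """Simulate saving as UTF-8 and then reading the bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def read_ansi_as_utf8(text: str) -> str:
    """Simulate saving as Windows-1252 and then reading the bytes as UTF-8."""
    # Invalid byte sequences become U+FFFD, the replacement character.
    return text.encode("cp1252").decode("utf-8", errors="replace")
```

`read_utf8_as_ansi("å ä ö")` gives "Ã¥ Ã¤ Ã¶", and `read_ansi_as_utf8("å ä ö")` gives three U+FFFD replacement characters separated by spaces, which is the "?" in a diamond that browsers show. Plain ASCII text passes through both functions unchanged.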
pandy
post May 14 2017, 05:07 PM
Post #16


Never used TextPad, so can't be of much help I'm afraid.