HTMLHelp Forums _ General Web Design _ Practical advice on how to work with UTF-8 files

Posted by: Christian J Dec 15 2015, 05:16 PM

Those of you that use UTF-8, what does your workflow look like when creating new HTML files?

I suspect I'll keep forgetting to save new files as UTF-8 in my text editor. When that happens, how can you easily tell if a file was saved as UTF-8 or ANSI (especially if it was saved as UTF-8 without a BOM)? Do you have to look in your text editor's document properties? What if you forget to check, and the web page doesn't contain many exotic characters that may alert you of your mistake?

(I'm thinking of switching to UTF-8 just to please the W3C validator: http://forums.htmlhelp.com/index.php?s=&showtopic=25841&view=findpost&p=114090 but otherwise I don't really need it, and fear it will just cause problems.)
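One way to check a file without relying on the editor's status bar is to look at its raw bytes. The following is a minimal sketch, not a definitive detector: the function name `guess_encoding` is made up for illustration, and "ANSI" is assumed to mean a legacy 8-bit encoding such as windows-1252. A legacy file could in rare cases happen to be valid UTF-8, so the result is a guess.

```python
import codecs

def guess_encoding(data: bytes) -> str:
    """Classify raw file bytes: UTF-8 with BOM, pure ASCII, UTF-8, or something else."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8 with BOM"
    try:
        data.decode("ascii")
        # Only ASCII bytes: the file is byte-identical in ANSI and UTF-8
        return "ascii (identical in ANSI and UTF-8)"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8 without BOM"
    except UnicodeDecodeError:
        # Assumption: anything else is a legacy 8-bit encoding
        return "not utf-8 (probably ANSI / windows-1252)"
```

For a file on disk it could be called as `guess_encoding(open("index.html", "rb").read())`, where `index.html` is a placeholder name.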

Posted by: pandy Dec 15 2015, 07:14 PM

I check what my text editor says it is. It isn't Unicode enabled, so it tells me rather loudly. Among other things it doesn't let me edit the document unless I actively OK that. It also says in the status bar, so I don't need to click anything. But many editors have that feature, I think.

Do you write in Swedish? If you write strictly in English it doesn't matter. You can save as ANSI and serve as UTF-8, but you already know that I guess.

I don't know if there is any real advantage to UTF-8 if you seldom use characters outside the ISO Latin charset. Can't think of any. I guess it would be that it's extensible: if you want to add some longer quotes in a language outside ISO Latin in the future, you don't have to change anything. Just remap your keyboard - like anyone would do that.

You'll probably get a better answer from Darin.

Posted by: Christian J Dec 15 2015, 09:04 PM

QUOTE(pandy @ Dec 16 2015, 01:14 AM)

I check what my text editor says it is. It isn't Unicode enabled, so it tells me rather loudly. Among other things it doesn't let me edit the document unless I actively OK that.

I think that only happens in mine if the editor font doesn't contain characters found in the file.

QUOTE
It also says in the status bar, so I don't need to click anything. But many editors have that feature, I think.

Seems mine does too, but as soon as you start editing the info goes away. Maybe that's why I've never noticed it before.

QUOTE
Do you write in Swedish?

That and English mostly, but you never know...

QUOTE
If you write strictly in English it doesn't matter. You can save as ANSI and serve as UTF-8, but you already know that I guess.

No I didn't know that, thank you! Had to test:

- First I saved a file with Swedish ÅÄÖ characters as UTF-8 (without BOM). When I opened it again, my editor said it was UTF-8.

- Next I removed the ÅÄÖ from the same file and saved it again as UTF-8. When I opened it again, my editor now said it was ANSI.

This might trick you if you remove ÅÄÖ characters from a UTF-8 file, save it, and then add ÅÄÖ back later. When it's time to save it again, my text editor then chooses ANSI by default.
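A likely reason for the editor's behaviour is that a file containing only ASCII characters has exactly the same bytes in both encodings, so there is nothing for the editor to tell them apart by. A small sketch, assuming "ANSI" means windows-1252 (`cp1252` in Python); the sample strings are made up:

```python
english = "Hello, world"
swedish = "Hallå, världen"

# ASCII-only text: both encodings produce exactly the same bytes
same_for_ascii = english.encode("utf-8") == english.encode("cp1252")

# Text with å and ä: UTF-8 uses two bytes per letter, windows-1252 uses one
utf8_bytes = swedish.encode("utf-8")
ansi_bytes = swedish.encode("cp1252")
same_for_swedish = utf8_bytes == ansi_bytes

print(same_for_ascii, same_for_swedish)
print(len(utf8_bytes), len(ansi_bytes))
```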

Posted by: pandy Dec 15 2015, 09:39 PM

My editor can't save as Unicode, so not a problem. I use another editor if I need to use Unicode.

Yeah, the ASCII range of characters is encoded the same in ANSI and UTF-8. Very handy, that. Especially with the editor I use. If I need to edit UTF-8 files I'm alright as long as they are in English.

But regarding your problem, can't you set your editor to default to UTF-8 and use that all the time?

Posted by: Christian J Dec 16 2015, 10:46 AM

QUOTE(pandy @ Dec 16 2015, 03:39 AM)

But regarding your problem, can't you set your editor to default to UTF-8 and use that all the time?

Seems I can. Will there be any problems doing this for CSS, JS and PHP files too, as long as I stick to ISO Latin characters? Some sites suggest serving these file formats with UTF-8 headers, which sounds like a complication.
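For reference, a "UTF-8 header" here just means a `charset` parameter on the Content-Type header the server sends. The sketch below only shows what such a header value looks like; the helper name `content_type_for` is hypothetical, and a real site would normally set this in the web server configuration rather than in Python.

```python
import mimetypes

def content_type_for(filename: str, charset: str = "utf-8") -> str:
    """Build a Content-Type header value, adding a charset to text-like types."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        mime = "application/octet-stream"
    # Only text-like types take a charset parameter in this sketch
    if mime.startswith("text/") or mime in ("application/javascript", "application/json"):
        return f"{mime}; charset={charset}"
    return mime

print(content_type_for("style.css"))
```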

Posted by: pandy Dec 16 2015, 11:59 AM

I don't know. I remember reading something about that but it was looong ago. Since I've stuck with iso-latin for files I create myself I haven't given it much thought.

Posted by: Christian J Dec 16 2015, 05:36 PM

If ISO Latin characters are encoded the same in ANSI and UTF-8, like you wrote, I guess there is literally no difference -- it's only if you want to use non-ISO Latin characters in CSS/JS/PHP (e.g. Hieroglyphs for variable names, or scripts that print Swedish text) that you may need to send UTF-8 headers.



Posted by: pandy Dec 16 2015, 06:12 PM

No, not the whole ANSI range. Just the ASCII range, i.e. a-z, A-Z, 0-9 and common punctuation marks. You can't use ÅÄÖ, ê, ñ, ø, € and so on and save as either ANSI or UTF-8 and serve as whichever you choose. That is, English only!
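What goes wrong outside the ASCII range can be sketched in a few lines. This assumes "ANSI" means windows-1252 and imitates a browser by decoding the saved bytes with the other encoding:

```python
text = "ÅÄÖ"

# Saved as ANSI but served as UTF-8: the lone high bytes are not valid
# UTF-8, so each one becomes a U+FFFD replacement character
ansi_read_as_utf8 = text.encode("cp1252").decode("utf-8", errors="replace")

# Saved as UTF-8 but served as ANSI: every letter turns into two characters
utf8_read_as_ansi = text.encode("utf-8").decode("cp1252")

# Plain ASCII survives the same treatment unchanged
ascii_ok = "abc".encode("cp1252").decode("utf-8") == "abc"

print(ansi_read_as_utf8, utf8_read_as_ansi, ascii_ok)
```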

Posted by: Christian J Dec 16 2015, 07:50 PM

Oops, I meant ASCII, not ANSI.

Posted by: Christian J May 12 2017, 07:14 PM

Now I tried converting a web site from iso-8859-1 to UTF-8, using TextPad. At first pages with Swedish text displayed correctly in the browser, but when I viewed source the Swedish letters å, ä and ö were changed into Ã¥, Ã¤ and Ã¶, and the file encoding was back to ANSI. How did that happen?

When I saved the above garbled files as UTF-8 a second time, the browser displayed the garbled characters instead of Swedish text. When I changed the garbled characters back to Swedish text and saved as UTF-8 a third time it suddenly worked, so I guess I must have made some mistake, but I'm not sure what.
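One plausible explanation for the second save (not a confirmed account of what TextPad did) is double encoding: UTF-8 bytes get misread as ANSI, and the resulting garbled text is then stored as UTF-8. A sketch, assuming ANSI means windows-1252:

```python
original = "åäö"

# Step 1: UTF-8 bytes are misread as ANSI
garbled = original.encode("utf-8").decode("cp1252")

# Step 2: the garbled text is saved as UTF-8 again, so the garbling is now
# stored as real characters and a browser shows it exactly as written
double_encoded = garbled.encode("utf-8")
shown_in_browser = double_encoded.decode("utf-8")

# Reversing step 1 recovers the original text
recovered = shown_in_browser.encode("cp1252").decode("utf-8")

print(garbled, recovered)
```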

After some more testing I wonder if it's the pages' META charset (that I forgot to change from iso-8859-1 to UTF-8) that made TextPad act strangely. But should a text editor pay attention to HTML tags? After batch correcting the META charsets I batch-saved the files, which now made them become UTF-8 by default, but this time åäö were changed into empty "[]" boxes in the source, and � ("?") characters in my browser. Seems things only work correctly if I save each file individually as UTF-8.

TextPad (v8.1.2) isn't very helpful, BTW. Changing its Default encoding for HTML files in the preferences doesn't seem to work, files still get saved as ANSI by default. Batch conversion doesn't seem possible either, instead you must(?) resave each file separately to change its encoding.
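Batch conversion can also be done outside the editor. The following is a rough sketch rather than a tested tool: the function name `convert_tree` and its parameters are invented for this example, the source encoding is assumed to be iso-8859-1, and it should only be run on a copy of the site. Files that already decode as UTF-8 are skipped so they are not encoded twice. The META charset in each page still has to be changed separately.

```python
from pathlib import Path

def convert_tree(root: str, pattern: str = "*.html",
                 source_encoding: str = "iso-8859-1") -> list[str]:
    """Re-encode matching files under root to UTF-8; return the paths that changed."""
    changed = []
    for path in sorted(Path(root).rglob(pattern)):
        raw = path.read_bytes()
        try:
            raw.decode("utf-8")
            continue  # already valid UTF-8 (or pure ASCII): leave it alone
        except UnicodeDecodeError:
            pass
        # Assumption: everything else is in the given legacy encoding
        text = raw.decode(source_encoding)
        path.write_bytes(text.encode("utf-8"))
        changed.append(str(path))
    return changed
```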

Posted by: pandy May 12 2017, 09:49 PM

I have a growing list of those. Can't be bothered to edit out the irrelevant bits, you get the idea. Well, the last isn't Unicode but it's there for a reason.

CODE
^!Replace "Ã¥" >> "å" TWAS
^!Replace "Ã¤" >> "ä" TWAS
^!Replace "Ã¶" >> "ö" TWAS
^!Replace "Ã…" >> "Å" TWAS
^!Replace "Ã„" >> "Ä" TWAS
^!Replace "Ã–" >> "Ö" TWAS
^!Replace "Ã©" >> "é" TWAS
^!Replace "Ã¡" >> "á" TWAS
^!Replace "Ã¨" >> "è" TWAS
^!Replace "Ã¼" >> "ü" TWAS
^!Replace "Ã¸" >> "ø" TWAS
^!Replace "Ã¦" >> "æ" TWAS
^!Replace "â€“" >> "-" TWAS
^!Replace "&amp;" >> "&" TWAS


I know it's a somewhat crude way to do it, but since my Unicode-illiterate editor is handy in so many other ways, I open Unicode files in it and replace the garbled characters with their ANSI counterparts. It of course only takes one click to run the above. The movie player I most often use doesn't do Unicode either, so it's mostly for subtitles that happen to be in Unicode.
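The replacement list above can be generalised, since each garbled pair is just the UTF-8 bytes of a character shown as windows-1252. A minimal sketch under that assumption; `fix_mojibake` is a made-up name, and it only undoes one round of garbling:

```python
def fix_mojibake(text: str) -> str:
    """Undo one round of 'UTF-8 bytes read as windows-1252'; return the text unchanged if that fails."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Text that was never garbled (or uses bytes cp1252 cannot express)
        # does not round-trip, so it is left as it is
        return text

print(fix_mojibake("Ã¥Ã¤Ã¶"))
```

Unlike a fixed list it needs no new entry per character, but it treats the whole string as one unit: text that mixes garbled and correct non-ASCII characters is returned unchanged.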

Posted by: Christian J May 13 2017, 07:04 AM

It does appear TextPad analyzes the META Charset tag. Here are some experiments:

1. I created an HTML file containing a <meta charset="UTF-8"> element and Swedish åäö characters. When saved the file became UTF-8 by default, and everything worked. Correction: now that I tested again, it did not work. UTF-8 is not the default encoding, even though I set it as the default in the preferences.

2. Created a second identical HTML file, but with a <meta charset="iso-8859-1"> element. When saved the file became ANSI by default. When I changed the META charset to UTF-8 and resaved, the file remained ANSI but Swedish letters were garbled. When I resaved explicitly as UTF-8 everything worked.

3. Created a third identical HTML file, but without any META charset. When saved it became ANSI by default. When I resaved as UTF-8, Swedish letters were garbled. When I added a <meta charset="UTF-8"> element and resaved, the file became UTF-8 and everything worked.

Posted by: pandy May 13 2017, 07:21 PM

I got lost there at some point...

Posted by: Christian J May 14 2017, 08:50 AM

Yes, I think I confused myself too. Maybe I should delete that post to protect the innocent...

Posted by: Christian J May 14 2017, 12:07 PM

Did some more tests. This makes my head spin, so maybe I got it wrong again.

TextPad's Preferences let you specify different default encodings for various document classes, but it seems the default for text documents affects other document classes too. In other words, UTF-8 as default for text documents will also apply to new documents saved as HTML. TextPad's default for HTML documents has no effect.

Furthermore, if you explicitly specify a non-default encoding when saving an HTML document, TextPad obeys you even if you use the wrong META charset (and characters like "å ä ö" are consequently garbled). Apparently TextPad encodes text differently depending on the META charset. When you open such a document in TextPad, it again seems TextPad lets the META charset decide the encoding:

- A document with <meta charset="UTF-8"> can be explicitly saved as ANSI (which turns "å ä ö" into "? ? ?"), but when you open it again TextPad considers it UTF-8.

- A document with <meta charset="iso-8859-1"> can be explicitly saved as UTF-8 (which turns "å ä ö" into "Ã¥ Ã¤ Ã¶"), but when you open it again TextPad considers it ANSI.
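Mismatches like the two cases above can be spotted by comparing the declared META charset with the bytes actually on disk. This sketch does not reproduce TextPad's logic. The names `declared_charset` and `check_file_bytes` are hypothetical, and only the first 1024 bytes are searched for the declaration.

```python
import re

META_RE = re.compile(rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)

def declared_charset(data: bytes):
    """Return the charset named in the first <meta ... charset=...> tag, or None."""
    match = META_RE.search(data[:1024])
    return match.group(1).decode("ascii").lower() if match else None

def check_file_bytes(data: bytes) -> str:
    """Compare the declared charset with whether the bytes are valid UTF-8."""
    declared = declared_charset(data)
    if declared is None:
        return "no meta charset"
    if data.isascii():
        # ASCII-only bytes are valid under either declaration
        return "consistent (ASCII only)"
    try:
        data.decode("utf-8")
        is_utf8 = True
    except UnicodeDecodeError:
        is_utf8 = False
    return "consistent" if (declared == "utf-8") == is_utf8 else "mismatch"
```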

Posted by: pandy May 14 2017, 05:07 PM

Never used TextPad, so can't be of much help I'm afraid.
