Full Version: Practical advice on how to work with UTF-8 files
Christian J
Those of you that use UTF-8, what does your workflow look like when creating new HTML files?

I suspect I'll keep forgetting to save new files as UTF-8 in my text editor. When that happens, how can you easily tell if a file was saved as UTF-8 or ANSI (especially if it was saved as UTF-8 without a BOM)? Do you have to look in your text editor's document properties? What if you forget to check, and the web page doesn't contain many exotic characters that might alert you to your mistake?

(I'm thinking of switching to UTF-8 just to please the W3C validator: http://forums.htmlhelp.com/index.php?s=&am...st&p=114090 but otherwise I don't really need it, and fear it will just cause problems.)
pandy
I check what my text editor says it is. It isn't Unicode enabled, so it tells me rather loudly. Among other things it doesn't let me edit the document unless I actively OK that. It also says in the status bar, so I don't need to click anything. But many editors have that feature, I think.

Do you write in Swedish? If you write strictly in English it doesn't matter. You can save as ANSI and serve as UTF-8, but you already know that I guess.

I don't know if there is any real advantage to UTF-8 if you seldom use characters outside the ISO Latin charset. Can't think of any. I guess it would be that it's extensible: if in the future you want to add some larger quotes in a language outside ISO Latin, you don't have to change anything. Just remap your keyboard - like anyone would do that.

You'll probably get a better answer from Darin.
Christian J
QUOTE(pandy @ Dec 16 2015, 01:14 AM)

I check what my text editor says it is. It isn't Unicode enabled, so it tells me rather loudly. Among other things it doesn't let me edit the document unless I actively OK that.

I think that only happens in mine if the editor font doesn't contain characters found in the file.

QUOTE
It also says in the status bar, so I don't need to click anything. But many editors have that feature, I think.

Seems mine does too, but as soon as you start editing the info goes away. Maybe that's why I've never noticed it before.

QUOTE
Do you write in Swedish?

That and English mostly, but you never know...

QUOTE
If you write strictly in English it doesn't matter. You can save as ANSI and serve as UTF-8, but you already know that I guess.

No I didn't know that, thank you! Had to test:

- First I saved a file with Swedish ÅÄÖ characters as UTF-8 (without BOM). When I opened it again, my editor said it was UTF-8.

- Next I removed the ÅÄÖ from the same file and saved it again as UTF-8. When I opened it again, my editor now said it was ANSI.

This might trick you if you remove ÅÄÖ characters from a UTF-8 file, save it, and then add ÅÄÖ back later. When it's time to save it again, my text editor then chooses ANSI by default.
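One way to see why the editor's report flips: a UTF-8 file without a BOM that contains only ASCII characters is byte-for-byte identical to the same text saved as ANSI, so there is nothing for an editor to detect. Below is a minimal Python sketch of that kind of check. It assumes "ANSI" means Windows-1252, and `guess_encoding` is a hypothetical helper, not how any particular editor actually works.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Roughly classify raw file bytes (hypothetical helper, for illustration)."""
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8 with BOM"
    if all(b < 0x80 for b in data):
        # Pure ASCII: the bytes are the same in ANSI and UTF-8,
        # so there is nothing to tell the two apart.
        return "ASCII only (valid as both ANSI and UTF-8)"
    try:
        data.decode("utf-8")  # strict decoding fails on typical ANSI bytes
    except UnicodeDecodeError:
        return "not UTF-8 (probably ANSI)"
    return "UTF-8 without BOM"


print(guess_encoding("åäö".encode("utf-8")))
print(guess_encoding("aao".encode("utf-8")))   # same file after removing åäö
print(guess_encoding("åäö".encode("cp1252")))
```

The second call is the situation from the test above: once the ÅÄÖ are gone, nothing in the bytes says "UTF-8" any more.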
pandy
My editor can't save as Unicode, so not a problem. I use another editor if I need to use Unicode.

Yeah, the ASCII range of characters is encoded the same in ANSI and UTF-8. Very handy, that. Especially with the editor I use. If I need to edit UTF-8 files I'm alright as long as they are in English.

But regarding your problem, can't you set your editor to default to UTF-8 and use that all the time?
Christian J
QUOTE(pandy @ Dec 16 2015, 03:39 AM)

But regarding your problem, can't you set your editor to default to UTF-8 and use that all the time?

Seems I can. Will there be any problems doing this for CSS, JS and PHP files too, as long as I stick to ISO Latin characters? Some sites suggest serving these file formats with UTF-8 headers, which sounds like a complication.
pandy
I don't know. I remember reading something about that but it was looong ago. Since I've stuck with iso-latin for files I create myself I haven't given it much thought.
Christian J
If ISO Latin characters are encoded the same in ANSI and UTF-8, like you wrote, I guess there is literally no difference -- it's only if you want to use non-ISO Latin characters in CSS/JS/PHP (e.g. Hieroglyphs for variable names, or scripts that print Swedish text) that you may need to send UTF-8 headers.


pandy
No, not the whole ANSI range. Just the ASCII range, i.e. a-z, A-Z, 0-9 and common punctuation marks. You can't use ÅÄÖ, ê, ñ, ø, € and so on and save as either ANSI or UTF-8 and serve as whichever you choose. That is, English only!
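To make the difference concrete, here is a small Python sketch that compares the stored bytes for the characters mentioned above. It assumes "ANSI" means Windows-1252 (Python's `cp1252` codec); `same_bytes` is just an illustrative name.

```python
def same_bytes(text: str) -> bool:
    """True if the text is stored identically in Windows-1252 and UTF-8."""
    return text.encode("cp1252") == text.encode("utf-8")


for sample in ["Hello, world!", "ÅÄÖ", "ê", "ñ", "ø", "€"]:
    ansi = sample.encode("cp1252").hex(" ")   # one byte per character
    utf8 = sample.encode("utf-8").hex(" ")    # two or three bytes outside ASCII
    print(f"{sample!r}: ANSI={ansi}  UTF-8={utf8}  same={same_bytes(sample)}")
```

Only the plain English sample comes out the same in both encodings, which is why such a file can be saved as one and served as the other.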
Christian J
Oops, I meant ASCII, not ANSI.
Christian J
Now I tried converting a web site from iso-8859-1 to UTF-8, using TextPad. At first pages with Swedish text displayed correctly in the browser, but when I viewed source the Swedish letters å, ä and ö were changed into Ã¥, Ã¤ and Ã¶, and the file encoding was back to ANSI. How did that happen?

When I saved the above garbled files as UTF-8 a second time, the browser displayed the garbled characters instead of Swedish text. When I changed the garbled characters back to Swedish text and saved as UTF-8 a third time it suddenly worked, so I guess I must have made some mistake, but I'm not sure what.
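A plausible explanation, sketched in Python: the garbled pairs are what you get when UTF-8 bytes are read as ANSI, and saving that misread text as UTF-8 again bakes the garbling in. This only illustrates the mechanism; it assumes ANSI is Windows-1252 and says nothing certain about what TextPad did internally. `misread_as_ansi` is a made-up name.

```python
def misread_as_ansi(text: str) -> str:
    """Encode as UTF-8, then decode those bytes as if they were Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


once = misread_as_ansi("åäö")    # UTF-8 file opened as ANSI
twice = misread_as_ansi(once)    # garbled text saved as UTF-8, then misread again

print(once)    # Ã¥Ã¤Ã¶
print(twice)   # even longer garbage: every character has doubled again

# Each wrong round trip can be undone by reversing it exactly.
print(twice.encode("cp1252").decode("utf-8") == once)
```

Typing the Swedish letters back in and saving as UTF-8, as in the third attempt, replaces the garbled text with correct characters, which matches the result described.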

After some more testing I wonder if it's the pages' META charset (that I forgot to change from iso-8859-1 to UTF-8) that made TextPad act strange --but should a text editor pay attention to HTML tags? After batch correcting the META charsets I batch-saved the files, which now made them become UTF-8 by default, but this time åäö were changed into empty "[]" boxes in the source, and � ("?") characters in my browser. Seems things only work correctly if I save each file individually as UTF-8.

TextPad (v8.1.2) isn't very helpful, BTW. Changing its Default encoding for HTML files in the preferences doesn't seem to work, files still get saved as ANSI by default. Batch conversion doesn't seem possible either, instead you must(?) resave each file separately to change its encoding.
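If the editor can't batch-convert, a short script outside the editor can. The following is a minimal Python sketch, assuming the existing files really are Windows-1252/ISO-8859-1 and that you run it on a backup copy first. `convert_to_utf8` and `convert_tree` are hypothetical helper names, and the script does not touch the META charset, which still has to be changed separately.

```python
from pathlib import Path


def convert_to_utf8(path: Path, source_encoding: str = "cp1252") -> bool:
    """Re-save one file as UTF-8 without BOM. Returns False if left untouched."""
    data = path.read_bytes()
    try:
        data.decode("utf-8")
        return False  # already valid UTF-8 (or pure ASCII): converting again would garble it
    except UnicodeDecodeError:
        pass
    text = data.decode(source_encoding)      # raises if the bytes don't fit the assumed encoding
    path.write_bytes(text.encode("utf-8"))   # bytes in, bytes out: line endings are preserved
    return True


def convert_tree(root: str, pattern: str = "*.html") -> list:
    """Convert every matching file under root; return the files that were changed."""
    return [p for p in sorted(Path(root).rglob(pattern)) if convert_to_utf8(p)]
```

Calling something like `convert_tree("site")` would convert the HTML files under a folder named `site`. Skipping files that already decode as UTF-8 means running it twice does no harm.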
pandy
I have a growing list of those. I can't be bothered to edit out the irrelevant bits, but you get the idea. Well, the last isn't Unicode but it's there for a reason.

CODE
^!Replace "Ã¥" >> "å" TWAS
^!Replace "Ã¤" >> "ä" TWAS
^!Replace "Ã¶" >> "ö" TWAS
^!Replace "Ã…" >> "Å" TWAS
^!Replace "Ã„" >> "Ä" TWAS
^!Replace "Ã–" >> "Ö" TWAS
^!Replace "Ã©" >> "é" TWAS
^!Replace "Ã¡" >> "á" TWAS
^!Replace "Ã¨" >> "è" TWAS
^!Replace "Ã¼" >> "ü" TWAS
^!Replace "Ã¸" >> "ø" TWAS
^!Replace "Ã¦" >> "æ" TWAS
^!Replace "â€“" >> "-" TWAS
^!Replace "&amp;" >> "&" TWAS


I know it's a somewhat clumsy way to do it, but since my Unicode-illiterate editor is handy in so many other ways, I open Unicode files in it and replace the garbled characters with their ANSI counterparts. It of course only takes one click to run the above. The movie player I most often use doesn't do Unicode either, so it's mostly for subtitles that happen to be in Unicode.
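The same repair can be done without a lookup list, because each garbled pair is just the UTF-8 bytes of one character shown as ANSI. A minimal Python sketch, assuming the garbling came from Windows-1252 (`fix_mojibake` is an illustrative name, not a library function):

```python
def fix_mojibake(text: str) -> str:
    """Undo one round of 'UTF-8 bytes displayed as Windows-1252'."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not that kind of garbling (or already fine): leave it alone.
        return text


print(fix_mojibake("Ã¥ Ã¤ Ã¶"))   # å ä ö
print(fix_mojibake("smörgås"))     # unchanged, it was never garbled
```

Unlike the list above, this yields real Unicode characters (an en dash stays an en dash rather than becoming "-"), so the result still has to be saved in an encoding the target program understands. It also works on the whole string at once: text that mixes garbled and correct non-ASCII characters is returned unchanged.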
Christian J
It does appear TextPad analyzes the META Charset tag. Here are some experiments:

1. I created an HTML file containing a <meta charset="UTF-8"> element and Swedish åäö characters. When saved the file became UTF-8 by default, and everything worked. Correction: now that I tested again, it did not work. UTF-8 is not the default encoding, even though I set it to in the preferences.

2. Created a second identical HTML file, but with a <meta charset="iso-8859-1"> element. When saved the file became ANSI by default. When I changed the META charset to UTF-8 and resaved, the file remained ANSI but Swedish letters were garbled. When I resaved explicitly as UTF-8 everything worked.

3. Created a third identical HTML file, but without any META charset. When saved it became ANSI by default. When I resaved as UTF-8, Swedish letters were garbled. When I added a <meta charset="UTF-8"> element and resaved, the file became UTF-8 and everything worked.
pandy
I got lost there at some point...
Christian J
Yes, I think I confused myself too. Maybe I should delete that post to protect the innocent...
Christian J
Did some more tests. This makes my head spin, so maybe I got it wrong again.

TextPad's Preferences let you specify different default encodings for various document classes, but it seems the default for text documents affects other document classes too. In other words, UTF-8 as default for text documents will also apply to new documents saved as HTML --TextPad's default for HTML documents has no effect.

Furthermore, if you explicitly specify a non-default encoding when saving an HTML document, TextPad obeys you even if you use the wrong META charset (and characters like "å ä ö" are consequently garbled). Apparently TextPad encodes text differently depending on the META charset. When you open such a document in TextPad, it again seems TextPad lets the META charset decide the encoding:

- A document with <meta charset="UTF-8"> can be explicitly saved as ANSI (which turns "å ä ö" into "? ? ?"), but when you open it again TextPad considers it UTF-8.

- A document with <meta charset="iso-8859-1"> can be explicitly saved as UTF-8 (which turns "å ä ö" into "Ã¥ Ã¤ Ã¶"), but when you open it again TextPad considers it ANSI.
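Mismatches like the two above can be spotted outside the editor by comparing the declared charset with what the bytes actually look like. Here is a minimal Python sketch. It is not TextPad's logic, the regular expression is a rough approximation rather than a real HTML parser, and `declared_charset` and `diagnose` are hypothetical names.

```python
import re

# Rough match for <meta charset="..."> and <meta ... content="...; charset=...">
META_CHARSET = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)


def declared_charset(data: bytes):
    """Return the charset named in a META element, lowercased, or None."""
    match = META_CHARSET.search(data)
    return match.group(1).decode("ascii").lower() if match else None


def diagnose(data: bytes):
    """Return (declared charset, what the bytes actually look like)."""
    if all(b < 0x80 for b in data):
        actual = "ascii"       # same bytes in ANSI and UTF-8
    else:
        try:
            data.decode("utf-8")
            actual = "utf-8"
        except UnicodeDecodeError:
            actual = "ansi"    # assumption: anything else is a single-byte encoding
    return declared_charset(data), actual


# Declared UTF-8 but saved as ANSI, and the reverse:
print(diagnose('<meta charset="UTF-8"><p>åäö</p>'.encode("cp1252")))
print(diagnose('<meta charset="iso-8859-1"><p>åäö</p>'.encode("utf-8")))
```

When the two values disagree, either the META element or the saved encoding needs to change.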
pandy
Never used TextPad, so can't be of much help I'm afraid.