infiniter wrote:Can someone post the czech subs for a better translation? If this had been a Google translation, then it would be better.
I am afraid that Google translation from Czech (and other Slavonic languages) works much worse than it does for, e.g., Germanic languages.
BizarreLoveTriangle wrote:I think ASCII vs. Unicode is similar issue as 8.3 vs. long filenames. All Linux applications that I know/use are already fully supporting Unicode. It is quite amazing, actually.
Unicode was originally designed as a 16-bit character encoding unifying all the former 8-bit encodings. While the 8-bit encodings are a great mess, fortunately Unicode is a single internationally agreed standard. There is also a larger 32-bit form, UCS-4 (UTF-32), defined in ISO/IEC 10646; the 16-bit range is a subset of it. See
http://www.unicode.org and
http://www.unicode.org/charts .
The same Unicode data can be stored in various ways. In memory, 16-bit values are stored either "BIG ENDIAN" or "LITTLE ENDIAN", i.e. with the higher or the lower byte of the 16 bits first (at the lower address). For files there are several encoding forms: UTF-16 (two bytes per 16-bit unit, no compaction) and UTF-8 (genuine ASCII characters are encoded in 1 byte, most other European characters in 2 bytes, the remaining 16-bit characters such as Chinese in 3 bytes, and characters above the 16-bit range in 4 bytes; the original UCS-4 scheme allowed up to 6 bytes). UTF-7 is used for transmission through 7-bit networks.
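To make the byte counts above concrete, here is a small Python sketch (my own illustration, with sample characters I picked myself) that measures how many bytes a character needs in UTF-8 and UTF-16, and shows one 16-bit value in both byte orders.

```python
# Sample characters (chosen for illustration only), written as escapes so the
# script does not depend on the encoding of the source file itself.
SAMPLES = {
    "A": "plain ASCII letter",
    "\u010d": "Czech c with caron",
    "\u4e2d": "Chinese character",
    "\U0001d11e": "musical G clef, outside the 16-bit range",
}


def encoded_lengths(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for ch, without any BOM."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


for ch, note in SAMPLES.items():
    utf8_len, utf16_len = encoded_lengths(ch)
    print(f"U+{ord(ch):04X} {note}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")

# The same 16-bit value U+010D stored in both byte orders.
big = "\u010d".encode("utf-16-be")     # higher byte first
little = "\u010d".encode("utf-16-le")  # lower byte first
print(big.hex(), little.hex())
```

Note that the character outside the 16-bit range takes 4 bytes in UTF-16 as well, because it is stored as a pair of 16-bit units.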
While the use of little or big endian in memory depends on the processor architecture, by convention Unicode characters are always written with the high-order digits first, in the form U+xxxx, where "xxxx" are four hexadecimal digits (0-9, A-F).
To distinguish between UTF-16 with big and little endian, programmers introduced the so-called Byte Order Mark (BOM), the special Unicode character U+FEFF. By convention it is prefixed to the data in Unicode files, so little endian UTF-16 files start with the hex bytes FF FE and big endian UTF-16 files with FE FF. Later, the same character came to be used as a prefix for UTF-8 files too (in its 3-byte encoding, EF BB BF). The BOM has no graphical representation and is skipped when the file is read into memory.
All good software, when it recognises a BOM preceding the text data, processes the data using the corresponding Unicode encoding. But such behaviour is not obligatory. And a BOM preceding the text data is also not obligatory - and then a mess can occur.
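The BOM check can be sketched in a few lines of Python. This is only my illustration of the idea, not how any particular program does it: the function name decode_with_bom and the cp1250 fallback are my own assumptions, and UTF-32 (whose little endian BOM also begins with FF FE) is deliberately ignored.

```python
import codecs

# BOM byte sequences from the standard codecs module and the encoding each signals.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def decode_with_bom(data, fallback="cp1250"):
    """Decode bytes using the BOM if one is present, else the fallback encoding.

    Returns (text, encoding_used). The BOM itself is skipped, not returned.
    The fallback is an assumption: without a BOM the real encoding is a guess.
    """
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding), encoding
    return data.decode(fallback), fallback


# The same Czech word stored three ways; each comes back as the same text.
word = "\u010daj"
for stored in (
    codecs.BOM_UTF8 + word.encode("utf-8"),
    codecs.BOM_UTF16_LE + word.encode("utf-16-le"),
    word.encode("cp1250"),  # no BOM, relies on the fallback being right
):
    text, used = decode_with_bom(stored)
    print(stored.hex(), used, text == word)
```

When there is no BOM the function simply trusts the fallback, which is exactly where the mess starts if the file was written in some other 8-bit encoding.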
Use of big or little endian depends on the processor architecture; many classical processors use big endian, while processors developed from 8-bit architectures (including all "x86" processors used in the majority of computers running Windows) use little endian. But the use of 8-bit or 16-bit encoding depends on the programmer; operating systems generally prefer one of them (Windows has preferred Unicode since the NT version) but support both. For this reason, the encoding actually used depends on the standards respected by the particular programmer. The operating system also designates one 8-bit encoding as "common", and programs use this information: Windows uses CP 1252 in the USA, CP 1250 in Central Europe, and CP 1251 (Cyrillic) in Russia. The common encoding is set during OS installation and can be changed.
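Why the "common" 8-bit encoding matters can be seen from a single byte. In this Python sketch (the byte 0xE8 and the helper name decode_in_code_pages are my own choices for illustration) the same byte is read under the three Windows code pages named above.

```python
CODE_PAGES = ("cp1250", "cp1251", "cp1252")


def decode_in_code_pages(raw):
    """Decode the same bytes under each Windows code page; returns {code page: text}."""
    return {cp: raw.decode(cp) for cp in CODE_PAGES}


results = decode_in_code_pages(bytes([0xE8]))
for cp, text in results.items():
    # Print the code point rather than the glyph, so the output is plain ASCII.
    print(cp, f"U+{ord(text):04X}")
```

The byte comes out as Czech c with caron (U+010D) in CP 1250, Cyrillic i (U+0438) in CP 1251 and e with grave (U+00E8) in CP 1252, so a Czech subtitle file opened on a machine set to a different code page shows wrong letters.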
Such encodings are generally used for text files (including text subtitles *.srt and some *.sub; other *.sub files contain a graphical representation independent of encoding); other techniques are used for encoding non-ASCII characters in filenames (including ed2k references). Conversion of text files among the various forms of Unicode is simple; conversion to any 8-bit encoding is no problem either, but the converter may find characters unknown to the target encoding (generally replaced by ?). The same is true when converting from an 8-bit encoding to Unicode, but the source encoding must be known. I can present one such program in C# for the Windows .NET Framework 3.5.
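The conversion itself is short. The following Python sketch is not the C# program mentioned above, just a minimal illustration under my own assumptions (the function name convert and the sample word are hypothetical): decode with the known source encoding, then encode to the target, replacing unknown characters by ?.

```python
def convert(data, source_encoding, target_encoding):
    """Re-encode bytes from source_encoding to target_encoding.

    Characters that the target encoding cannot represent are replaced by '?'.
    The source encoding must be supplied by the caller; it cannot be guessed.
    """
    text = data.decode(source_encoding)
    return text.encode(target_encoding, errors="replace")


czech = b"\xe8aj"  # the Czech word for tea, stored in CP 1250

to_utf8 = convert(czech, "cp1250", "utf-8")          # every character survives
to_cp1252 = convert(czech, "cp1250", "cp1252")       # first letter is unknown there
wrong_source = convert(czech, "cp1252", "utf-8")     # wrong assumption about source

print(to_utf8.hex(), to_cp1252, wrong_source.hex())
```

With the wrong source encoding nothing fails and no ? appears; the result is simply a different letter, which is why the source encoding has to be known in advance.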
All intelligent viewers supporting *.srt and text *.sub subtitles support Unicode. E.g. in VLC player one can specify one of the 8-bit encodings as the default, but when the player finds a BOM at the beginning of the file, it processes the data as Unicode (UTF-8 or UTF-16). But not all viewers and other programs are intelligent.
Another problem is the conversion of names to Latin. While in languages using Latin letters with accents or diacritics all the accents are simply removed, for other languages it is not so clear, Russian in particular.
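Removing accents from Latin letters can be sketched in Python with the standard unicodedata module. This is only one possible approach, and the function name strip_accents and the sample names are my own: the text is decomposed into base letters plus combining marks, and the marks are dropped.

```python
import unicodedata


def strip_accents(text):
    """Remove accents by decomposing characters (NFD) and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


czech_name = "Dvo\u0159\u00e1k"                          # Czech surname with two accented letters
russian_name = "\u041c\u043e\u0441\u043a\u0432\u0430"    # "Moscow" written in Cyrillic

print(strip_accents(czech_name))
# Cyrillic letters have no Latin base letter, so the text comes back unchanged.
print(strip_accents(russian_name) == russian_name)
```

For the Czech name this gives plain "Dvorak", but the Russian word is left in Cyrillic; it needs a transliteration table, and there are several competing ones.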