Crazy with character encoding

I have a text file with the following content:
"((^)|(.* +))§§§§§§§§"

If I read it with:

    k=System.IO.StreamReader( "file.txt",System.Text.Encoding.ASCII);
    k.readtotheend()

then "§" is lost.
If I read it with UTF7, then "+" is lost.

Please help: how can I read the file into a string so that I keep all the characters?
Thanks

Zhiv Kurilka <Zhiv.Kuri***@LozhkaVil.ca> wrote:
> I have a text file with following content:
> "((^)|(.* +))§§§§§§§§"
>
> if I read it with:
> k=System.IO.StreamReader( "file.txt",System.Text.Encoding.ASCII);
>
> k.readtotheend()
>
> Then "§" is lost

Yes, because that isn't an ASCII character.

> If read it with UTF7
>
> then "+" is lost.

Yes, because it isn't a UTF-7 file. (UTF-7 is very rarely used outside mail.)

> Please help, how can I read the file into string so that I have all
> characters?

Well, what encoding is the file in? What created it?

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too

I have created it by hand, it is just a number of characters. I suppose I
need a byte reader for that. Right?

"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message news:MPG.1f3c3b1ccf01687c98d381@msnews.microsoft.com...
<snip>
> Well, what encoding is the file in? What created it?

The question is rather with what did you create it, and how did you save it.
I'm guessing it is saved with the default ANSI table for your computer, in which case using Encoding.Default when reading it should give you the proper string.

On Thu, 03 Aug 2006 19:08:48 +0200, Zhiv Kurilka <Zhiv.Kuri***@LozhkaVil.ca> wrote:

> I have created it by hand, it is just a number of characters. I suppose I
> nead a bytereader for that. Right?
<snip>

--
Happy coding!
Morten Wennevik [C# MVP]

I have created it with Visual Studio text editor. It is a plain text file.
(I suppose). I opened a new text document and wrote those symbols into it.
Encoding.Default gives a wrong string.
I am trying a byte reader now.

"Morten Wennevik" <MortenWenne***@hotmail.com> wrote in message news:op.tdpw09k4klbvpo@stone...
> The question is rather with what did you create it, and how did you save
> it.
> I'm guessing it is saved with the default ansi table for your computer, in
> which case using Encoding.Default when reading it should give you the
> proper string.
<snip>

A byte reader won't help you, as it needs the same kind of encoding as the StreamReader to be able to make sense of the bytes.
My Visual Studio 2005 seems to want to save a text file as Windows-1252, so you can try using that:

    new StreamReader("file.txt", Encoding.GetEncoding("Windows-1252"));

On Thu, 03 Aug 2006 19:38:44 +0200, Zhiv Kurilka <Zhiv.Kuri***@LozhkaVil.ca> wrote:

> I have created it with Visual Studio text editor. It is a plain text file.
> (I suppose). I opened a new text document and wrote those symbols into it.
> Encoding.default gives a wrong string.
> I am trying byte reader now
<snip>

--
Happy coding!
Morten Wennevik [C# MVP]

Zhiv Kurilka wrote:
> I have created it with Visual Studio text editor. It is a plain text file.
> (I suppose). I opened a new text document and wrote those symbols into it.
> Encoding.default gives a wrong string.
> I am trying byte reader now

Have you tried to read it as UTF8? I think VS saves files in that format.

Max

Some of the files are created using VS2003, others VS2005. I need some way to
get the encoding from a file automatically. Is it possible?
P.S. I have tried UTF8. For most files it fails.
I am sorry, but I still don't understand what is going on. Why does the VS editor show the files properly, but I can't write them?

"Markus Stoeger" <spamhole@gmx.at> wrote in message news:OBlKVVytGHA.3552@TK2MSFTNGP03.phx.gbl...
<snip>
> Have you tried to read it as UTF8? I think VS saves files in that format.
>
> Max

The VS editor shows files properly because it reads them using the correct encoding.
Have you tried Windows-1252?

On Thu, 03 Aug 2006 20:08:36 +0200, Zhiv Kurilka <Zhiv.Kuri***@LozhkaVil.ca> wrote:

> Some of the files are created using VS2003 others VS2005. I need some way to
> get encoding from file automatically. Is it possible?
> P.S. I have tried UTF8. For most files it fails.
> I am sorry, but I still don't understand what is going on. Why VS editor
> shows files properly, but I can't write them?
<snip>

--
Happy coding!
Morten Wennevik [C# MVP]

Zhiv Kurilka <Zhiv.Kuri***@LozhkaVil.ca> wrote:
> Some of the files are created using VS2003 others VS2005. I need some way to
> get encoding from file automatically. Is it possible?

No. There are ways of making a reasonable guess, but it would still be a guess.

> P.S. I have tried UTF8. For most files it fails.

So it's not UTF-8 and it's not the default encoding for the system. That's fairly odd. Perhaps you could mail me some of the files?

> I am sorry, but I still don't understand what is going on. Why VS editor
> shows files properly, but I can't write them?

Visual Studio presumably guesses correctly what encoding they're in.

It sounds like you're still not really sure what an encoding is though. See if http://www.pobox.com/~skeet/csharp/unicode.html helps.

--
Jon Skeet - <sk***@pobox.com>

Dear Sirs,
I have uploaded the file:
http://a1234113.narod.ru/test.zip

I tried all your suggestions:

    Dim _sr As New System.IO.StreamReader(_fn, System.Text.Encoding.XXXX)
    m_filetext = _sr.ReadToEnd
    _sr.Close()

Either + or § is missing, or everything is garbage.

Could you give me some advice?

Thanks a lot

Zhiv Kurilka <Zhiv.Kuri***@LozhkaVil.ca> wrote:
> Dear Sirs,
> I have uploaded the file:
> http://a1234113.narod.ru/test.zip
>
> I tried all your suggestions.
> Dim _sr As New System.IO.StreamReader(_fn, System.Text.Encoding.XXXX)
>
> m_filetext = _sr.ReadToEnd
>
> _sr.Close()
>
> Either + or § is missing or all is crap.
>
> Could you give me an advice?

Encoding.Default works fine for me.

--
Jon Skeet - <sk***@pobox.com>
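
For readers following the thread, here is a minimal sketch of the point made in the replies so far: the same bytes decode to different strings depending on the Encoding handed to StreamReader. It is not code from any poster. It assumes the file holds single-byte Western text in which "§" is the byte 0xA7, it uses ISO-8859-1 (code page 28591) as a stand-in for whatever the editor really saved, and the file name demo.txt and the helper ReadAll are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

class ReadWithEncodingDemo
{
    // Reads the whole file, decoding its bytes with the given encoding.
    static string ReadAll(string path, Encoding encoding)
    {
        using (StreamReader reader = new StreamReader(path, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        // Hypothetical scratch file; any writable path would do.
        string path = Path.Combine(Path.GetTempPath(), "demo.txt");
        Encoding latin1 = Encoding.GetEncoding(28591); // ISO-8859-1

        // Write the sample text as single-byte data (assumption: this is
        // roughly what an ANSI-style editor would have saved).
        string sample = "((^)|(.* +))\u00A7\u00A7\u00A7\u00A7\u00A7\u00A7\u00A7\u00A7";
        File.WriteAllBytes(path, latin1.GetBytes(sample));

        // ASCII has no mapping for 0xA7, so the decoder substitutes it.
        Console.WriteLine("ASCII:   " + ReadAll(path, Encoding.ASCII));
        // A lone 0xA7 is not a valid UTF-8 sequence, so it is replaced too.
        Console.WriteLine("UTF-8:   " + ReadAll(path, Encoding.UTF8));
        // Decoding with the encoding the bytes were written in restores the text.
        string roundTripped = ReadAll(path, latin1);
        Console.WriteLine("Latin-1: " + roundTripped);
        Console.WriteLine("Same as original: " + (roundTripped == sample));

        File.Delete(path);
    }
}
```

How the substituted characters look on screen also depends on the console's own output encoding, so the comparison on the last line is the more reliable check.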
"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote:
>> I have uploaded the file:
>> http://a1234113.narod.ru/test.zip
<snip>
>
> Encoding.Default works fine for me.

Maybe the OP's version of Windows uses a different default Windows-ANSI code page.

--
 M S   Herfried K. Wagner
M V P  <URL:http://dotnet.mvps.org/>
 V B   <URL:http://classicvb.org/petition/>

Herfried K. Wagner [MVP] <hirf-spam-me-here@gmx.at> wrote:
> "Jon Skeet [C# MVP]" <sk***@pobox.com> wrote:
<snip>
> > Encoding.Default works fine for me.
>
> Maybe the OP's version of Windows uses a different default Windows-ANSI
> codepage.

But in that case, I'd have expected Visual Studio to use that default encoding too - if it works in Studio and it's CP-1252, I can't think why Studio would choose 1252 instead of the default code page.

--
Jon Skeet - <sk***@pobox.com>
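
One way to check the "different default code page" hypothesis, and to make the kind of reasonable guess mentioned earlier in the thread, might look like the sketch below. It is an illustration, not anything posted by the participants: the helper name ReadWithFallback is hypothetical, and the guess is limited to honouring a byte order mark and otherwise falling back to a caller-supplied encoding. Note that on the .NET Framework Encoding.Default is the system ANSI code page, whereas on .NET Core and later it is always UTF-8.

```csharp
using System;
using System.IO;
using System.Text;

class GuessEncodingDemo
{
    // Opens the file, honouring a byte order mark if one is present and
    // otherwise decoding with the supplied fallback. Reports which encoding
    // the reader actually ended up using.
    static string ReadWithFallback(string path, Encoding fallback, out Encoding used)
    {
        using (StreamReader reader = new StreamReader(path, fallback, true))
        {
            string text = reader.ReadToEnd();
            // CurrentEncoding is only meaningful after the first read.
            used = reader.CurrentEncoding;
            return text;
        }
    }

    static void Main(string[] args)
    {
        // Shows what "the default" means on the machine running this.
        Console.WriteLine("Encoding.Default: {0} (code page {1})",
            Encoding.Default.WebName, Encoding.Default.CodePage);

        if (args.Length > 0)
        {
            Encoding used;
            string text = ReadWithFallback(args[0], Encoding.Default, out used);
            Console.WriteLine("Decoded {0} chars using {1}", text.Length, used.WebName);
        }
    }
}
```

A file without a byte order mark gives the reader nothing to detect, so the result is then only as good as the fallback passed in; that is why this remains a guess.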
"Zhiv Kurilka" <Zhiv.Kuri***@LozhkaVil.ca> wrote in message
> Hi,
> I have a text file with following content:
> "((^)|(.* +))§§§§§§§§"
>
> if I read it with:
> k=System.IO.StreamReader( "file.txt",System.Text.Encoding.ASCII);
>
> k.readtotheend()
>
> Then "§" is lost
>
> If read it with UTF7
>
> then "+" is lost.
>
> Please help, how can I read the file into string so that I have all
> characters?

I had the same problem: an encoding converts the bytes into chars using its own rules, but I just want the char that corresponds to the byte without any conversion.

The solution is boring but quite simple: read and write the file as a byte array, and restore it by casting each byte into a char and back:

    // Keeps only the low 8 bits of each char.
    public static byte[] GetBytes(string s)
    {
        byte[] b = new byte[s.Length];
        for (int i = 0; i < b.Length; ++i)
        {
            b[i] = (byte)s[i];
        }
        return b;
    }

    public static byte[] GetBytes(char[] c)
    {
        byte[] b = new byte[c.Length];
        for (int i = 0; i < b.Length; ++i)
        {
            b[i] = (byte)c[i];
        }
        return b;
    }

    public static string GetString(byte[] buffer)
    {
        return new string(GetChars(buffer));
    }

    // Widens each byte to the char with the same numeric value.
    public static char[] GetChars(byte[] b)
    {
        char[] c = new char[b.Length];
        for (int i = 0; i < b.Length; ++i)
        {
            c[i] = (char)b[i];
        }
        return c;
    }

Fabio <znt.fa***@virgilio.it> wrote:
> I had the same problem: encoding convert the byte into char using their
> rules, but I just want the char that corrispond to the byte without any
> conversion.

That's like saying you want the English that corresponds to a French word without any translation.

> The solution is boring but quite simple: read and write the file as a byte
> array and restore it casting each byte into char and back:
>
> public static byte[] GetBytes(string s)
> {
>     byte[] b = new byte[s.Length];
>     for (int i = 0; i < b.Length; ++i)
>     {
>         b[i] = (byte)s[i];
>     }
>     return b;
> }

That's effectively using ISO-Latin-1 encoding. It's still an encoding.

--
Jon Skeet - <sk***@pobox.com>

"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message
> Fabio <znt.fa***@virgilio.it> wrote:
>> I had the same problem: encoding convert the byte into char using their
>> rules, but I just want the char that corrispond to the byte without any
>> conversion.
>
> That's like saying you want the English that corresponds to a French
> word without any translation.

Mmmm... what I want is a "double way" conversion.
Each character in the ASCII table has a corresponding byte, so I want a char converted to a byte to be reversible, giving back the original value.

All the encoders exposed by System.Text that I tried do some transformation on the value, and the original information can be lost.

> That's effectively using ISO-Latin-1 encoding. It's still an encoding.

Can I have this behavior directly via some .NET encoder?

Thanks

Fabio <znt.fa***@virgilio.it> wrote:
> > Fabio <znt.fa***@virgilio.it> wrote:
> >> I had the same problem: encoding convert the byte into char using their
> >> rules, but I just want the char that corrispond to the byte without any
> >> conversion.
> >
> > That's like saying you want the English that corresponds to a French
> > word without any translation.
>
> Mmmm... what I want is a "double way" conversion.
> Each character into the ASCII table has a corrisponding byte, so I want that
> a char converted to a byte can be reversed having back the original value.

There's no way you can do that with a single byte, as a char is a 16-bit value and a byte is an 8-bit value.

> All the encoders exposed by System.Text I tried do some transformation on
> the value and the original information can be lost.

If you want to encode arbitrary binary data as text data and then decode it, you should use Base64 - that's what it's there for. Pretty much any other scheme is asking for trouble.

If you want to encode arbitrary Unicode text data as binary data, I'd normally suggest using UTF-8. It's efficient for "mainly ASCII" text, and covers the whole of Unicode.

> > That's effectively using ISO-Latin-1 encoding. It's still an encoding.
>
> Can I have this bheavior directly via some .Net encoder?

You can use Encoding.GetEncoding(28591), but be aware that between 128 and 159 there's a bit of a no-man's-land. There's contradictory evidence, but some of it points to ISO-8859-1 not having any characters defined in that range.

--
Jon Skeet - <sk***@pobox.com>

"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message news:MPG.1f3c779036def0ec98d38b@msnews.microsoft.com...
> There's no way you can do that with a single byte, as a char is a
> 16-bit value and a byte is an 8-bit value.

Wait.
Let's take VB6 for a moment.
It uses Unicode strings, but Chr(200) (the "È", for example) is always perfectly reversible into 200 using Asc("È").

This works well if I use (char)200 <--> (byte)'È'.
There is no way to do it if I use an encoder: the char I encode is not returned correctly when I decode it, i.e. I can encode "È" into a byte value (of course using a NON double-byte encoder) and when I decode it back I could get a "§".

This is not so good when I do communications via sockets or via RS232.

The ASCII table gives a number (and only one) for each char.
An encoder/decoder seems to assign different chars to the same number, or seems to lose information, so that on decoding the number I could get a char that is not the one I encoded.

Fabio Z wrote:
> > There's no way you can do that with a single byte, as a char is a
> > 16-bit value and a byte is an 8-bit value.
>
> Wait.
> Let's take for a moment VB6.
> It uses Unicode strings, but Chr(200) (the "È" for example) is always
> perfectly reversible into 200 using ASC("È").
>
> This works well if I use (char)200 <--> (byte)'È'.

Are you suggesting that VB magically manages to represent 65536 different values in a single byte? I suspect you'll find there are plenty of Unicode characters (actually UCS-2 characters - let's not go into full Unicode > U+FFFF for the moment) for which ASC doesn't work on systems with a fixed single-byte default character encoding.

> There is no way if I use an encoder: the char I encode is not returned
> correctly decoding it, i.e. I can encode "È" into a byte value (of course
> using a NON double byte encoder) and when I decode it back I could get a
> "§".

If you use the same encoding for both encoding and decoding, *and* if that encoding supports the character you wish to encode, it will always return the correct character.

> This is not so good when I do comunications via socket or via RS232.

Well, it's not so good if you don't use the same encoding on both sides...

> The ASCII table give a number (and only one) for each char.
> Encoder/Decoder seems to assign different chars to the same number or seems
> to lost informations so decoding the number I could get a char that is not
> the one encoded.

You still seem to be confused as to the purpose of encodings. Please read http://www.pobox.com/~skeet/csharp/unicode.html

Jon

"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message
> If you use the same encoding for both encoding and decoding, *and* if
> that encoding supports the character you wish to encode, it will always
> return the correct character.

I could be confused about this, but I'm not so stupid as to use different encoders to encode and decode.

If I get some time I'll provide an example.

Fabio Z wrote:
> "Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message
> > If you use the same encoding for both encoding and decoding, *and* if
> > that encoding supports the character you wish to encode, it will always
> > return the correct character.
>
> I could be confused about this but I'm not so stupid to use different
> encoders to encode and decode.

And similarly the designers of encodings aren't so stupid as to stop you from encoding and then decoding to get back the original text :)

> If I get some time I'll provide an example.

That would be good. I suspect you'll find it hard to provide one without including characters which aren't supported by the chosen encoding (or a code error).

Jon

"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message
>> If I get some time I'll provide an example.
>
> That would be good. I suspect you'll find it hard to provide one
> without including characters which aren't supported by the chosen
> encoding (or a code error).

:)

OK. While you wait for it, can you give me an example that converts a byte[] containing all the 0..255 byte values to a string, and that converts it back to the original byte array?

Fabio Z wrote:
> "Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message
<snip>
> Ok, waiting for it, can you give me an example that can convert a byte[]
> that contains all the 0..255 byte values to a string and that convert it
> back to the original byte array.

Sure - although it's not a good idea (see later).

    using System;
    using System.Text;

    class Test
    {
        static void Main()
        {
            // Every possible byte value, once each.
            byte[] b = new byte[256];
            for (int i=0; i < 256; i++)
            {
                b[i] = (byte)i;
            }

            // Code page 28591 is ISO-8859-1.
            Encoding enc = Encoding.GetEncoding(28591);
            string x = enc.GetString(b);
            byte[] o = enc.GetBytes(x);

            Console.WriteLine ("Length={0}", o.Length);
            for (int i=0; i < 256; i++)
            {
                if (o[i] != i)
                {
                    Console.WriteLine ("Difference at index {0}", i);
                }
            }
        }
    }

Now, that's demonstrating that it happens to work, but it's not a good way of encoding arbitrary binary data. To do that, I'd recommend using Base64 - Convert.ToBase64String and Convert.FromBase64String.

Encodings should be used when you *start* with text data, encode it to binary, and then decode that binary to text data. Decoding binary data which didn't really start off as text and then get encoded is a bad idea.

Jon

Fabio Z wrote:
> "Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message
<snip>
> Ok, waiting for it, can you give me an example that can convert a byte[]
> that contains all the 0..255 byte values to a string and that convert it
> back to the original byte array.

You didn't say anything about requiring the string to contain the same number of characters as the byte[] array has members, so:

    using System;

    class Program
    {
        static void Main(string[] args)
        {
            byte[] data = new byte[1024];

            for (int i = 0; i <= 255; i++)
                data[i] = data[i + 256] = data[i + 512] = data[i + 768] = (byte)i;

            // I have some byte data, but I can't print it!
            string printable = Convert.ToBase64String(data);

            // Now I have the data in printable form, look:
            Console.WriteLine(printable);

            // I should be able to get the data back, of course:
            byte[] data2 = Convert.FromBase64String(printable);

            // is it the same?
            bool theSame = data.Length == data2.Length;
            for (int i = 0; theSame && i < data.Length; i++)
            {
                if (data[i] != data2[i])
                    theSame = false;
            }

            if (theSame)
                Console.WriteLine("Data is the same after transformation");
            else
                Console.WriteLine("Data is NOT the same!!!!");

            Console.ReadLine();
        }
    }

--
Larry Lard
larryl***@googlemail.com
The address is real, but unread - please reply to the group
For VB and C# questions - tell us which version

> Ok, waiting for it, can you give me an example that can convert a byte[]
> that contains all the 0..255 byte values to a string and that convert it
> back to the original byte array.

Wrong.
Most encodings have undefined areas and do not cover the complete range from 0 to 255. So some values will not be converted to Unicode (because they are not allocated in the original encoding, to begin with).
If 0..255 is what you need, then it is not text data, it is binary data, and you should use some other way to convert it to text for transfer (MIME, BinHex, etc.).

--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

"Mihai N." <nmihai_year_2***@yahoo.com> wrote in message
> If 0..255 is what you need, then is no text data, is binary data,
> and you should use some other ways to convert to text for transfer
> (MIME, BinHex, etc.).

A string is not just "text".
It is a sequence of chars (which in memory are bytes).

So I think you are all definitely talking a different language than me about this issue.

My initial code works well and cannot be replaced by some trick such as MIME or Base64 encoding, which transforms the original value.

The old CopyMemory() did the work the way I want, because it does not say to itself "oh! this is not text! I refuse to convert it to bytes".
It treats strings for what they are: a sequence of bytes, nothing more, nothing less.

;)

Fabio <znt.fa***@virgilio.it> wrote:
> > If 0..255 is what you need, then is no text data, is binary data,
> > and you should use some other ways to convert to text for transfer
> > (MIME, BinHex, etc.).
>
> A string is not just "text".
> Is a sequence of chars (that in memory are bytes).

The in-memory encoding happens to be UTF-16. It's almost irrelevant though.

> So I think you all are definetively talking in a different language than me
> about this issue.
>
> My initial code works well and cannot be replaced by some trick such as Mime
> or Base64 encoding, that transforms the original value.

When you're passing binary data around as text, you really want to make sure it doesn't get screwed up by systems which assume null-terminated strings etc. Base64 copes with this. Your code doesn't.

> The old CopyMemory() did the work as I want to, because it does not say
> itself "oh! this is not text! I refuse to convert it to bytes".
> It treats strings for what they are: a sequence of byte, nothing more,
> nothing less.

You're doomed to run into encoding issues with that mentality, I'm afraid. Treat binary data as binary data, text as text, and encode between the two in rigidly defined ways. Anything else leads to problems.

--
Jon Skeet - <sk***@pobox.com>

"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message
> You're doomed to run into encoding issues with that mentality, I'm
> afraid. Treat binary data as binary data, text as text, and encode
> between the two in rigidly defined ways. Anything else leads to
> problesm.

Ok :)
With my mentality I'm doomed to make serial port and socket communications work [efficiently] :)

With the "bug free text encoding mentality" they don't.

I'll accept my doom on this argument :)

I'll leave Base64 and MIME encoding to their role: sending and receiving e-mails.

Fabio <znt.fa***@virgilio.it> wrote:
> > You're doomed to run into encoding issues with that mentality, I'm
> > afraid. Treat binary data as binary data, text as text, and encode
> > between the two in rigidly defined ways. Anything else leads to
> > problesm.
>
> Ok :)
> With my mentality I'm doomed to make serial port and sockets comunications
> works [efficiently] :)

Serial ports and sockets deal with binary data. If you've got binary data you want to send across serial ports and sockets, you shouldn't be converting it to or from a string to start with.

> With "bug free text encoding mentality" them don't.
>
> I'll accept my doom on this argument :)
>
> I'll leave to Base64 and Mime encoding their role: sending and receiving
> e-mails.

I don't remember anyone other than yourself bringing up mime encoding (although I could be wrong). Base64 has plenty of uses outside email.

--
Jon Skeet - <sk***@pobox.com>

> A string is not just "text".
> Is a sequence of chars (that in memory are bytes).

I am not sure what you mean. Text is also "a sequence of chars".
What is the difference between text and string?
The main difference between text/string and "just bytes" is that not every sequence of bytes constitutes valid text.

> So I think you all are definetively talking in a different language than me
> about this issue.

Probably.

> My initial code works well and cannot be replaced by some trick such as
> Mime or Base64 encoding, that transforms the original value.

Then

> The old CopyMemory() did the work as I want to, because it does not say
> itself "oh! this is not text! I refuse to convert it to bytes".

The only code I have seen from you is this:

    public static byte[] GetBytes(string s)
    {
        byte[] b = new byte[s.Length];
        for (int i = 0; i < b.Length; ++i)
        {
            b[i] = (byte)s[i];
        }
        return b;
    }

which casts from a character (16 bits) to a byte (8 bits). So it is 100% sure to lose information.
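
As an illustration of the narrowing cast discussed here (a sketch added for readers, not code from any poster; the class name and sample characters are chosen purely for demonstration), the following shows which chars survive a char-to-byte-to-char round trip and which do not:

```csharp
using System;

class NarrowingCastDemo
{
    static void Main()
    {
        // 'A' (U+0041), section sign (U+00A7) and euro sign (U+20AC).
        char[] samples = { 'A', '\u00A7', '\u20AC' };

        foreach (char original in samples)
        {
            // The cast keeps only the low 8 bits of the 16-bit char;
            // unchecked makes that truncation explicit.
            byte narrowed = unchecked((byte)original);
            char restored = (char)narrowed;

            Console.WriteLine("U+{0:X4} -> 0x{1:X2} -> U+{2:X4} ({3})",
                (int)original, narrowed, (int)restored,
                restored == original ? "round-trips" : "information lost");
        }
    }
}
```

The first two samples come back unchanged because their values fit in 8 bits; the euro sign comes back as U+00AC, a different character, which is the loss being described. Whether that matters in practice depends on whether the strings can ever contain chars above U+00FF.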
> It treats strings for what they are: a sequence of byte, nothing more,
> nothing less.

Nope. Strings are "a certain type of sequence of bytes".
Any string is a sequence of bytes, but not every sequence of bytes is a string.

--
Mihai Nita [Microsoft MVP, Windows - SDK]

Fabio wrote:
<snip>
> Mmmm... what I want is a "double way" conversion.
> Each character into the ASCII table has a corrisponding byte, so I want that
> a char converted to a byte can be reversed having back the original value.
>
> All the encoders exposed by System.Text I tried do some transformation on
> the value and the original information can be lost.
<snip>
> Can I have this bheavior directly via some .Net encoder?

The Windows ANSI encoding (encoding number 1252) usually works for me, because (AFAIK) it doesn't apply any transformation to the individual byte, i.e., there's a mapping from each byte value to each ANSI char, for a total of 256 possible chars (control chars included).

    Dim E As System.Text.Encoding = _
        System.Text.Encoding.GetEncoding(1252)

Many encodings use two or four bytes in the representation of a char; others use a multibyte system where some specific byte values indicate that the following sequence is a multibyte char.

This is not the case with the ANSI encoding. In ANSI, each byte value matches a corresponding char. Of course, if the string you're encoding contains chars outside the ANSI range, such chars will be misrepresented. Also, if you read a non-ANSI sequence of bytes and convert them to a string using ANSI, you'll probably get some strange results.

HTH.

Regards,

Branco.

Branco Medeiros wrote:
> > All the encoders exposed by System.Text I tried do some transformation on How can you say there isn't any transformation, and then talk about> > the value and the original information can be lost. > <snip> > > Can I have this bheavior directly via some .Net encoder? > > The Windows ANSI encoding (engoding number 1252) usually works for me, > because (AFAIK) it doesn't apply any transformation to the individual > byte, i.e., there's a mapping from each byte value to each ANSI char, > for a total of 256 possible chars (control chars included). there being a mapping from each byte value to a character? That *is* the transformation. Talking about "the" Windows ANSI Encoding is like talking about "the" extended ASCII encoding. There are lots of different encodings which exhibit the same behaviour as 1252, i.e. they have a mapping from any byte to one of the 256 characters they represent. Each represents a different set of 256 characters. > This is not the case with the ANSI encoding. In ANSI, each byte value Exactly - so it's like any other encoding: you've got to make sure you> matches a corresponding char. Of course, if the string you're encoding > contains chars outside the ANSI range, such chars will be > misrepresented. Also, if you read a non-ansi sequence of bytes and > convert them to a string using ANSI, you'll probably get some strange > results. use the right one. Code page 1252 has no magic powers. Jon Jon Skeet [C# MVP] wrote (inline):
<snip>
> How can you say there isn't any transformation, and then talk about
> there being a mapping from each byte value to a character? That *is*
> the transformation.

I thought it was clear that the kind of transformation I was talking about had to do with dropping control chars or composition of chars outside the Ansi range (codes 0 to 255). Of course, mapping a single byte to the corresponding (Ansi) char is the actual transformation. Thanks for pointing it out.

> Talking about "the" Windows ANSI Encoding is like talking about "the"
> extended ASCII encoding. There are lots of different encodings which
> exhibit the same behaviour as 1252, i.e. they have a mapping from any
> byte to one of the 256 characters they represent. Each represents a
> different set of 256 characters.

I guess you're right when you say that there are other encodings that act like the Ansi encoding, i.e., provide a one-to-one mapping from byte to char. It would be nice if someone (yourself, perhaps) took the time to identify them. People having to deal with legacy encodings would certainly appreciate that.

On the other hand, I assume that there is *the* Ansi encoding, comprising the 256 chars chosen by Microsoft to represent the Western European Latin char set, loosely based on an ANSI draft of the time (thus the characterization as Windows-Ansi), which is code page 1252. Of course, I may be wrong.

<snip>
> Code page 1252 has no magic powers.

:-)) No, it certainly hasn't.

Best regards,

Branco.

Branco Medeiros <branco.medei***@gmail.com> wrote:
> <snip>
> > How can you say there isn't any transformation, and then talk about
> > there being a mapping from each byte value to a character? That *is*
> > the transformation.
>
> I thought it was clear that the kind of transformation I was talking
> about had to do with dropping control chars or composition of chars
> outside the Ansi range (codes 0 to 255).

No - although *something* has to happen to characters outside the range of the character set. (Note that Windows-1252 is definitely *not* Unicode 0-255. They differ in the range 128 to 159 inclusive.)

> Of course, mapping a single
> byte to the corresponding (Ansi) char is the actual transformation.
> Thanks for pointing it out.

And that's the same kind of thing that other encodings do, except they may not be single byte to single char.

> > Talking about "the" Windows ANSI Encoding is like talking about "the"
> > extended ASCII encoding. There are lots of different encodings which
> > exhibit the same behaviour as 1252, i.e. they have a mapping from any
> > byte to one of the 256 characters they represent. Each represents a
> > different set of 256 characters.
>
> I guess you're right when you say that there are other encodings that
> act like the Ansi encoding, i.e., provide a one to one mapping from
> byte to char. It would be nice if someone (yourself, perhaps) took the
> time to identify them. People having to deal with legacy encodings
> would certainly appreciate that.
>
> On the other hand, I assume that there is *the* Ansi encoding,
> comprising the 256 chars chosen by Microsoft to represent the Western
> European latin char set, loosely based on a ANSI draft of the time
> (thus the characterization as Windows-Ansi), which is code page 1252.
> Of course, I may be wrong.

I *think* you are, I'm afraid.

http://www.stylusstudio.com/xsllist/200205/post01200.html and http://www.stylusstudio.com/xsllist/200205/post61190.html have a bit more information.
For another example of a character encoding which could be regarded as an "ANSI" encoding, consider ASCII. This is also known as ANSI_X3.4-1968 (according to http://www.iana.org/assignments/character-sets).

I *believe* people often talk about whatever their default 256-character encoding is as an "ANSI encoding" - and that's not always Windows-1252.

For more evidence of this, see http://en.wikipedia.org/wiki/Code_page#Windows_.28ANSI.29_code_pages

In particular:

<quote>
Microsoft defined a number of code pages known as the ANSI code pages (as the first one, 1252 was based on an ansi draft of what became ISO 8859-1).
</quote>

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too

> On the other hand, I assume that there is *the* Ansi encoding,
> comprising the 256 chars chosen by Microsoft to represent the Western
> European latin char set, loosely based on a ANSI draft of the time
> (thus the characterization as Windows-Ansi), which is code page 1252.
> Of course, I may be wrong.

What MS documentation means when it says "ANSI code page" is not specifically 1252. It is the "default system code page" and depends on the system locale. It is 932 on Japanese systems, 1251 on Russian, and so on (you can get the ANSI CP for a locale by using GetLocaleInfo with LOCALE_IDEFAULTANSICODEPAGE).

--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
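The points made in this exchange (a single-byte encoding maps each byte to one character, and Windows-1252 is not the same mapping as Unicode 0-255) can be checked with a few lines of VB.NET. This is only a minimal sketch: `RoundTrips` and `CodePointOf` are illustrative helper names, not anything from the thread or the framework, and the call to `Encoding.GetEncoding(1252)` assumes a runtime where code page 1252 is available (on .NET Core and later that is assumed to require registering `CodePagesEncodingProvider` first).

```vbnet
Imports System
Imports System.Text

Module RoundTripDemo
    ' Returns True if every byte value 0..255 survives bytes -> String -> bytes.
    Function RoundTrips(ByVal enc As Encoding) As Boolean
        Dim original(255) As Byte          ' upper bound 255 = 256 elements
        For i As Integer = 0 To 255
            original(i) = CByte(i)
        Next
        Dim back() As Byte = enc.GetBytes(enc.GetString(original))
        If back.Length <> original.Length Then Return False
        For i As Integer = 0 To 255
            If back(i) <> original(i) Then Return False
        Next
        Return True
    End Function

    ' Unicode code point that a single byte decodes to under the given encoding.
    Function CodePointOf(ByVal enc As Encoding, ByVal b As Byte) As Integer
        Dim decoded As String = enc.GetString(New Byte() {b})
        Return Convert.ToInt32(decoded.Chars(0))
    End Function

    Sub Main()
        Dim latin1 As Encoding = Encoding.GetEncoding(28591)  ' ISO-8859-1
        ' Assumption: code page 1252 is available on this runtime.
        Dim cp1252 As Encoding = Encoding.GetEncoding(1252)   ' Windows-1252

        Console.WriteLine("Latin-1 round trip intact: {0}", RoundTrips(latin1))
        Console.WriteLine("0x80 as Latin-1: U+{0:X4}", CodePointOf(latin1, &H80))
        Console.WriteLine("0x80 as 1252:    U+{0:X4}", CodePointOf(cp1252, &H80))
    End Sub
End Module
```

Byte 0x80 should come out as U+0080 under ISO-8859-1 and as U+20AC (the euro sign) under Windows-1252, which is the 128 to 159 difference mentioned above. ISO-8859-1 is the encoding whose byte values coincide with Unicode 0-255, so it is the one that gives a plain "double way" byte/char conversion.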
"Zhiv Kurilka" <Zhiv.Kuri***@LozhkaVil.ca> wrote in message Sounds to me like you are running into Unicode encoding - characters encodednews:efDykgxtGHA.2232@TK2MSFTNGP05.phx.gbl... > Hi, > I have a text file with following content: > "((^)|(.* +))§§§§§§§§" > > if I read it with: > k=System.IO.StreamReader( "file.txt",System.Text.Encoding.ASCII); > > k.readtotheend() > > > > Then "§" is lost > > If read it with UTF7 > > then "+" is lost. > > Please help, how can I read the file into string so that I have all > characters? with both big-endian and little-endian. Try using Encoding.Unicode. See if that helps.