Problem with embedded carriage returns

I have a .csv file that contains newline chars embedded between quotes in a line of text, e.g.

BSPADV1,John.public,9413,"Sharrock Ashley
TEST STREET 1
TEST NSW 2200",Address Insufficient,,,Mbase Print Report,R7TDKPFMDBCLKE07CGJMFKKW6VVB/21,Sharrock Ashley

There are actually some 19 columns of data, but when a StreamReader.ReadLine method tries to read line by line, it only returns data up to the first embedded newline chars. Interestingly, the data does display correctly in Notepad, so I guess Notepad must ignore newline chars inside quotation marks.

Is there any way to read the above line and get the full line of data? My only thought so far is to use the ReadToEnd method and then try to remove the newline chars between quotes programmatically before splitting on the "real" newline chars.

Thanks for any help
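
A minimal sketch of the idea in the last paragraph of the question: take the whole text (as ReadToEnd would return it), neutralise the line breaks that sit inside quotes, then split on the real ones. It assumes the double quotes are balanced and that real rows end with Cr/Lf. The module name, the function name and the sample data are made up for illustration, and the regular expression is only one possible way to find the quoted spans.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module QuotedNewlineDemo

    ' Replaces every Cr, Lf or Cr/Lf found inside a double-quoted span with a space.
    ' Assumption: quotes are balanced, so each match is one quoted field (or part of one).
    Public Function FlattenQuotedNewlines(ByVal text As String) As String
        Return Regex.Replace(text, """[^""]*""", _
            Function(m As Match) Regex.Replace(m.Value, "\r\n|\r|\n", " "))
    End Function

    Sub Main()
        ' Hypothetical sample: the quoted field contains a lone Cr, rows end with Cr/Lf.
        Dim raw As String = "A,""line 1" & vbCr & "line 2"",B" & vbCrLf & "C,D,E" & vbCrLf

        ' Split on the two-character Cr/Lf sequence only after flattening.
        Dim rows As String() = FlattenQuotedNewlines(raw).Split( _
            New String() {vbCrLf}, StringSplitOptions.RemoveEmptyEntries)

        For Each row As String In rows
            Console.WriteLine(row)
        Next
    End Sub

End Module
```

With a real file, the string passed to FlattenQuotedNewlines would come from StreamReader.ReadToEnd, which means the whole file is held in memory at once.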

"Harry" <harryNoSpam@ffapaysmart.com.au> wrote:

> Interestingly, the data does display correctly in Notepad, so I guess
> Notepad must ignore newline chars inside quotation marks.

What do you mean by "displays correctly"? Notepad just displays the text contained in the file.

> Is there any way to read the above line and get the full line of data? My
> only thought so far is to use the ReadToEnd method and then try to remove
> the newline chars between quotes programmatically before splitting on the
> "real" newline chars.

This would be one possible approach. You may want to take a closer look at regular expressions for simple "parsing" of the text file. Alternatively, you may want to read the file line by line, analyze each line and manually concatenate the parts of a row which is split into multiple lines. However, the best approach depends on what exactly you want to achieve.

--
 M S   Herfried K. Wagner
M V P  <URL:http://dotnet.mvps.org/>
 V B   <URL:http://dotnet.mvps.org/dotnet/faqs/>

"Herfried K. Wagner [MVP]" <hirf-spam-me-here@gmx.at> wrote in message news:%23dukvrn2JHA.1712@TK2MSFTNGP03.phx.gbl...
> This would be one possible approach. You may want to take a closer look
> at regular expressions for simple "parsing" of the text file.

Thanks Herfried for your answer and your excellent support of this group.

I have solved the problem by getting the StreamReader to load the entire file into a string var and then doing a buffer.Split(CChar(Environment.NewLine)). I then read all the lines from the array produced.

Seems StreamReader.ReadLine only grabs a row of data up to the first newline char it encounters (kinda makes sense).

The method I have now employed seems to ignore any newline characters enclosed in quotation marks.

Thanks again for your help

Harry

Your issue stems from your considering a 'newline' to be a single character.

Under Windows, it is, in fact, a pair of characters comprising a Carriage Return character and a Line Feed character, in that sequence. It is often referred to as a Cr/Lf pair.

If what you describe is correct then the 'newline' embedded between quotes is not a 'newline' at all.

I suspect it is a single character and is either a Cr or a Lf.

Notepad only recognizes a Cr/Lf pair as a line-break and therefore would show your 'line' unbroken. I further suspect that where the embedded 'newline' should have been, Notepad would have shown an 'unprintable' character which looks like a hollow rectangle.

The StreamReader.ReadLine() method defines a line as a sequence of characters followed by a Line Feed character, a Carriage Return character or a Cr/Lf pair.

In your case, the 'ReadToEnd' and a split on Environment.NewLine is the appropriate course of action.

"Harry Strybos" <harrystry***@optusnet.com.au> wrote in message news:OwZJqYp2JHA.1092@TK2MSFTNGP06.phx.gbl...
> I have solved the problem by getting the StreamReader to load the entire
> file into a string var and then doing a
> buffer.Split(CChar(Environment.Newline)). I then read all the lines from
> the array produced.

On 2009-05-22, Stephany Young <noone@localhost> wrote:
> In your case, the 'ReadToEnd' and a split on Environment.NewLine is the
> appropriate course of action.

Unless he's dealing with large data files, in which case it's a recipe for disaster.

--
Tom Shelton
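
To make the difference discussed above concrete, here is a small sketch that feeds the same text to StreamReader.ReadLine() and to a split on the Cr/Lf pair. It assumes, as reported in the thread, that the breaks inside the quotes are a lone Cr while real rows end with Cr/Lf. The module and function names and the two-row sample are hypothetical.

```vbnet
Imports System
Imports System.IO
Imports System.Text
Imports Microsoft.VisualBasic

Module LineBreakDemo

    ' Counts the lines that StreamReader.ReadLine() reports for the given text.
    Public Function CountReadLineRows(ByVal text As String) As Integer
        Dim count As Integer = 0
        Using rdr As New StreamReader(New MemoryStream(Encoding.UTF8.GetBytes(text)))
            While rdr.ReadLine() IsNot Nothing
                count += 1
            End While
        End Using
        Return count
    End Function

    ' Splits only on the two-character Cr/Lf sequence, so a lone Cr stays inside its row.
    Public Function SplitOnCrLf(ByVal text As String) As String()
        Return text.Split(New String() {vbCrLf}, StringSplitOptions.RemoveEmptyEntries)
    End Function

    Sub Main()
        ' Hypothetical sample: two rows, the first with lone Cr characters inside the quotes.
        Dim raw As String = "1,""Name" & vbCr & "Street" & vbCr & "Town"",x" & vbCrLf & _
                            "2,""Other"",y" & vbCrLf

        Console.WriteLine("ReadLine rows: {0}", CountReadLineRows(raw))
        Console.WriteLine("Cr/Lf split rows: {0}", SplitOnCrLf(raw).Length)
    End Sub

End Module
```

ReadLine treats each lone Cr as the end of a line and so reports more rows than the file logically has, while the split on the full Cr/Lf pair keeps each quoted field in one piece. The split still needs the whole file in memory, which is the concern raised about large files.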

"Tom Shelton" <tom_shel***@comcastXXXXXXX.net> wrote in message news:uYAVu%23p2JHA.4744@TK2MSFTNGP04.phx.gbl...
> Unless he's dealing with a large data files, then it's a recipe for
> disaster.

You are correct, Tom. However, the files in question are always quite small, e.g. 4 to 10 rows of data.

I am intrigued, however, as to how you would go about reading in a large file of this nature.

Stephany made a correct observation, btw, that the characters in between the quotes were, in fact, chr(13) only. I should have noticed that.

Cheers

Well now you've opened a can of worms Harry.

There is no silver bullet for this and there are a myriad of techniques you can use.

The first thing you need to do is consider the provenance of the input file.

Is the program that creates the file a program that you can have some influence over?

If so, then the technique that I would favour is to have it write what I call a special 'guard' character at the end of each full line. To read such a file you would then use the ReadLine() method in a loop in the usual manner but, within each iteration of the loop, check to see if your buffer ends with the 'guard' character. If it does then you have the full line. If it doesn't then repeatedly call the ReadLine() method and append the result to your buffer until it does. Once you have a full line, process it in the normal way and continue the main loop.

If not, then you need to jump through some hoops. If the input file is well-formed, in that an opening quote always has a closing quote and the extra 'newline's always fall between a pair of quotes, then instead of checking for a 'guard' character you could check to see if your buffer has an even number of quote characters. Again, append the result of interim ReadLine() method calls until it does. This technique, of course, is less efficient because it requires constant inspection of individual characters in strings.

Another technique is to forget about the ReadLine() method completely, decide on an arbitrary 'block' size and read that number of characters at a time. For each block, find the last instance of a Cr/Lf pair. Append everything before that pair to the remainder from the previous read as your substantive block and save everything after that pair as the remainder. Split the substantive block on Cr/Lf and process each line in the normal way.

Mileage will vary when using any of these techniques due to file size, length of line, amount of memory resources available etc.

The other factor is that everyone that you talk to will have their own favourite technique that, of course, they consider to be far superior to anybody else's ;)

"Harry" <harryNoSpam@ffapaysmart.com.au> wrote in message news:%23J0$Fox2JHA.4368@TK2MSFTNGP05.phx.gbl...
> I am intrigued, however, as to how you would go about reading in a large
> file of this nature.
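
The even-number-of-quotes technique described above could be sketched roughly as follows. This is only an illustration under the stated assumption that the file is well-formed (every opening quote has a closing quote and the extra line breaks fall only between quotes). The module and function names are made up, and a space is used to rejoin the pieces because ReadLine() discards the original terminator.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Text
Imports Microsoft.VisualBasic

Module QuoteParityReader

    ' Returns True when the text contains an even number of double quotes.
    Private Function HasBalancedQuotes(ByVal text As String) As Boolean
        Dim count As Integer = 0
        For Each c As Char In text
            If c = """"c Then count += 1
        Next
        Return count Mod 2 = 0
    End Function

    ' Reads logical rows, joining physical lines until the quote count is even.
    Public Function ReadLogicalRows(ByVal rdr As TextReader) As List(Of String)
        Dim rows As New List(Of String)
        Dim line As String = rdr.ReadLine()
        While line IsNot Nothing
            Dim buffer As New StringBuilder(line)
            While Not HasBalancedQuotes(buffer.ToString())
                Dim more As String = rdr.ReadLine()
                If more Is Nothing Then Exit While ' malformed input: unterminated quote
                buffer.Append(" "c).Append(more)
            End While
            rows.Add(buffer.ToString())
            line = rdr.ReadLine()
        End While
        Return rows
    End Function

    Sub Main()
        ' Hypothetical sample; with a real file, pass a StreamReader instead.
        Dim raw As String = "1,""Name" & vbCr & "Street"",x" & vbCrLf & "2,""Other"",y" & vbCrLf
        For Each row As String In ReadLogicalRows(New StringReader(raw))
            Console.WriteLine(row)
        Next
    End Sub

End Module
```

Only one logical row is held in memory at a time, so this suits larger files better than ReadToEnd. Recounting the quotes after every append is the character-by-character inspection cost mentioned in the post.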

"Stephany Young" <noone@localhost> wrote in message news:%23su9TBy2JHA.1416@TK2MSFTNGP04.phx.gbl...
> There is no silver bullet for this and there are a myriad of techniques
> you can use.

As usual your astute observations are spot on. You have raised an event in my thought process.

Thanks Stephany

Cheers

Harry

Stephany,

> I suspect it is a single character and is either a Cr or a Lf.

I thought: is Stephany writing an SMS coded "Cor or a lover"?

Before you become angry, I did not really read that; it was only that the Cr made me first read Cor, as if I were reading bottom-up and from right to left.

But in fact it shows why I dislike the use of only a line feed or only a carriage return as the end of a string. Most persons who use European characters first move their eyes a row lower and then turn their eyes to the left (or vice versa) when they are reading.

Cor

Harry,

I don't know if it helps you. However, in the beginning CSV always had (only) this format:

For English use
,    for embedding a numeric
","  for embedding a string

For non-English use
;    for embedding a numeric
";"  for embedding a string

I think this is in fact still the same, although a lot of people use their own format where they simply use only commas.

Cor

"Harry" <harryNoSpam@ffapaysmart.com.au> wrote in message news:eJtVIan2JHA.240@TK2MSFTNGP06.phx.gbl...
> Hi All
>
> I have a .csv file that contains newline chars embedded between quotes in
> a line of text

Harry wrote:
> Is there anyway to read the above line and get the full line of data?

Harry, have you tried the built-in FileIO.TextFieldParser class?

It handles embedded eol characters, including embedded eol or commas within quoted fields broken into two lines. It's handled for you. <g> Plus, I presume that it is cached as it reads, and Windows RTL already caches text files, so it should be pretty efficient.

Here is an example console application with a ReaderCVS class wrapper and example usage:

'--------------------------------------------------------------
' File : readcsf.vb
'--------------------------------------------------------------
Option Strict On
Option Explicit On

Imports System
Imports System.Console
Imports System.Collections.Generic
Imports Microsoft.VisualBasic

Module module1

  Public Class ReaderCVS

    ' Rows read by the last ReadCSF call; each row is a list of field values.
    Public Shared data As New List(Of List(Of String))

    '
    ' Read cvs file with max_fields, optional eolfilter
    '
    Public Shared Function ReadCSF( _
        ByVal fn As String, _
        ByVal max_fields As Integer, _
        Optional ByVal eolfilter As Boolean = True) As Boolean
      Try
        data.Clear()
        ' Using closes the parser (and the file) even if an error occurs.
        Using rdr As New FileIO.TextFieldParser(fn)
          rdr.SetDelimiters(",")
          Dim flds As New List(Of String)
          While Not rdr.EndOfData
            Dim fields As String() = rdr.ReadFields()
            For Each fld As String In fields
              If eolfilter Then
                'fld = fld.Replace(vbCr,"\r").Replace(vbLf,"\n")
                fld = fld.Replace(vbCr, " ").Replace(vbLf, "")
              End If
              flds.Add(fld)
              ' Start a new row once max_fields values have been collected.
              If flds.Count = max_fields Then
                data.Add(flds)
                flds = New List(Of String)
              End If
            Next
          End While
          ' Keep any trailing partial row.
          If flds.Count > 0 Then data.Add(flds)
        End Using
        Return True
      Catch ex As Exception
        WriteLine(ex.Message)
        Return False
      End Try
    End Function

    ' Print every field together with its row and field number.
    Public Shared Sub Dump()
      WriteLine("------- DUMP ")
      Dim nrow As Integer = 0
      For Each flds As List(Of String) In data
        nrow += 1
        Dim nfld As Integer = 0
        For Each f As String In flds
          nfld += 1
          WriteLine("- r{0,-3} f{1,-3} => [{2}]", nrow, nfld, f)
        Next
      Next
    End Sub

  End Class

  Sub Main(ByVal args() As String)
    ReaderCVS.ReadCSF("harry1.csv", 19) ' 19 fields
    ReaderCVS.Dump()
    ReadKey(True)
  End Sub

End Module

I used your example text as the harry1.csv test file. I provided an eolfilter boolean to filter the <CR><LF> and played with \r\n and space/blank replacements.

--