
Problem with embedded carriage returns

Author
22 May 2009 12:44 AM
Harry
Hi All

I have a .csv file that contains newline chars embedded between quotes in a
line of text, e.g.

BSPADV1,John.public,9413,"Sharrock Ashley
TEST STREET 1
TEST NSW 2200",Address Insufficient,,,Mbase Print
Report,R7TDKPFMDBCLKE07CGJMFKKW6VVB/21,Sharrock Ashley

There are actually some 19 columns of data, but when the
StreamReader.ReadLine method reads line by line, it only returns data up to
the first embedded newline chars. Interestingly, the data does display
correctly in Notepad, so I guess Notepad must ignore newline chars inside
quotation marks.

Is there any way to read the above line and get the full line of data? My
only thought so far is to use the ReadToEnd method and then try to remove
the newline chars between quotes programmatically before splitting on the
"real" newline chars.

Thanks for any help

Author
22 May 2009 1:16 AM
Herfried K. Wagner [MVP]
"Harry" <harryNoSpam@ffapaysmart.com.au> wrote:
> I have a .csv file that contains newline chars embedded between quotes in
> a line of text eg
>
> BSPADV1,John.public,9413,"Sharrock Ashley
> TEST STREET 1
> TEST NSW 2200",Address Insufficient,,,Mbase Print
> Report,R7TDKPFMDBCLKE07CGJMFKKW6VVB/21,Sharrock Ashley
>
> There are actually some 19 columns of data but when a
> StreamReader.ReadLine method tries to read line by line, it only returns
> data up to the first embedded newline chars. Interestingly, the data does
> display correctly in Notepad, so I guess Notepad must ignore newline chars
> inside quotation marks.

What do you mean by "displays correctly"?  Notepad just displays the text
contained in the file.

> Is there anyway to read the above line and get the full line of data? My
> only thought so far is to use the ReadToEnd method and then try and remove
> the newline chars between quotes programatically before Spltting on the
> "real" newline chars.

This would be one possible approach.  You may want to take a closer look at
regular expressions for simple "parsing" of the text file.  Alternatively
you may want to read the file line by line, analyze each line, and manually
concatenate the parts of a row that is split across multiple lines.
However, the best approach depends on what exactly you want to achieve.

--
M S   Herfried K. Wagner
M V P  <URL:http://dotnet.mvps.org/>
V B   <URL:http://dotnet.mvps.org/dotnet/faqs/>
Author
22 May 2009 4:33 AM
Harry Strybos
Thanks Herfried for your answer and your excellent support of this group.

I have solved the problem by getting the StreamReader to load the entire
file into a string var and then doing a
buffer.Split(CChar(Environment.NewLine)). I then read all the lines from the
array produced.

Seems StreamReader.ReadLine only grabs a row of data up to the first newline
char it encounters (kinda makes sense).

The method I have now employed seems to ignore any newline characters
enclosed in quotation marks.

Thanks again for your help

Harry
Author
22 May 2009 4:53 AM
Stephany Young
Your issue stems from your considering a 'newline' to be a single character.

Under Windows, it is, in fact, a pair of characters comprising a Carriage
Return character and a Line Feed character in that sequence. It is often
referred to as a Cr/Lf pair.

If what you describe is correct then the 'newline' embedded between quotes
is not a 'newline' at all.

I suspect it is a single character and is either a Cr or a Lf.

Notepad only recognizes a Cr/Lf pair as a line break and therefore would
show your 'line' unbroken. I further suspect that where the embedded
'newline' should have been, Notepad would have shown an 'unprintable'
character which looks like a hollow rectangle.

The StreamReader.ReadLine() method defines a line as a sequence of
characters followed by a Line Feed character, a Carriage Return character or
a Cr/Lf pair.

In your case, the 'ReadToEnd' and a split on Environment.NewLine is the
appropriate course of action.
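
A minimal sketch of the ReadToEnd-and-split approach, assuming the real row terminator is the full Cr/Lf pair and any break inside a quoted field is a lone Cr or Lf. The module and function names are hypothetical. Note that CChar applied to a string yields only its first character, so CChar(Environment.NewLine) is a lone Cr on Windows; the sketch passes the whole two-character string to Split instead.

```vb
Imports System
Imports System.IO
Imports Microsoft.VisualBasic

Module SplitWholeFileSketch

    ' Reads everything, then splits only on the two-character Cr/Lf
    ' sequence, so a lone Cr or lone Lf inside a field stays in its row.
    Function SplitRows(ByVal reader As TextReader) As String()
        Dim buffer As String = reader.ReadToEnd()
        Return buffer.Split(New String() {vbCrLf}, _
                            StringSplitOptions.RemoveEmptyEntries)
    End Function

    Sub Main()
        ' Two rows; the first has a lone Cr inside a quoted field.
        Dim sample As String = "a,""x" & vbCr & "y"",c" & vbCrLf & _
                               "d,e,f" & vbCrLf
        Using reader As New StringReader(sample)
            For Each row As String In SplitRows(reader)
                ' Make the embedded Cr visible when printing.
                Console.WriteLine(row.Replace(vbCr, "<CR>"))
            Next
        End Using
    End Sub

End Module
```

For a real file, a StreamReader opened on the .csv would take the place of the StringReader. The whole file is held in memory, so this only suits small inputs.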


Author
22 May 2009 5:39 AM
Tom Shelton
On 2009-05-22, Stephany Young <noone@localhost> wrote:
> In your case, the 'ReadToEnd' and a split on Environment.NewLine is the
> appropriate course of action.
>

Unless he's dealing with large data files, in which case it's a recipe for
disaster.

--
Tom Shelton
Author
22 May 2009 8:14 PM
Harry
You are correct, Tom. However, the files in question are always quite small,
e.g. 4 to 10 rows of data.

I am intrigued, however, as to how you would go about reading in a large
file of this nature.

Stephany made a correct observation, btw, that the characters in between the
quotes were, in fact, Chr(13) only. I should have noticed that.

Cheers
Author
22 May 2009 9:00 PM
Stephany Young
Well now you've opened a can of worms Harry.

There is no silver bullet for this and there are a myriad of techniques you
can use.

The first thing you need to do is consider the provenance of the input file.

Is the program that creates the file one that you have some influence
over?

If so, then the technique that I would favour is to have it write what I
call a special 'guard' character at the end of each full line. To read such a
file you would then use the ReadLine() method in a loop in the usual manner,
but, within each iteration of the loop, check to see if your buffer ends
with the 'guard' character. If it does then you have the full line. If it
doesn't then repeatedly call the ReadLine() method and append the result to
your buffer until it does. Once you have a full line then process it in the
normal way and continue the main loop.
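
The guard-character loop could be sketched roughly as follows. The "|" guard, the module name and the function name are assumptions made for illustration; the guard must be something that never occurs in the data. Because ReadLine() discards the break it stopped at, the sketch puts a single space back in its place.

```vb
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Text
Imports Microsoft.VisualBasic

Module GuardCharSketch

    ' Hypothetical guard marker; it must never occur inside the data itself.
    Const Guard As String = "|"

    Function ReadGuardedRows(ByVal reader As TextReader) As List(Of String)
        Dim rows As New List(Of String)
        Dim buffer As New StringBuilder()
        Dim line As String = reader.ReadLine()
        While line IsNot Nothing
            ' ReadLine dropped the embedded break; a space stands in for it.
            If buffer.Length > 0 Then buffer.Append(" "c)
            buffer.Append(line)
            If line.EndsWith(Guard, StringComparison.Ordinal) Then
                ' Full row collected: strip the guard and emit the row.
                buffer.Length -= Guard.Length
                rows.Add(buffer.ToString())
                buffer.Length = 0
            End If
            line = reader.ReadLine()
        End While
        ' Keep an unterminated tail rather than silently dropping it.
        If buffer.Length > 0 Then rows.Add(buffer.ToString())
        Return rows
    End Function

    Sub Main()
        Dim sample As String = "a,""x" & vbCr & "y"",c|" & vbCrLf & _
                               "d,e,f|" & vbCrLf
        Using reader As New StringReader(sample)
            For Each row As String In ReadGuardedRows(reader)
                Console.WriteLine(row)
            Next
        End Using
    End Sub

End Module
```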

If not, then you need to jump through some hoops. If the input file is
well-formed in that an opening quote always has a closing quote and the
extra 'newline's always fall between a pair of quotes then instead of
checking for a 'guard' character you could check to see if your buffer has
an even number of quote characters. Again append the result of interim
ReadLine() method calls until it does. This technique, of course, is less
efficient because it requires constant inspection of individual characters in
strings.
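
A rough sketch of the quote-parity check, under the same well-formedness assumption stated above (every opening quote is closed, and stray breaks only occur between quotes). All names are hypothetical. To limit the rescanning cost, it keeps a running quote count and only inspects each newly read piece.

```vb
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Text
Imports Microsoft.VisualBasic

Module QuoteParitySketch

    ' Counts double-quote characters in one piece of text.
    Function CountQuotes(ByVal text As String) As Integer
        Dim n As Integer = 0
        For Each ch As Char In text
            If ch = """"c Then n += 1
        Next
        Return n
    End Function

    Function ReadQuotedRows(ByVal reader As TextReader) As List(Of String)
        Dim rows As New List(Of String)
        Dim buffer As New StringBuilder()
        Dim quotes As Integer = 0
        Dim line As String = reader.ReadLine()
        While line IsNot Nothing
            ' ReadLine dropped the embedded break; a space stands in for it.
            If buffer.Length > 0 Then buffer.Append(" "c)
            buffer.Append(line)
            quotes += CountQuotes(line)
            ' An even count means no quoted field is still open.
            If quotes Mod 2 = 0 Then
                If buffer.Length > 0 Then rows.Add(buffer.ToString())
                buffer.Length = 0
                quotes = 0
            End If
            line = reader.ReadLine()
        End While
        ' Unbalanced tail: keep it so nothing is lost silently.
        If buffer.Length > 0 Then rows.Add(buffer.ToString())
        Return rows
    End Function

    Sub Main()
        Dim sample As String = _
            "BSPADV1,John.public,9413,""Sharrock Ashley" & vbCr & _
            "TEST STREET 1" & vbCr & _
            "TEST NSW 2200"",Address Insufficient" & vbCrLf & _
            "d,e,f" & vbCrLf
        Using reader As New StringReader(sample)
            For Each row As String In ReadQuotedRows(reader)
                Console.WriteLine(row)
            Next
        End Using
    End Sub

End Module
```

A doubled quote used as an escape inside a quoted field adds two to the count, so it does not disturb the parity test.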

Another technique is to forget about the ReadLine() method completely,
decide on an arbitrary 'block' size and read that number of characters at a
time. For each block, find the last instance of a Cr/Lf pair. Append
everything before that pair to the remainder from the previous read as your
substantive block and save everything after that pair as the remainder.
Split the substantive block on Cr/Lf and process each line in the normal
way.
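
The block-reading technique might look something like this. It is only a sketch: the names are hypothetical, the block size is arbitrary, and it assumes that real rows end with a Cr/Lf pair while embedded breaks are a lone Cr or Lf.

```vb
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports Microsoft.VisualBasic

Module BlockReadSketch

    Function ReadRowsInBlocks(ByVal reader As TextReader, _
                              ByVal blockSize As Integer) As List(Of String)
        Dim rows As New List(Of String)
        Dim block(blockSize - 1) As Char
        Dim remainder As String = ""
        Dim count As Integer = reader.Read(block, 0, blockSize)
        While count > 0
            ' Prepend what was left over from the previous block.
            Dim text As String = remainder & New String(block, 0, count)
            Dim cut As Integer = text.LastIndexOf(vbCrLf, StringComparison.Ordinal)
            If cut >= 0 Then
                ' Everything before the last Cr/Lf is made of complete rows.
                Dim whole As String = text.Substring(0, cut)
                rows.AddRange(whole.Split(New String() {vbCrLf}, _
                                          StringSplitOptions.RemoveEmptyEntries))
                remainder = text.Substring(cut + 2)
            Else
                ' No complete row yet; carry the text forward.
                remainder = text
            End If
            count = reader.Read(block, 0, blockSize)
        End While
        ' A final row without a trailing Cr/Lf is still a row.
        If remainder.Length > 0 Then rows.Add(remainder)
        Return rows
    End Function

    Sub Main()
        Dim sample As String = "a,""x" & vbCr & "y"",c" & vbCrLf & _
                               "d,e,f" & vbCrLf
        Using reader As New StringReader(sample)
            ' A tiny block size just to exercise the carry-over logic.
            For Each row As String In ReadRowsInBlocks(reader, 4)
                Console.WriteLine(row.Replace(vbCr, "<CR>"))
            Next
        End Using
    End Sub

End Module
```

If a block happens to end between the Cr and the Lf of a pair, the Cr is carried over in the remainder and the pair is found once the next block is appended.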

Mileage will vary when using any of these techniques due to file size,
length of line, amount of memory resources available etc.

The other factor is that everyone that you talk to will have their own
favourite technique that, of course, they consider to be far superior to
anybody else's ;)


Author
23 May 2009 5:24 AM
Harry Strybos
Thanks Stephany

As usual your astute observations are spot on. You have raised an event in
my thought process.

Cheers

Harry
Author
22 May 2009 6:26 AM
Cor Ligthert[MVP]
Stephany.
>
> I suspect it is a single character and is either a Cr or a Lf.
>

I first read that as Stephany writing SMS code for "Cor or a lover".

Before you become angry: I did not really read that. The Cr just made me
read "Cor" at first, as I was reading bottom-up and from right to left.

But in fact it shows why I dislike the use of only a line feed or only a
carriage return as the end of a string.

Most people who use European characters first move their eyes a row lower
and then turn their eyes to the left (or vice versa) when they are reading.

Cor
Author
22 May 2009 6:08 AM
Cor Ligthert[MVP]
Harry,

I don't know if it helps you.

However, in the beginning CSV always had (only) this format:

For English use
    ,     for embedding a numeric
    ","  for embedding a string

For non-English use
    ;    for embedding a numeric
    ";" for embedding a string

I think this is in fact still the same, although a lot of people use
their own format where they simply use only commas.

Cor

Author
23 May 2009 7:53 AM
Mike

Harry, have you tried the built-in FileIO.TextFieldParser class?

It handles embedded EOL characters and commas inside quoted fields,
including quoted fields that are broken across two lines. It's handled for
you. <g> Plus, I presume that it buffers as it reads, and the Windows RTL
already caches text files, so it should be pretty efficient.

Here is an example console application with a ReaderCVS class wrapper
and example usage:

'--------------------------------------------------------------
' File : readcsf.vb
'--------------------------------------------------------------
Option Strict On
Option Explicit On

Imports System
Imports System.Console
Imports System.Collections.Generic

Module Module1

    Public Class ReaderCVS

        ' Parsed rows; each row is a list of field values.
        Public Shared data As New List(Of List(Of String))

        '
        ' Read csv file with max_fields per row, optional eolfilter.
        ' Fields are accumulated across ReadFields() calls until
        ' max_fields have been collected, so a row that is broken over
        ' several physical lines is rejoined.
        '
        Public Shared Function ReadCSF( _
              ByVal fn As String, _
              ByVal max_fields As Integer, _
              Optional ByVal eolfilter As Boolean = True) As Boolean
            Try
                data.Clear()
                ' Using closes the file even if parsing throws.
                Using rdr As FileIO.TextFieldParser = _
                      My.Computer.FileSystem.OpenTextFieldParser(fn)
                    rdr.SetDelimiters(",")

                    Dim flds As New List(Of String)
                    While Not rdr.EndOfData
                        Dim fields As String() = rdr.ReadFields()
                        For Each fld As String In fields
                            Dim value As String = fld
                            If eolfilter Then
                                ' Alternative: show the breaks as visible escapes
                                'value = value.Replace(vbCr, "\r").Replace(vbLf, "\n")
                                value = value.Replace(vbCr, " ").Replace(vbLf, "")
                            End If
                            flds.Add(value)
                            If flds.Count = max_fields Then
                                data.Add(flds)
                                flds = New List(Of String)
                            End If
                        Next
                    End While
                    ' Keep a trailing short row rather than dropping it.
                    If flds.Count > 0 Then data.Add(flds)
                End Using
                Return True

            Catch ex As Exception
                WriteLine(ex.Message)
                Return False
            End Try
        End Function

        ' Print every field with its row and field number.
        Public Shared Sub Dump()
            WriteLine("------- DUMP ")
            Dim nrow As Integer = 0
            For Each flds As List(Of String) In data
                nrow += 1
                Dim nfld As Integer = 0
                For Each f As String In flds
                    nfld += 1
                    WriteLine("- r{0,-3} f{1,-3} => [{2}]", nrow, nfld, f)
                Next
            Next
        End Sub

    End Class

    Sub Main(ByVal args() As String)
        ReaderCVS.ReadCSF("harry1.csv", 19) ' 19 fields
        ReaderCVS.Dump()
        ReadKey(True)
    End Sub

End Module

I used your example text as the harry1.csv test file.

I provided an eolfilter boolean to filter the <CR><LF> and played with
\r\n and space/blank replacements.

--