Home All Groups Group Topic Archive Search About

Suggestions to reduce memory use when splitting a string

Author
3 Apr 2006 6:15 PM
klineb
Good Day,

I have written and utility to convert our DOS COBOL data files to a SQL
Server database.  Part of the process requires parsing each line into a
sql statement and validting the data to keep the integrity of the
database.  We are parsing roughl 81 files and range in size 1 kb to 65
MB files (Average of 400,000 lines in the larger files).

I have written this utility with VB.NET 2003 and when I parse all of
the files I run out of memory.  The following functions seems to be the
main source of my leak.  Any help optimizing this code is appreciated.

Public Function SplitDelimitedLine(ByVal CurrentLine As StringBuilder,
_
            ByRef SplitString() As String) As Boolean
        '7-25-2005
        'BJK
        '
        '--removed the use of Char replaced with: CurrentLine.Chars(i)
        '
        '---------------------------------------
        Dim i As Integer
        Dim CountDelimiter As Boolean
        Dim Total As Integer
        Dim lbResult As Boolean
        Dim Section As New StringBuilder
        Dim liLen As Integer
        Dim liCommaPos As Integer
        Dim liDQuotePos As Integer

        Try
            'We want to count the delimiter unless it is within the
text qualifier
            CountDelimiter = True
            Total = 0
            liLen = CurrentLine.Length - 1
            For i = 0 To liLen
                Select Case CurrentLine.Chars(i)
                    Case gsDoubleQoute
                        If CountDelimiter Then
                            CountDelimiter = False
                        Else
                            CountDelimiter = True
                        End If
                    Case gsComma
                        If CountDelimiter Then
                            ' Add current section to collection
                            SplitString(Total) = Section.ToString.Trim
                            Section = Nothing
                            Section = New StringBuilder
                            Total = Total + 1
                        Else
                            Section.Append(CurrentLine.Chars(i))
                        End If
                    Case Else
                        Section.Append(CurrentLine.Chars(i))
                End Select
            Next
            ' Get the last field - as most files will not have an
ending delimiter
            If CountDelimiter Then
                ' Add current section to collection
                SplitString(Total) = Section.ToString
            End If
            lbResult = True
        Catch ex As Exception
            ps_LastErrSource = ex.Source
            ps_LastErrDesc = ex.ToString
            lbResult = False
            Dim loSB As New StringBuilder(ps_LastErrDesc)
            UpdateLog(loSB)
        End Try
        Return lbResult
    End Function

This function is stored in a class and is called from other function
within this class.
Thanks
Brian

Author
3 Apr 2006 6:38 PM
tommaso.gastaldi
hi Brian,

I would do that using a buffer. You open a StreamWriter and when the
processed text (which you can store in a StringBuilder (sb) ) reaches a
given size, for instance 2MB (or whatever you like, the sb holds more
than 2GB), you can flush it to disk an start again with an empty
buffer. To empty a sb it's sufficient to set the length to 0. When
input ends, you flush to disk what remains in the sb. Remember to close
the StreamWriter.

-t

klineb ha scritto:
Show quoteHide quote
> Good Day,
>
> I have written and utility to convert our DOS COBOL data files to a SQL
> Server database.  Part of the process requires parsing each line into a
> sql statement and validting the data to keep the integrity of the
> database.  We are parsing roughl 81 files and range in size 1 kb to 65
> MB files (Average of 400,000 lines in the larger files).
>
> I have written this utility with VB.NET 2003 and when I parse all of
> the files I run out of memory.  The following functions seems to be the
> main source of my leak.  Any help optimizing this code is appreciated.
>
> Public Function SplitDelimitedLine(ByVal CurrentLine As StringBuilder,
> _
>             ByRef SplitString() As String) As Boolean
>         '7-25-2005
>         'BJK
>         '
>         '--removed the use of Char replaced with: CurrentLine.Chars(i)
>         '
>         '---------------------------------------
>         Dim i As Integer
>         Dim CountDelimiter As Boolean
>         Dim Total As Integer
>         Dim lbResult As Boolean
>         Dim Section As New StringBuilder
>         Dim liLen As Integer
>         Dim liCommaPos As Integer
>         Dim liDQuotePos As Integer
>
>         Try
>             'We want to count the delimiter unless it is within the
> text qualifier
>             CountDelimiter = True
>             Total = 0
>             liLen = CurrentLine.Length - 1
>             For i = 0 To liLen
>                 Select Case CurrentLine.Chars(i)
>                     Case gsDoubleQoute
>                         If CountDelimiter Then
>                             CountDelimiter = False
>                         Else
>                             CountDelimiter = True
>                         End If
>                     Case gsComma
>                         If CountDelimiter Then
>                             ' Add current section to collection
>                             SplitString(Total) = Section.ToString.Trim
>                             Section = Nothing
>                             Section = New StringBuilder
>                             Total = Total + 1
>                         Else
>                             Section.Append(CurrentLine.Chars(i))
>                         End If
>                     Case Else
>                         Section.Append(CurrentLine.Chars(i))
>                 End Select
>             Next
>             ' Get the last field - as most files will not have an
> ending delimiter
>             If CountDelimiter Then
>                 ' Add current section to collection
>                 SplitString(Total) = Section.ToString
>             End If
>             lbResult = True
>         Catch ex As Exception
>             ps_LastErrSource = ex.Source
>             ps_LastErrDesc = ex.ToString
>             lbResult = False
>             Dim loSB As New StringBuilder(ps_LastErrDesc)
>             UpdateLog(loSB)
>         End Try
>         Return lbResult
>     End Function
>
> This function is stored in a class and is called from other function
> within this class.
> Thanks
> Brian
Author
3 Apr 2006 6:44 PM
Marina Levit [MVP]
Have you thought about regular expressions to parse the fields out of each
line? Going character by character seems really inefficient. At the very
last, use functions like IndexOf, to find your delimiters, and parse out the
pieces that way.

Show quoteHide quote
"klineb" <briankl***@hotmail.com> wrote in message
news:1144088149.283259.318880@g10g2000cwb.googlegroups.com...
> Good Day,
>
> I have written and utility to convert our DOS COBOL data files to a SQL
> Server database.  Part of the process requires parsing each line into a
> sql statement and validting the data to keep the integrity of the
> database.  We are parsing roughl 81 files and range in size 1 kb to 65
> MB files (Average of 400,000 lines in the larger files).
>
> I have written this utility with VB.NET 2003 and when I parse all of
> the files I run out of memory.  The following functions seems to be the
> main source of my leak.  Any help optimizing this code is appreciated.
>
> Public Function SplitDelimitedLine(ByVal CurrentLine As StringBuilder,
> _
>            ByRef SplitString() As String) As Boolean
>        '7-25-2005
>        'BJK
>        '
>        '--removed the use of Char replaced with: CurrentLine.Chars(i)
>        '
>        '---------------------------------------
>        Dim i As Integer
>        Dim CountDelimiter As Boolean
>        Dim Total As Integer
>        Dim lbResult As Boolean
>        Dim Section As New StringBuilder
>        Dim liLen As Integer
>        Dim liCommaPos As Integer
>        Dim liDQuotePos As Integer
>
>        Try
>            'We want to count the delimiter unless it is within the
> text qualifier
>            CountDelimiter = True
>            Total = 0
>            liLen = CurrentLine.Length - 1
>            For i = 0 To liLen
>                Select Case CurrentLine.Chars(i)
>                    Case gsDoubleQoute
>                        If CountDelimiter Then
>                            CountDelimiter = False
>                        Else
>                            CountDelimiter = True
>                        End If
>                    Case gsComma
>                        If CountDelimiter Then
>                            ' Add current section to collection
>                            SplitString(Total) = Section.ToString.Trim
>                            Section = Nothing
>                            Section = New StringBuilder
>                            Total = Total + 1
>                        Else
>                            Section.Append(CurrentLine.Chars(i))
>                        End If
>                    Case Else
>                        Section.Append(CurrentLine.Chars(i))
>                End Select
>            Next
>            ' Get the last field - as most files will not have an
> ending delimiter
>            If CountDelimiter Then
>                ' Add current section to collection
>                SplitString(Total) = Section.ToString
>            End If
>            lbResult = True
>        Catch ex As Exception
>            ps_LastErrSource = ex.Source
>            ps_LastErrDesc = ex.ToString
>            lbResult = False
>            Dim loSB As New StringBuilder(ps_LastErrDesc)
>            UpdateLog(loSB)
>        End Try
>        Return lbResult
>    End Function
>
> This function is stored in a class and is called from other function
> within this class.
> Thanks
> Brian
>
Author
3 Apr 2006 6:59 PM
David Browne
Show quote Hide quote
"klineb" <briankl***@hotmail.com> wrote in message
news:1144088149.283259.318880@g10g2000cwb.googlegroups.com...
> Good Day,
>
> I have written and utility to convert our DOS COBOL data files to a SQL
> Server database.  Part of the process requires parsing each line into a
> sql statement and validting the data to keep the integrity of the
> database.  We are parsing roughl 81 files and range in size 1 kb to 65
> MB files (Average of 400,000 lines in the larger files).
>
> I have written this utility with VB.NET 2003 and when I parse all of
> the files I run out of memory.  The following functions seems to be the
> main source of my leak.  Any help optimizing this code is appreciated.
>

There are a few tweaks you can apply here to reduce memory utilization (in
particular you can perhaps reuse the StringBuilders).  However there's
nothing obvious and terrible here.  In short this function should not cause
an out of memory error.

David
Author
3 Apr 2006 8:10 PM
tomb
klineb wrote:

Show quoteHide quote
>Good Day,
>
>I have written and utility to convert our DOS COBOL data files to a SQL
>Server database.  Part of the process requires parsing each line into a
>sql statement and validting the data to keep the integrity of the
>database.  We are parsing roughl 81 files and range in size 1 kb to 65
>MB files (Average of 400,000 lines in the larger files).
>
>I have written this utility with VB.NET 2003 and when I parse all of
>the files I run out of memory.  The following functions seems to be the
>main source of my leak.  Any help optimizing this code is appreciated.
>
>Public Function SplitDelimitedLine(ByVal CurrentLine As StringBuilder,
>_
>            ByRef SplitString() As String) As Boolean
>        '7-25-2005
>        'BJK
>        '
>        '--removed the use of Char replaced with: CurrentLine.Chars(i)
>        '
>        '---------------------------------------
>        Dim i As Integer
>        Dim CountDelimiter As Boolean
>        Dim Total As Integer
>        Dim lbResult As Boolean
>        Dim Section As New StringBuilder
>        Dim liLen As Integer
>        Dim liCommaPos As Integer
>        Dim liDQuotePos As Integer
>
>        Try
>            'We want to count the delimiter unless it is within the
>text qualifier
>            CountDelimiter = True
>            Total = 0
>            liLen = CurrentLine.Length - 1
>            For i = 0 To liLen
>                Select Case CurrentLine.Chars(i)
>                    Case gsDoubleQoute
>                        If CountDelimiter Then
>                            CountDelimiter = False
>                        Else
>                            CountDelimiter = True
>                        End If
>                    Case gsComma
>                        If CountDelimiter Then
>                            ' Add current section to collection
>                            SplitString(Total) = Section.ToString.Trim
>                            Section = Nothing
>                            Section = New StringBuilder
>                            Total = Total + 1
>                        Else
>                            Section.Append(CurrentLine.Chars(i))
>                        End If
>                    Case Else
>                        Section.Append(CurrentLine.Chars(i))
>                End Select
>            Next
>            ' Get the last field - as most files will not have an
>ending delimiter
>            If CountDelimiter Then
>                ' Add current section to collection
>                SplitString(Total) = Section.ToString
>            End If
>            lbResult = True
>        Catch ex As Exception
>            ps_LastErrSource = ex.Source
>            ps_LastErrDesc = ex.ToString
>            lbResult = False
>            Dim loSB As New StringBuilder(ps_LastErrDesc)
>            UpdateLog(loSB)
>        End Try
>        Return lbResult
>    End Function
>
>This function is stored in a class and is called from other function
>within this class.
>Thanks
>Brian
>

>
It looks to me like your field separator is always a comma.
So why not just use SplitString = split(CurrentLine,","), then use
Replace on each string in the resultant array if you want to get rid of
the quotes.  I think this will be much faster than going char by char.
The parameter SplitString will have to be declared as  byRef SplitString
as Array, rather than SplitString() as String.

Tom
Author
3 Apr 2006 8:17 PM
Marina Levit [MVP]
Actually, arrays are reference types. So modifying the contents of the
array, will work just fine as far as filling it. It doesn't need to be
ByRef.

Show quoteHide quote
"tomb" <t***@technetcenter.com> wrote in message
news:8hfYf.32917$67.29464@bignews6.bellsouth.net...
> klineb wrote:
>
>>Good Day,
>>
>>I have written and utility to convert our DOS COBOL data files to a SQL
>>Server database.  Part of the process requires parsing each line into a
>>sql statement and validting the data to keep the integrity of the
>>database.  We are parsing roughl 81 files and range in size 1 kb to 65
>>MB files (Average of 400,000 lines in the larger files).
>>
>>I have written this utility with VB.NET 2003 and when I parse all of
>>the files I run out of memory.  The following functions seems to be the
>>main source of my leak.  Any help optimizing this code is appreciated.
>>
>>Public Function SplitDelimitedLine(ByVal CurrentLine As StringBuilder,
>>_
>>            ByRef SplitString() As String) As Boolean
>>        '7-25-2005
>>        'BJK
>>        '
>>        '--removed the use of Char replaced with: CurrentLine.Chars(i)
>>        '
>>        '---------------------------------------
>>        Dim i As Integer
>>        Dim CountDelimiter As Boolean
>>        Dim Total As Integer
>>        Dim lbResult As Boolean
>>        Dim Section As New StringBuilder
>>        Dim liLen As Integer
>>        Dim liCommaPos As Integer
>>        Dim liDQuotePos As Integer
>>
>>        Try
>>            'We want to count the delimiter unless it is within the
>>text qualifier
>>            CountDelimiter = True
>>            Total = 0
>>            liLen = CurrentLine.Length - 1
>>            For i = 0 To liLen
>>                Select Case CurrentLine.Chars(i)
>>                    Case gsDoubleQoute
>>                        If CountDelimiter Then
>>                            CountDelimiter = False
>>                        Else
>>                            CountDelimiter = True
>>                        End If
>>                    Case gsComma
>>                        If CountDelimiter Then
>>                            ' Add current section to collection
>>                            SplitString(Total) = Section.ToString.Trim
>>                            Section = Nothing
>>                            Section = New StringBuilder
>>                            Total = Total + 1
>>                        Else
>>                            Section.Append(CurrentLine.Chars(i))
>>                        End If
>>                    Case Else
>>                        Section.Append(CurrentLine.Chars(i))
>>                End Select
>>            Next
>>            ' Get the last field - as most files will not have an
>>ending delimiter
>>            If CountDelimiter Then
>>                ' Add current section to collection
>>                SplitString(Total) = Section.ToString
>>            End If
>>            lbResult = True
>>        Catch ex As Exception
>>            ps_LastErrSource = ex.Source
>>            ps_LastErrDesc = ex.ToString
>>            lbResult = False
>>            Dim loSB As New StringBuilder(ps_LastErrDesc)
>>            UpdateLog(loSB)
>>        End Try
>>        Return lbResult
>>    End Function
>>
>>This function is stored in a class and is called from other function
>>within this class.
>>Thanks
>>Brian
>>
>>
> It looks to me like your field separator is always a comma. So why not
> just use SplitString = split(CurrentLine,","), then use Replace on each
> string in the resultant array if you want to get rid of the quotes.  I
> think this will be much faster than going char by char.
> The parameter SplitString will have to be declared as  byRef SplitString
> as Array, rather than SplitString() as String.
>
> Tom
Author
4 Apr 2006 1:52 AM
klineb
Tom,

Some of the fields are comment field that contain comma.

Ex.   1,2,"This is some, sample text",19.90,

Using SplitSting will not handle this.
Author
4 Apr 2006 4:07 AM
Stephany Young
Try this:

  Public Function SplitDelimitedLine(ByVal CurrentLine As String) As
String()

    Try
      Dim _wl As String = String.Empty  ' work string
      ' Create a local copy of CurrentLine
      ' We don't really need to but I have for the sake of clarity
      ' The 3 Trim's peel of any extraneous whitespace then any leading
and/or trailing commas and then any extraneous whitespace thet might have
been any leading and/or trailing commas that might have been present
      Dim _cl As String = CurrentLine.Trim.Trim(","c).Trim
      ' Find the first " character
      Dim _pos As Integer = _cl.IndexOf(""""c)
      ' If _pos = -1 then there weren't any
      ' Loop until there are no more " characters
      While _pos > -1
        ' Append every thing before the first " character to the work string
        _wl &= _cl.Substring(0, _pos)

        ' Remove every thing before the first " character from the local
copy
        _cl = _cl.Remove(0, _pos)
        ' Find the next " character
        ' Note that we start the find from the 2nd position because we know
that there is a " character in position 1 (index 0)
        _pos = _cl.IndexOf(""""c, 1)
        If _pos > -1 Then
           ' If we find one then we append every thing from the first " to
the 2nd " inclusive to the work string, replacing any commas in the
substring with a tilde and remove the same number of characters from the
local copy
          _wl &= _cl.Substring(0, _pos + 1).Replace(","c, "~"c)
          _cl = _cl.Remove(0, _pos + 1)

          ' Find the next " character
          ' Note that we are now back to finding from the beginning of what
is left of the local copy
          _pos = _cl.IndexOf(""""c)
        End If
      End While
      ' There are no moe " characters so append the remainder of the local
copy to the work string
      _wl &= _cl
      ' Split the work string using the comma as the delimiter
      Dim _ss As String() = _wl.Replace("""", String.Empty).Split(","c)
      ' Work through the elements of the array
      For _i As Integer = 0 To _ss.Length - 1
        If the element contains any tilde characters then replace them with
commas
        If _ss(_i).IndexOf("~"c) > -1 Then _ss(_i) = _ss(_i).Replace("~"c,
","c)
       'Trim any whitespaces from the element
        _ss(_i) = _ss(_i).Trim
      Next
      ' Return the string array
      Return _ss
    Catch
      ' We hit a problem of some description so return Nothing (null)
      Return Nothing
    End Try

  End Function

  Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As
System.EventArgs) Handles Button1.Click

    ' Measure the time for 1 million calls to the function

    Dim _start As DateTime = DateTime.Now

    For _i As Integer = 1 To 1000000
      Dim _ss() As String = SplitDelimitedLine("1,2,""This is some, sample
text"",19.90,")
    Next

    Console.WriteLine(DateTime.Now.Subtract(_start).TotalSeconds)

  End Sub

On my machine it takes 3.281166 seconds and I do not see any significant
impact on the memory resources.


Show quoteHide quote
"klineb" <briankl***@hotmail.com> wrote in message
news:1144115553.461159.67610@i39g2000cwa.googlegroups.com...
> Tom,
>
> Some of the fields are comment field that contain comma.
>
> Ex.   1,2,"This is some, sample text",19.90,
>
> Using SplitSting will not handle this.
>
Author
4 Apr 2006 5:02 AM
Branco Medeiros
klineb wrote:
<snip>
> I have written this utility with VB.NET 2003 and when I parse all of
> the files I run out of memory.  The following functions seems to be the
> main source of my leak.  Any help optimizing this code is appreciated.
>
> Public Function SplitDelimitedLine(ByVal CurrentLine As StringBuilder,
> _
>             ByRef SplitString() As String) As Boolean
<snip, snip, snip>

I don't know if this is of any help, but you really don't need a
StringBuilder to "capture" the String slices. Whenever you find the
delimiter, you just need to know where the slice begins.

Disclaimer:  I *didn't* test the code bellow

<AirCode>
Function SplitDelimitedLine( _
  CurrentLine As StringBuilder, _
  SplitString() As String _
) As Boolean

Dim Text As String = CurrentLine.ToString
Dim Max As Integer = Text.Length - 1
Dim SliceStart As Integer
DIm Total As Integer

For Index As Integer = 0 To Max

  Dim IgnoreComma As Boolean

  Select Case Text(Index)
  Case gsDoubleQuote
    IgnoreComma = Not IgnoreComma
  Case gsComma
    If Not IgnoreComma Then
      Dim Count As Integer = Index - SliceStart
      SplitString(Total) = Text.Substring(SliceStart, Count).Trim
      SliceStart = Index + 1
      Total += 1
    End If
  End Select
Next

If SliceStart <= Max Then
  Dim Count As Integer = Max - SliceStart + 1
  SplitString(Total) = Text.Substring(SliceStart, Count).Trim
End If

Return True
</AirCode>

Regards,

Branco
Author
4 Apr 2006 7:32 AM
Cor Ligthert [MVP]
Brian,

I cannot imagen that this routine takes much memory. And if it would do,
what so ever.

It is a one time operation. The last thing you would think about is in my
opinion the amount of memory you use. Even if you have not enough than add
it. As I assume that you live in an North Atlantic Country, than one hour
thinking of the problem will cost probably more than 1Gb.

So you can only become in problem as you go over the 800Mb.

Just my thought,

Cor
Author
13 Apr 2006 5:06 AM
klineb
Thank you everybody for your suggestions.

This really is not a one time issue. We still have 60 DOS clients to
convert.  All of the clients are running different hardware so buying
memory is not really an option.  I will try all of you suggestions and
post my results back.

Does anyone know if running some of the class in a seperate process
would help any?

Thanks again for all of the feedback.

Brian