|
web
newsgroups
|
|||||||||||||||||||||||
|
|||||||||||||||||||||||
Suggestions to reduce memory use when splitting a stringI have written and utility to convert our DOS COBOL data files to a SQL Server database. Part of the process requires parsing each line into a sql statement and validting the data to keep the integrity of the database. We are parsing roughl 81 files and range in size 1 kb to 65 MB files (Average of 400,000 lines in the larger files). I have written this utility with VB.NET 2003 and when I parse all of the files I run out of memory. The following functions seems to be the main source of my leak. Any help optimizing this code is appreciated. Public Function SplitDelimitedLine(ByVal CurrentLine As StringBuilder, _ ByRef SplitString() As String) As Boolean '7-25-2005 'BJK ' '--removed the use of Char replaced with: CurrentLine.Chars(i) ' '--------------------------------------- Dim i As Integer Dim CountDelimiter As Boolean Dim Total As Integer Dim lbResult As Boolean Dim Section As New StringBuilder Dim liLen As Integer Dim liCommaPos As Integer Dim liDQuotePos As Integer Try 'We want to count the delimiter unless it is within the text qualifier CountDelimiter = True Total = 0 liLen = CurrentLine.Length - 1 For i = 0 To liLen Select Case CurrentLine.Chars(i) Case gsDoubleQoute If CountDelimiter Then CountDelimiter = False Else CountDelimiter = True End If Case gsComma If CountDelimiter Then ' Add current section to collection SplitString(Total) = Section.ToString.Trim Section = Nothing Section = New StringBuilder Total = Total + 1 Else Section.Append(CurrentLine.Chars(i)) End If Case Else Section.Append(CurrentLine.Chars(i)) End Select Next ' Get the last field - as most files will not have an ending delimiter If CountDelimiter Then ' Add current section to collection SplitString(Total) = Section.ToString End If lbResult = True Catch ex As Exception ps_LastErrSource = ex.Source ps_LastErrDesc = ex.ToString lbResult = False Dim loSB As New StringBuilder(ps_LastErrDesc) UpdateLog(loSB) End Try Return lbResult End Function This function is stored in a class and is called from other function within this class. Thanks Brian hi Brian,
I would do that using a buffer. You open a StreamWriter and when the processed text (which you can store in a StringBuilder (sb) ) reaches a given size, for instance 2MB (or whatever you like, the sb holds more than 2GB), you can flush it to disk an start again with an empty buffer. To empty a sb it's sufficient to set the length to 0. When input ends, you flush to disk what remains in the sb. Remember to close the StreamWriter. -t klineb ha scritto: Show quoteHide quote > Good Day, > > I have written and utility to convert our DOS COBOL data files to a SQL > Server database. Part of the process requires parsing each line into a > sql statement and validting the data to keep the integrity of the > database. We are parsing roughl 81 files and range in size 1 kb to 65 > MB files (Average of 400,000 lines in the larger files). > > I have written this utility with VB.NET 2003 and when I parse all of > the files I run out of memory. The following functions seems to be the > main source of my leak. Any help optimizing this code is appreciated. > > Public Function SplitDelimitedLine(ByVal CurrentLine As StringBuilder, > _ > ByRef SplitString() As String) As Boolean > '7-25-2005 > 'BJK > ' > '--removed the use of Char replaced with: CurrentLine.Chars(i) > ' > '--------------------------------------- > Dim i As Integer > Dim CountDelimiter As Boolean > Dim Total As Integer > Dim lbResult As Boolean > Dim Section As New StringBuilder > Dim liLen As Integer > Dim liCommaPos As Integer > Dim liDQuotePos As Integer > > Try > 'We want to count the delimiter unless it is within the > text qualifier > CountDelimiter = True > Total = 0 > liLen = CurrentLine.Length - 1 > For i = 0 To liLen > Select Case CurrentLine.Chars(i) > Case gsDoubleQoute > If CountDelimiter Then > CountDelimiter = False > Else > CountDelimiter = True > End If > Case gsComma > If CountDelimiter Then > ' Add current section to collection > SplitString(Total) = Section.ToString.Trim > Section = Nothing > Section = New StringBuilder > Total = Total + 1 > Else > Section.Append(CurrentLine.Chars(i)) > End If > Case Else > Section.Append(CurrentLine.Chars(i)) > End Select > Next > ' Get the last field - as most files will not have an > ending delimiter > If CountDelimiter Then > ' Add current section to collection > SplitString(Total) = Section.ToString > End If > lbResult = True > Catch ex As Exception > ps_LastErrSource = ex.Source > ps_LastErrDesc = ex.ToString > lbResult = False > Dim loSB As New StringBuilder(ps_LastErrDesc) > UpdateLog(loSB) > End Try > Return lbResult > End Function > > This function is stored in a class and is called from other function > within this class. > Thanks > Brian Have you thought about regular expressions to parse the fields out of each
line? Going character by character seems really inefficient. At the very last, use functions like IndexOf, to find your delimiters, and parse out the pieces that way. Show quoteHide quote "klineb" <briankl***@hotmail.com> wrote in message news:1144088149.283259.318880@g10g2000cwb.googlegroups.com... > Good Day, > > I have written and utility to convert our DOS COBOL data files to a SQL > Server database. Part of the process requires parsing each line into a > sql statement and validting the data to keep the integrity of the > database. We are parsing roughl 81 files and range in size 1 kb to 65 > MB files (Average of 400,000 lines in the larger files). > > I have written this utility with VB.NET 2003 and when I parse all of > the files I run out of memory. The following functions seems to be the > main source of my leak. Any help optimizing this code is appreciated. > > Public Function SplitDelimitedLine(ByVal CurrentLine As StringBuilder, > _ > ByRef SplitString() As String) As Boolean > '7-25-2005 > 'BJK > ' > '--removed the use of Char replaced with: CurrentLine.Chars(i) > ' > '--------------------------------------- > Dim i As Integer > Dim CountDelimiter As Boolean > Dim Total As Integer > Dim lbResult As Boolean > Dim Section As New StringBuilder > Dim liLen As Integer > Dim liCommaPos As Integer > Dim liDQuotePos As Integer > > Try > 'We want to count the delimiter unless it is within the > text qualifier > CountDelimiter = True > Total = 0 > liLen = CurrentLine.Length - 1 > For i = 0 To liLen > Select Case CurrentLine.Chars(i) > Case gsDoubleQoute > If CountDelimiter Then > CountDelimiter = False > Else > CountDelimiter = True > End If > Case gsComma > If CountDelimiter Then > ' Add current section to collection > SplitString(Total) = Section.ToString.Trim > Section = Nothing > Section = New StringBuilder > Total = Total + 1 > Else > Section.Append(CurrentLine.Chars(i)) > End If > Case Else > Section.Append(CurrentLine.Chars(i)) > End Select > Next > ' Get the last field - as most files will not have an > ending delimiter > If CountDelimiter Then > ' Add current section to collection > SplitString(Total) = Section.ToString > End If > lbResult = True > Catch ex As Exception > ps_LastErrSource = ex.Source > ps_LastErrDesc = ex.ToString > lbResult = False > Dim loSB As New StringBuilder(ps_LastErrDesc) > UpdateLog(loSB) > End Try > Return lbResult > End Function > > This function is stored in a class and is called from other function > within this class. > Thanks > Brian >
Show quote
Hide quote
"klineb" <briankl***@hotmail.com> wrote in message There are a few tweaks you can apply here to reduce memory utilization (in news:1144088149.283259.318880@g10g2000cwb.googlegroups.com... > Good Day, > > I have written and utility to convert our DOS COBOL data files to a SQL > Server database. Part of the process requires parsing each line into a > sql statement and validting the data to keep the integrity of the > database. We are parsing roughl 81 files and range in size 1 kb to 65 > MB files (Average of 400,000 lines in the larger files). > > I have written this utility with VB.NET 2003 and when I parse all of > the files I run out of memory. The following functions seems to be the > main source of my leak. Any help optimizing this code is appreciated. > particular you can perhaps reuse the StringBuilders). However there's nothing obvious and terrible here. In short this function should not cause an out of memory error. David klineb wrote:
Show quoteHide quote >Good Day, It looks to me like your field separator is always a comma. > >I have written and utility to convert our DOS COBOL data files to a SQL >Server database. Part of the process requires parsing each line into a >sql statement and validting the data to keep the integrity of the >database. We are parsing roughl 81 files and range in size 1 kb to 65 >MB files (Average of 400,000 lines in the larger files). > >I have written this utility with VB.NET 2003 and when I parse all of >the files I run out of memory. The following functions seems to be the >main source of my leak. Any help optimizing this code is appreciated. > >Public Function SplitDelimitedLine(ByVal CurrentLine As StringBuilder, >_ > ByRef SplitString() As String) As Boolean > '7-25-2005 > 'BJK > ' > '--removed the use of Char replaced with: CurrentLine.Chars(i) > ' > '--------------------------------------- > Dim i As Integer > Dim CountDelimiter As Boolean > Dim Total As Integer > Dim lbResult As Boolean > Dim Section As New StringBuilder > Dim liLen As Integer > Dim liCommaPos As Integer > Dim liDQuotePos As Integer > > Try > 'We want to count the delimiter unless it is within the >text qualifier > CountDelimiter = True > Total = 0 > liLen = CurrentLine.Length - 1 > For i = 0 To liLen > Select Case CurrentLine.Chars(i) > Case gsDoubleQoute > If CountDelimiter Then > CountDelimiter = False > Else > CountDelimiter = True > End If > Case gsComma > If CountDelimiter Then > ' Add current section to collection > SplitString(Total) = Section.ToString.Trim > Section = Nothing > Section = New StringBuilder > Total = Total + 1 > Else > Section.Append(CurrentLine.Chars(i)) > End If > Case Else > Section.Append(CurrentLine.Chars(i)) > End Select > Next > ' Get the last field - as most files will not have an >ending delimiter > If CountDelimiter Then > ' Add current section to collection > SplitString(Total) = Section.ToString > End If > lbResult = True > Catch ex As Exception > ps_LastErrSource = ex.Source > ps_LastErrDesc = ex.ToString > lbResult = False > Dim loSB As New StringBuilder(ps_LastErrDesc) > UpdateLog(loSB) > End Try > Return lbResult > End Function > >This function is stored in a class and is called from other function >within this class. >Thanks >Brian > > > So why not just use SplitString = split(CurrentLine,","), then use Replace on each string in the resultant array if you want to get rid of the quotes. I think this will be much faster than going char by char. The parameter SplitString will have to be declared as byRef SplitString as Array, rather than SplitString() as String. Tom Actually, arrays are reference types. So modifying the contents of the
array, will work just fine as far as filling it. It doesn't need to be ByRef. Show quoteHide quote "tomb" <t***@technetcenter.com> wrote in message news:8hfYf.32917$67.29464@bignews6.bellsouth.net... > klineb wrote: > >>Good Day, >> >>I have written and utility to convert our DOS COBOL data files to a SQL >>Server database. Part of the process requires parsing each line into a >>sql statement and validting the data to keep the integrity of the >>database. We are parsing roughl 81 files and range in size 1 kb to 65 >>MB files (Average of 400,000 lines in the larger files). >> >>I have written this utility with VB.NET 2003 and when I parse all of >>the files I run out of memory. The following functions seems to be the >>main source of my leak. Any help optimizing this code is appreciated. >> >>Public Function SplitDelimitedLine(ByVal CurrentLine As StringBuilder, >>_ >> ByRef SplitString() As String) As Boolean >> '7-25-2005 >> 'BJK >> ' >> '--removed the use of Char replaced with: CurrentLine.Chars(i) >> ' >> '--------------------------------------- >> Dim i As Integer >> Dim CountDelimiter As Boolean >> Dim Total As Integer >> Dim lbResult As Boolean >> Dim Section As New StringBuilder >> Dim liLen As Integer >> Dim liCommaPos As Integer >> Dim liDQuotePos As Integer >> >> Try >> 'We want to count the delimiter unless it is within the >>text qualifier >> CountDelimiter = True >> Total = 0 >> liLen = CurrentLine.Length - 1 >> For i = 0 To liLen >> Select Case CurrentLine.Chars(i) >> Case gsDoubleQoute >> If CountDelimiter Then >> CountDelimiter = False >> Else >> CountDelimiter = True >> End If >> Case gsComma >> If CountDelimiter Then >> ' Add current section to collection >> SplitString(Total) = Section.ToString.Trim >> Section = Nothing >> Section = New StringBuilder >> Total = Total + 1 >> Else >> Section.Append(CurrentLine.Chars(i)) >> End If >> Case Else >> Section.Append(CurrentLine.Chars(i)) >> End Select >> Next >> ' Get the last field - as most files will not have an >>ending delimiter >> If CountDelimiter Then >> ' Add current section to collection >> SplitString(Total) = Section.ToString >> End If >> lbResult = True >> Catch ex As Exception >> ps_LastErrSource = ex.Source >> ps_LastErrDesc = ex.ToString >> lbResult = False >> Dim loSB As New StringBuilder(ps_LastErrDesc) >> UpdateLog(loSB) >> End Try >> Return lbResult >> End Function >> >>This function is stored in a class and is called from other function >>within this class. >>Thanks >>Brian >> >> > It looks to me like your field separator is always a comma. So why not > just use SplitString = split(CurrentLine,","), then use Replace on each > string in the resultant array if you want to get rid of the quotes. I > think this will be much faster than going char by char. > The parameter SplitString will have to be declared as byRef SplitString > as Array, rather than SplitString() as String. > > Tom Tom,
Some of the fields are comment field that contain comma. Ex. 1,2,"This is some, sample text",19.90, Using SplitSting will not handle this. Try this:
Public Function SplitDelimitedLine(ByVal CurrentLine As String) As String() Try Dim _wl As String = String.Empty ' work string ' Create a local copy of CurrentLine ' We don't really need to but I have for the sake of clarity ' The 3 Trim's peel of any extraneous whitespace then any leading and/or trailing commas and then any extraneous whitespace thet might have been any leading and/or trailing commas that might have been present Dim _cl As String = CurrentLine.Trim.Trim(","c).Trim ' Find the first " character Dim _pos As Integer = _cl.IndexOf(""""c) ' If _pos = -1 then there weren't any ' Loop until there are no more " characters While _pos > -1 ' Append every thing before the first " character to the work string _wl &= _cl.Substring(0, _pos) ' Remove every thing before the first " character from the local copy _cl = _cl.Remove(0, _pos) ' Find the next " character ' Note that we start the find from the 2nd position because we know that there is a " character in position 1 (index 0) _pos = _cl.IndexOf(""""c, 1) If _pos > -1 Then ' If we find one then we append every thing from the first " to the 2nd " inclusive to the work string, replacing any commas in the substring with a tilde and remove the same number of characters from the local copy _wl &= _cl.Substring(0, _pos + 1).Replace(","c, "~"c) _cl = _cl.Remove(0, _pos + 1) ' Find the next " character ' Note that we are now back to finding from the beginning of what is left of the local copy _pos = _cl.IndexOf(""""c) End If End While ' There are no moe " characters so append the remainder of the local copy to the work string _wl &= _cl ' Split the work string using the comma as the delimiter Dim _ss As String() = _wl.Replace("""", String.Empty).Split(","c) ' Work through the elements of the array For _i As Integer = 0 To _ss.Length - 1 If the element contains any tilde characters then replace them with commas If _ss(_i).IndexOf("~"c) > -1 Then _ss(_i) = _ss(_i).Replace("~"c, ","c) 'Trim any whitespaces from the element _ss(_i) = _ss(_i).Trim Next ' Return the string array Return _ss Catch ' We hit a problem of some description so return Nothing (null) Return Nothing End Try End Function Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click ' Measure the time for 1 million calls to the function Dim _start As DateTime = DateTime.Now For _i As Integer = 1 To 1000000 Dim _ss() As String = SplitDelimitedLine("1,2,""This is some, sample text"",19.90,") Next Console.WriteLine(DateTime.Now.Subtract(_start).TotalSeconds) End Sub On my machine it takes 3.281166 seconds and I do not see any significant impact on the memory resources. Show quoteHide quote "klineb" <briankl***@hotmail.com> wrote in message news:1144115553.461159.67610@i39g2000cwa.googlegroups.com... > Tom, > > Some of the fields are comment field that contain comma. > > Ex. 1,2,"This is some, sample text",19.90, > > Using SplitSting will not handle this. > klineb wrote:
<snip> > I have written this utility with VB.NET 2003 and when I parse all of <snip, snip, snip>> the files I run out of memory. The following functions seems to be the > main source of my leak. Any help optimizing this code is appreciated. > > Public Function SplitDelimitedLine(ByVal CurrentLine As StringBuilder, > _ > ByRef SplitString() As String) As Boolean I don't know if this is of any help, but you really don't need a StringBuilder to "capture" the String slices. Whenever you find the delimiter, you just need to know where the slice begins. Disclaimer: I *didn't* test the code bellow <AirCode> Function SplitDelimitedLine( _ CurrentLine As StringBuilder, _ SplitString() As String _ ) As Boolean Dim Text As String = CurrentLine.ToString Dim Max As Integer = Text.Length - 1 Dim SliceStart As Integer DIm Total As Integer For Index As Integer = 0 To Max Dim IgnoreComma As Boolean Select Case Text(Index) Case gsDoubleQuote IgnoreComma = Not IgnoreComma Case gsComma If Not IgnoreComma Then Dim Count As Integer = Index - SliceStart SplitString(Total) = Text.Substring(SliceStart, Count).Trim SliceStart = Index + 1 Total += 1 End If End Select Next If SliceStart <= Max Then Dim Count As Integer = Max - SliceStart + 1 SplitString(Total) = Text.Substring(SliceStart, Count).Trim End If Return True </AirCode> Regards, Branco Brian,
I cannot imagen that this routine takes much memory. And if it would do, what so ever. It is a one time operation. The last thing you would think about is in my opinion the amount of memory you use. Even if you have not enough than add it. As I assume that you live in an North Atlantic Country, than one hour thinking of the problem will cost probably more than 1Gb. So you can only become in problem as you go over the 800Mb. Just my thought, Cor Thank you everybody for your suggestions.
This really is not a one time issue. We still have 60 DOS clients to convert. All of the clients are running different hardware so buying memory is not really an option. I will try all of you suggestions and post my results back. Does anyone know if running some of the class in a seperate process would help any? Thanks again for all of the feedback. Brian
How do you use an unbound DataAdapter?
Easy One: Bind a Text Box to a an Integer Variable Dummy question Can't get rid of references WMI script to get Win32_Product Where to do the Loading tasks for application ? Getting DataGrid row X, Y position and Changing Column Width (2003) Using PLink interactively as a Telnet process with VB.Net 2003 Reflection Question and MDI question Form Focus |
|||||||||||||||||||||||