IO function in Vb.Net slower than in Vb6.0

Hi,
I'm re-writing a VB6 app in VB.NET. It reads a text file one line at a time using StreamReader, parses each line using the Substring and Trim functions, and writes the parsed string to an output text file using StreamWriter. While testing I've noticed that this is 15 secs slower than the VB6 app. I wonder why it is slow. Can someone give me some pointers? Thanks. Appreciate your time.

"hillcountry74" <shruth***@yahoo.com> wrote:
> I'm re-writing a VB6 app in Vb.Net. This basically reads a text file
> using streamreader one line at a time, parses the string using
> substring, trim functions and writes the parsed string to an output
> text file using streamwriter. I've noticed while testing that this is
> 15 secs slower than the VB6 app. Wonder why it is slow. Can someone
> give me some pointers?

VB.NET applications are stored in IL (Intermediate Language) instead of native code. At runtime, the CLR's JIT compiler converts the methods' IL to native code. This process takes some time and can influence the runtime of your application. However, I think that there might be a different reason for the performance differences. Could you post the VB6 code and the corresponding VB.NET version of this code?

--
Herfried K. Wagner, MVP
<URL:http://dotnet.mvps.org/>
<URL:http://classicvb.org/petition/>

Thanks for your response.
Here is the code:

    Public Overrides Sub PreferredInputProcessing()
        Dim InputFileReader As StreamReader
        Dim UndefinedBenefitsFileWriter As StreamWriter
        Dim DataBlock As MHNet.ApplicationBlocks.Data.SqlHelper
        Dim ds As New DataSet()
        Dim dr As SqlClient.SqlDataReader
        Dim sSql As New StringBuilder()
        Dim iNewOptionCount As Integer
        Dim sPreferredInputFile As System.String
        Dim iRecordsProcessed As System.Int32

        'Open input file
        InputFileReader = New StreamReader(_sInputFileLocation)

        'Open output file
        OpenOutputFiles()

        'Create UndefinedBenefits file
        UndefinedBenefitsFileWriter = New StreamWriter("C:\EDIFiles\" & _sRateCode & "UndefinedBenefits.csv")

        'Zero out the number of records processed.
        ProcessReport.RecordsProcessed = 0

        'Set validation properties
        _bValidateEnrollType = True
        _bValidateHeadofHouse = True

        'The Celanese benefits don't have a value for PrimaryStatus in the elig file,
        'but are actually = Primary. So, clsMain defaults it to primary status.
        Select Case _sRateCode
            Case "CELANESEMBH", "MMSI", "GLOBALHEALTH"
                _bValidatePrimaryStatus = False
            Case Else
                _bValidatePrimaryStatus = True
        End Select
        _bValidateMaritalStatus = False

        Do While InputFileReader.Peek > -1
            InitializeInputVariables()
            sPreferredInputFile = Nothing
            sPreferredInputFile = InputFileReader.ReadLine

            '/ skip blank lines
            If sPreferredInputFile.Trim <> "" AndAlso sPreferredInputFile.Trim("?"c) <> "" AndAlso sPreferredInputFile.Trim.Length >= 439 Then
                iRecordsProcessed = iRecordsProcessed + 1

                '/ update display every 100 records
                If iRecordsProcessed Mod 100 = 0 Then
                    Status = "Records processed: " & iRecordsProcessed
                    RaiseEvent ProcessStatus(Me, New System.EventArgs())
                End If

                'Set the input properties by extracting specific
                'values from the input record.
                _sInActionCode = Trim(Mid(sPreferredInputFile, 1, 1))
                _sInCarrierMemId = Trim(Mid(sPreferredInputFile, 2, 25))
                _sInLastName = Trim(Mid(sPreferredInputFile, 27, 60))
                _sInFirstName = Trim(Mid(sPreferredInputFile, 87, 30))
                _sInMiddleName = Trim(Mid(sPreferredInputFile, 117, 15))
                _sInAddr1 = Trim(Mid(sPreferredInputFile, 132, 60))
                _sInAddr2 = Trim(Mid(sPreferredInputFile, 192, 60))
                _sInCity = Trim(Mid(sPreferredInputFile, 252, 30))
                _sInState = Trim(Mid(sPreferredInputFile, 282, 2))
                _sInZip = Trim(Mid(sPreferredInputFile, 284, 10))
                _sInBenefitOption = Trim(Mid(sPreferredInputFile, 294, 60))
                _sInEmployerGroup = Trim(Mid(sPreferredInputFile, 354, 15))
                _sInOptionEffDate = Trim(Mid(sPreferredInputFile, 369, 8))
                _sInHPEffDate = Trim(Mid(sPreferredInputFile, 377, 8))
                _sInTermDate = Trim(Mid(sPreferredInputFile, 385, 8))
                If _sInTermDate = "" Or Not IsDateValid(AddDateDashes(_sInTermDate)) Then
                    _sInTermDate = _sMagicTermDate
                End If

                'TERMING PLANS to set date to manual date
                Select Case _sRateCode
                    Case "MMSI"
                        If _sInTermDate > "20041231" Then
                            _sInTermDate = "20041231"
                        End If
                End Select

                _sInSex = sPreferredInputFile.Substring(392, 1).Trim
                Dim sTmp As System.String
                sTmp = Trim(Mid(sPreferredInputFile, 394, 8))
                If sTmp <> "" Then
                    _sInDOB = Trim(Mid(sTmp, 1, 4)) & "-" & Trim(Mid(sTmp, 5, 2)) & _
                        "-" & Trim(Mid(sTmp, 7, 2))
                End If
                _sInSSN = Trim(Mid(sPreferredInputFile, 402, 9))
                _sInPhone = Trim(Mid(sPreferredInputFile, 411, 12))
                If _sInPhone.Length = 12 Then
                    _sInPhone = Trim(Mid(_sInPhone, 1, 3)) & Trim(Mid(_sInPhone, 5, 3)) & Trim(Mid(_sInPhone, 9, 4))
                End If
                sTmp = sPreferredInputFile.Substring(422, 8).Trim
                If sTmp <> "" Then
                    _sInEmployerGroupAnivDate = Trim(Mid(sTmp, 1, 4)) & _
                        "-" & Trim(Mid(sTmp, 5, 2)) & _
                        "-" & Trim(Mid(sTmp, 7, 2))
                End If
                _sInHeadOfHouse = Trim(Mid(sPreferredInputFile, 431, 9))
                If _sInHeadOfHouse = "" Then
                    _sInHeadOfHouse = Trim(Mid(_sInCarrierMemId, 2, 9))
                End If
                _sInPrimaryStatus = Trim(Mid(sPreferredInputFile, 440, 1))
                _sInEnrollType = Trim(Mid(sPreferredInputFile, 441, 1))
                Try
                    _sInMaritalStatus = Trim(Mid(sPreferredInputFile, 442, 1))
                Catch ex As System.ArgumentOutOfRangeException
                    If ex.Message.IndexOf("Index and length must refer to a location within the string") > 0 Then _sInMaritalStatus = ""
                End Try

                'Validate the incoming record.
                Validate()
                If _bValidated Then
                    BuildOutputRecord()
                    WriteOutputRecord()
                    ProcessReport.TotalSuccessfulRecords = ProcessReport.TotalRecordsProcessed - ProcessReport.TotalErrorRecords
                Else
                    WriteOutputErrorRecord()
                    ProcessReport.TotalErrorRecords = ProcessReport.TotalErrorRecords + 1
                End If
            End If 'skip blank lines
        Loop

This is the main parsing routine. Thanks for your help.

"hillcountry74" <shruth***@yahoo.com> wrote:
> Try
>     _sInMaritalStatus = Trim(Mid(sPreferredInputFile, 442, 1))
> Catch ex As System.ArgumentOutOfRangeException
>     If ex.Message.IndexOf("Index and length must refer to a location within the string") > 0 Then _sInMaritalStatus = ""
> End Try

How often is this exception thrown? Instead of catching an exception, make sure that the indices are valid. In addition to that, check the performance of the release version (not the debug version) of the application when it's started outside the IDE.

--
Herfried K. Wagner, MVP
<URL:http://dotnet.mvps.org/>
<URL:http://classicvb.org/petition/>

This exception is thrown maybe 1 out of 1000 times. I've tried to see if this makes a difference by commenting out this piece of code, but there was no difference.

I've compiled in release mode and run the resulting exe. But there is also a .pdb file, which I think is created when I run the app in debug mode.

Is there anything else I'm missing? Do you think the Substring and Trim functions slow it down? Or is the IO the cause?

Thanks for your time.
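One way to narrow down the "string functions or IO?" question is to time each phase of the loop separately. The following is only a minimal sketch under stated assumptions, not the poster's code: it assumes .NET 2.0 or later (for System.Diagnostics.Stopwatch), takes the input and output paths from the command line, and uses ParseLine as a hypothetical stand-in for the real field extraction.

```vbnet
Imports System
Imports System.Diagnostics
Imports System.IO

Module TimingSketch
    ' Hypothetical stand-in for the real parsing: extracts the first two
    ' fixed-width fields (1 char and 25 chars) and joins them with a comma.
    Function ParseLine(ByVal line As String) As String
        If line.Length < 26 Then Return line.Trim()
        Return line.Substring(0, 1).Trim() & "," & line.Substring(1, 25).Trim()
    End Function

    Sub Main(ByVal args As String())
        If args.Length < 2 Then
            Console.WriteLine("usage: TimingSketch <inputFile> <outputFile>")
            Return
        End If

        ' One stopwatch per phase; Start/Stop accumulate elapsed time.
        Dim readTime As New Stopwatch()
        Dim parseTime As New Stopwatch()
        Dim writeTime As New Stopwatch()

        Using reader As New StreamReader(args(0)), writer As New StreamWriter(args(1))
            Do
                readTime.Start()
                Dim line As String = reader.ReadLine()
                readTime.Stop()
                If line Is Nothing Then Exit Do ' end of file

                parseTime.Start()
                Dim parsed As String = ParseLine(line)
                parseTime.Stop()

                writeTime.Start()
                writer.WriteLine(parsed)
                writeTime.Stop()
            Loop
        End Using

        Console.WriteLine("read:  {0} ms", readTime.ElapsedMilliseconds)
        Console.WriteLine("parse: {0} ms", parseTime.ElapsedMilliseconds)
        Console.WriteLine("write: {0} ms", writeTime.ElapsedMilliseconds)
    End Sub
End Module
```

Run against the release build outside the IDE, the three totals should show which phase dominates. The same pattern can be wrapped around individual calls such as Validate to see where the time goes.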
"hillcountry74" <shruth***@yahoo.com> wrote in news:1112055874.828774.191950@g14g2000cwa.googlegroups.com:
> This exception is called maybe 1 out of 1000 times. I've tried to see
> if this makes a difference by commenting out this piece of code , but
> no diff.
>
> I've compiled in release mode and executed the app for the resulting
> exe. But there is also a .pdb file which I think is created when I run
> the app in debug mode.
>
> Is there anything else I'm missing. Do you think the substring,trim
> functions will slow down? Or is it the IO the cause?
>
> Thanks for your time.

I've spent a year or so on vb.net and still consider myself new, but in my opinion it is the many small reads that are slowing you down. Since I don't know the structure of the file (but it sounds like a text file with the records all smashed together) I'll suggest a couple of ways *I THINK* will speed it up.

1) (If the file has delimiters) Read the whole file into a string, and use String.Split() to create an array that you can then map to variables or just write straight out -> outFile.Write(array(elementNumber))
2) Read the whole file into a string, and use RegularExpressions.Regex and RegularExpressions.MatchCollection to break the string into parts and process from there (done right, this should solve the "Trim" problem).
3) If you have control over the file format (which I assume you don't), fix the file format so you can read it in line by line without further processing.
4) Out of ideas :)

Let me know if it helps or you need help with one of the above :)

MP

At face value there does not appear to be anything that is an obvious bottleneck; however, you call a number of methods that you have not described (OpenOutputFiles, InitializeInputVariables, Validate, BuildOutputRecord, WriteOutputRecord, WriteOutputErrorRecord), and it is possible that there is a bottleneck in any of those. In addition, you raise the ProcessStatus event regularly, and it would be prudent to ensure that whatever is handling that event is not blocking the process for an inordinate length of time.

From my point of view the number of calls to Trim could be a factor, and perhaps some of them are redundant. For example, take the line:

    _sInActionCode = Trim(Mid(sPreferredInputFile, 1, 1))

If the first character of a 'record' always contains a non-space character then Trim is redundant. In this case up to 2 new strings are created (remember that strings are immutable), one by Mid and one by Trim, and there is an overhead, albeit small, involved in the creation of each string. Removing the Trim from this line would mean that only 1 new string is created, reducing the overhead accordingly. With the number of string operations in your PreferredInputProcessing method this could be significant.

You might also try modifying the string parsing to the '.NET way', for example:

    _sInActionCode = sPreferredInputFile.Substring(0, 1).Trim

or

    _sInActionCode = sPreferredInputFile.Substring(0, 1)

I do not have any benchmarking data, but it is possible that you might find a performance increase.

Another place where, in my view, there is extraneous overhead is:

    If sPreferredInputFile.Trim <> "" AndAlso sPreferredInputFile.Trim("?"c) <> "" AndAlso sPreferredInputFile.Trim.Length >= 439 Then

Note here that you are using the System.String.Trim method rather than the Microsoft.VisualBasic.Trim function.
The Microsoft.VisualBasic.Trim function returns the source string with leading and trailing space (&H20) characters removed, while the System.String.Trim method returns the source string after white space characters are removed from the beginning and end. Note that there is a difference between 'space' characters and 'white space' characters.

It is unclear what actual character is being specified in the sPreferredInputFile.Trim("?"c) clause, but it is highly likely that it qualifies as 'white space' and is therefore being removed by the first clause.

I would be inclined to code the test like this:

    sPreferredInputFile = sPreferredInputFile.Trim()
    If sPreferredInputFile.Length >= 439 Then

The 3 string operations are now reduced to 1, and the number of comparison operations is also reduced from three to one.

Given the above, you might be able to refine your parsing code and identify further redundancies.

HillCountry,

> I'm re-writing a VB6 app in Vb.Net. This basically reads a text file
> using streamreader one line at a time, parses the string using
> substring, trim functions and writes the parsed string to an output
> text file using streamwriter. I've noticed while testing that this is
> 15 secs slower than the VB6 app. Wonder why it is slow. Can someone
> give me some pointers?

When you want to test this, then you should use comparable code. That means:

    Read inputline
    outputline = inputline
    Write outputline

Since I don't have VB6 installed, I cannot test that. However, it looks strange to me.

Cor

Thanks guys for your suggestions.
MP,
The file does not have delimiters but follows a specific format, hence I used Mid to parse. Can you please give me more info on using regular expressions as a replacement for the Trim function?

Stephany,
The file might or might not contain a valid character in position 1, so I still need to use Trim. I too suspect Trim to be the cause. I read this article on MSDN, http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dv_vstechart/html/vbtchmicrosoftvisualbasicnetinternals.asp, which recommends using Mid instead of Substring. I couldn't see a difference. I will change the If condition as you have suggested and let you know the results.

As for "?", I think it is a Unicode character. Earlier I was not checking for it; in one of the files it appeared after the last record, and when the code tried to do a Substring on it, it threw an exception. I was previously using Substring instead of Mid and changed it after reading the above article.

Guys, more suggestions are really appreciated. I've been stuck on this issue for the past week. Please help!!

Thanks again for your time.

I wanted to add to the above: the parsing routine is in a DLL and is being called from the frontend app, which is a separate project but in the same solution. Could this architecture be a problem?

Guys,
I commented out the call to the Validate method and it was faster by 10 secs. Here is the Validate method code. Please let me know how I can optimize this. (Please note that ValidateLastName, ValidateFirstName, ValidateStateName etc. are all the same.) I'm using the IndexOf method in the DoesBadCharExist method. Is there a better way? Thanks

    Protected Sub Validate()
        '**********************************************************
        ' Validate the current input record
        '**********************************************************
        Dim sTmpHold As System.String

        '/ Set the validated flag to True.
        _bValidated = True

        '/ Initialize the output error record.
        BuildOutputErrorRecord()

        '/ Member ID Validation
        If _sInCarrierMemId = "" Then
            Throw New InvalidFieldException.MissingMemberIDException()
        End If

        '/ Mhnet Member Validation
        If Not _bMhnetMember Then
            Select Case _sRateCode
                Case "HCUSA", "HUMANAFLHMO", "HUMANAFLPPO", "HUMANA"
                    'Commented out BB 2003-03-04 line from criteria because added humanafl
                    'And frmMain.cboRateCode = "HCUSA" Then
                    Throw New InvalidFieldException.MHNetMemberException()
                Case Else
            End Select
        End If

        '/ Last name Validation - chars "A-Z,.-'0-9"
        Select Case ValidateLastName(_sInLastName, _sLastnameChars)
            Case 0
                Throw New InvalidFieldException.MissingLastnameException()
            Case 1
                Throw New InvalidFieldException.BadFormatLastnameException()
            Case Else
        End Select

        '/ First name validation - chars "A-Z.'"
        Select Case ValidateFirstName(_sInFirstName, _sFirstNameChars)
            Case 0
                Throw New InvalidFieldException.MissingFirstnameException()
            Case 1
                Throw New InvalidFieldException.BadFormatFirstnameException()
            Case Else
        End Select

        '/ Middle name validation - chars "A-Z"
        If ValidateMiddleName(_sInMiddleName, _sMiddleNameChars) <> True Then
            Throw New InvalidFieldException.BadFormatMiddlenameException()
        End If

        '/ City name validation - chars "A-Z.-'"
        If _sInCity <> "" Then
            _sInCity = _sInCity.Replace("/", "") 'added 20050221 BB
            _sInCity = _sInCity.Replace("\", "") 'added 20050221 BB
            _sInCity = _sInCity.Replace(",", "") 'added 20050221 BB
            If ValidateCityName(_sInCity, _sCityNameChars) <> True Then
                Throw New InvalidFieldException.BadFormatCityException()
            End If
        End If

        '/ State name validation - chars "A-Z"
        If ValidateStateName(_sInState, _sStateNameChars) <> True Then
            Throw New InvalidFieldException.BadFormatStateException()
        End If

        '/ SSN validation - make sure SSN is only numeric if it exists
        If _sInSSN <> "" Then
            _sInSSN = Mid(_sInSSN, 1, 9)
            If Not IsNumeric(_sInSSN) Then
                Throw New InvalidFieldException.BadFormatSSNException()
            End If
        End If

        '/ Phone validation - make sure Phone is only numeric if it exists
        If _sInPhone <> "" Then
            _sInPhone = _sInPhone.Replace("-", "")
            _sInPhone = _sInPhone.Replace(" ", "") 'added 20050221 BB
            _sInPhone = _sInPhone.Replace("*", "") 'added 20050221 BB
            _sInPhone = _sInPhone.Replace(".", "") 'added 20050221 BB
            _sInPhone = _sInPhone.Replace("/", "") 'added 20050221 BB
            If Not IsNumeric(_sInPhone) Then
                Throw New InvalidFieldException.BadFormatPhoneException()
            End If
        End If

        '/ Date of Birth Validation
        sTmpHold = AddDateDashes(_sInDOB)
        If Not IsDateValid(sTmpHold) Then
            Throw New InvalidFieldException.DateOfBirthException()
        Else
            _sInDOB = System.String.Format("{0:yyyyMMdd}", CType(sTmpHold, Date))
        End If
        If sTmpHold > _sMagicTermDateWithDashes Then
            _sInDOB = _sMagicTermDate
        End If
        '_sInDOB = CheckMaxDate(_sInDOB)

        sTmpHold = AddDateDashes(_sInOptionEffDate)
        If IsDateValid(sTmpHold) Then
            If System.String.Format("{0:yyyyMMdd}", _sInOptionEffDate) < System.String.Format("{0:yyyyMMdd}", _sInceptionDate) Then
                _sInOptionEffDate = System.String.Format("{0:yyyyMMdd}", _sInceptionDate)
            Else
                _sInOptionEffDate = System.String.Format("{0:yyyyMMdd}", _sInOptionEffDate)
            End If
        Else ' _sInOptionEffDate in not valid
            Throw New InvalidFieldException.OptionEffDateException()
        End If
        'End If
        'Commented out above Code, goes with above comment out code block bb 2002-12-12
        If sTmpHold > _sMagicTermDateWithDashes Then
            _sInOptionEffDate = _sMagicTermDate
        End If
        '_sInOptionEffDate = CheckMaxDate(_sInOptionEffDate)

        '/ if this contains "        " (8 blanks) then this allows the
        'code to choose inception or filedate for hpeffdate
        'Commented code below while re-writing in .Net
        'as for an invalid date it was always defaulting to an empty string and was processed
        'when it should actually be errored.
        'If Not IsDateValid(AddDateDashes(_sInHPEffDate)) Then
        '    _sInHPEffDate = ""
        'End If
        sTmpHold = AddDateDashes(_sInHPEffDate)
        If _sInHPEffDate = "" Then
            'If the file date is before the inception date, use the inception date.
            'Otherwise, use the file date.
            If System.String.Format("{0:yyyyMMdd}", FileDateTime(_sInputFileLocation).ToString) < System.String.Format("{0:yyyyMMdd}", _sInceptionDate) Then
                _sInHPEffDate = System.String.Format("{0:yyyyMMdd}", _sInceptionDate)
            Else
                _sInHPEffDate = System.String.Format("{0:yyyyMMdd}", FileDateTime(_sInputFileLocation).ToString)
            End If
        Else
            If IsDateValid(sTmpHold) Then
                _sInHPEffDate = System.String.Format("{0:yyyyMMdd}", CType(sTmpHold, Date))
            Else
                Throw New InvalidFieldException.HPEffDateException()
            End If
        End If
        '_sInHPEffDate = CheckMaxDate(_sInHPEffDate)
        If sTmpHold > _sMagicTermDateWithDashes Then
            _sInHPEffDate = _sMagicTermDate
        End If

        '/ Benefit Option Validation
        If _sInBenefitOption = "" Then
            Throw New InvalidFieldException.MissingBenefitOptionException()
        End If

        '/ Employer Group Validation
        If _sInEmployerGroup = "" And _bValidateEmployerGroup Then
            Throw New InvalidFieldException.MissingEmployerGroupException()
        End If

        '/ Set the Term Date to 12.31.2078 if the Term date is not a valid date.
        sTmpHold = AddDateDashes(_sInTermDate)
        If IsDateValid(sTmpHold) Then
            _sInTermDate = System.String.Format("{0:yyyyMMdd}", CType(sTmpHold, Date))
        Else
            _sInTermDate = _sMagicTermDate
        End If

        ''/ Term Date Validation
        ''/ If (msInTermDate < Format(Now(), "yyyymmdd")) Then
        If (_sInTermDate < System.String.Format("{0:yyyyMMdd}", CType(_sInceptionDate, Date))) Then
            'msOutErrorRec = msOutErrorRec & " Invalid Term Date Error: " & msInTermDate
            '/ changed per Kit on 4-17-2001
            _sInTermDate = System.String.Format("{0:yyyyMMdd}", CType(_sInceptionDate, Date))
            Throw New InvalidFieldException.TermDateException()
        End If
        '_sInTermDate = CheckMaxDate(_sInTermDate)
        If _sInTermDate > _sMagicTermDateWithDashes Then
            _sInTermDate = _sMagicTermDate
        End If

        '/ Employer Group Aniversary date validation
        If Not IsDateValid(AddDateDashes(_sInEmployerGroupAnivDate)) Then
            _sInEmployerGroupAnivDate = ""
        Else
            _sInEmployerGroupAnivDate = System.String.Format("{yyyymmdd}", AddDateDashes(_sInEmployerGroupAnivDate))
        End If
        '_sInEmployerGroupAnivDate = CheckMaxDate(_sInEmployerGroupAnivDate)
        If _sInEmployerGroupAnivDate > _sMagicTermDateWithDashes Then
            _sInEmployerGroupAnivDate = _sMagicTermDate
        End If

        '/ If the Head of House is blank and the element was NOT supplied in the
        '/ submitted positive enrollment file, use the left nine characters
        '/ of the Carrier Member ID.
        If _sInHeadOfHouse = "" And _bValidateHeadofHouse = False Then
            _sInHeadOfHouse = _sInCarrierMemId.Substring(0, 9)
        End If

        '/ Head of House Validation - chars "A-Z,.-'0-9"
        Select Case ValidateHeadHouse(_sInHeadOfHouse, _sHeadOfHouseChars)
            Case 0
                'If the Head of House element was supplied and was blank, reject the record.
                Throw New InvalidFieldException.MissingHeadofHouseException()
            '/ If Head of House contains garbage chars reject the record
            Case 1
                Throw New InvalidFieldException.BadFormatHeadofHouseException()
            Case Else
        End Select

        '/ Primary Status Validation
        '/ If the Primary Status is blank and the Primary Status element was NOT
        '/ submitted as an element of the positive enrollment file, use "P"
        If _sInPrimaryStatus = "" And _bValidatePrimaryStatus = False Then
            _sInPrimaryStatus = "P"
        '/ If the Primary Status element was supplied and was blank, reject the record.
        ElseIf _sInPrimaryStatus = "" And _bValidatePrimaryStatus = True Then
            Throw New InvalidFieldException.MissingPrimaryStatusException()
        Else '/ it was supplied, make sure it is a P or S
            Select Case _sInPrimaryStatus.ToUpper
                Case "P", "S"
                Case Else
                    Throw New InvalidFieldException.BadFormatPrimaryStatusException()
            End Select
        End If

        '/ Enroll Type Validation
        '/ If Enroll Type is blank and it was not one of the supplied elements in
        '/ the health plans positive enrollment file, set Enroll Type to "I".
        If _sInEnrollType = "" And _bValidateEnrollType = False Then
            _sInEnrollType = "I"
        '/ If the Enroll Type element was supplied and was blank, reject the record.
        ElseIf _sInEnrollType = "" And _bValidateEnrollType = True Then
            Throw New InvalidFieldException.MissingEnrollTypeException()
        Else '/ it was supplied, make sure it it a I,S,D,or C
            Select Case _sInEnrollType.ToUpper
                Case "I", "S", "D", "C"
                Case Else
                    Throw New InvalidFieldException.BadFormatEnrollTypeException()
            End Select
        End If

        '/ If Marital status is supplied and was blank reject
        If _sInMaritalStatus = "" And _bValidateMaritalStatus = True Then
            Throw New InvalidFieldException.MissingMaritalStatusException()
        '/ assure that only "S" and "M" are passed
        Else
            Select Case _sInMaritalStatus.ToUpper
                Case "S", "M", ""
                Case Else
                    Throw New InvalidFieldException.BadFormatMaritalStatusException()
            End Select
        End If
    End Sub

    Protected Overridable Function ValidateLastName(ByVal sSuspect As String, ByVal sGoodChars As String) As Integer
        If sSuspect.Length = 0 Then
            Return 0
        End If
        If DoesBadCharExist(sSuspect, sGoodChars) = True Then
            Return 1
        Else
            Return 2
        End If
    End Function

    Protected Overridable Function ValidateFirstName(ByVal sSuspect As String, ByVal sGoodChars As String) As Integer
        If sSuspect.Length = 0 Then
            Return 0
        End If
        If DoesBadCharExist(sSuspect, sGoodChars) = True Then
            Return 1
        Else
            Return 2
        End If
    End Function

    Protected Overridable Function AddDateDashes(ByVal sSuspect As String) As String
        '/ add dashes to dates so that Isdate function willl work properly
        '/ 2000-12-26 rlt
        Dim sCached As String
        sCached = sSuspect.Trim
        If sCached.Length = 8 Then
            Return sCached.Substring(0, 4) & "-" & sCached.Substring(4, 2) & "-" & sCached.Substring(6, 2)
        Else
            Return sCached
        End If
    End Function

    Protected Overridable Function DoesBadCharExist(ByVal sSuspect As String, ByVal sGoodChars As String) As Boolean
        Dim iCount As Integer
        For iCount = 0 To sSuspect.Length - 1
            If sGoodChars.IndexOf(sSuspect.ToUpper.Chars(iCount)) < 0 Then
                Return True
            End If
        Next iCount
        Return False
    End Function

I could be wrong but, I'm sure that MeltingPoint didn't realise that you are
dealing with a 'fixed' record when he alluded to using Regex for the parsing. However, Regex would certainly be of assistance in the validation. Careful construction of Regex expressions would bring efficiencies to this method, e.g. it would make DoesBadCharExist obsolete.

Now, don't take this the wrong way, but from the fragments you have supplied and the obvious complexity of the operation, it is getting into the area where you might be better off engaging a consultant to review the project and make recommendations.

Analysing the overall operation and making the appropriate recommendations would take a number of hours, if not days, and it would be unfair to expect those who donate their time and expertise, quite freely I might add, to advise on something with the scope of your project without being given the full picture.

My analysis of your fragments is that there is a lot more to your 'problem' than meets the eye. I consider that if you try to get advice 'piecemeal' then you won't end up getting the performance boost you are looking for, and/or you will get advice that is entirely appropriate for the fragment in question but might cause problems for you in the 'bigger picture'.

That said, feel free to post 'questions' about specific things that you would like advice on, like 'How would I go about doing a benchmark test to see if Mid is more efficient than Substring' or 'How would I construct a Regex expression to make sure a string contains only certain characters'.
this contains " " (8 blanks) then this allows the > 'code to choose inception or filedate for hpeffdate > 'Commented code below while re-writing in .Net > 'as for an invalid date it was always defaulting to an empty > string and was processed > 'when it should actually be errored. > 'If Not IsDateValid(AddDateDashes(_sInHPEffDate)) Then > ' _sInHPEffDate = "" > 'End If > > sTmpHold = AddDateDashes(_sInHPEffDate) > If _sInHPEffDate = "" Then > 'If the file date is before the inception date, use the > inception date. > 'Otherwise, use the file date. > If System.String.Format("{0:yyyyMMdd}", > FileDateTime(_sInputFileLocation).ToString) < > System.String.Format("{0:yyyyMMdd}", _sInceptionDate) Then > _sInHPEffDate = System.String.Format("{0:yyyyMMdd}", > _sInceptionDate) > Else > _sInHPEffDate = System.String.Format("{0:yyyyMMdd}", > FileDateTime(_sInputFileLocation).ToString) > End If > Else > If IsDateValid(sTmpHold) Then > _sInHPEffDate = System.String.Format("{0:yyyyMMdd}", > CType(sTmpHold, Date)) > Else > Throw New InvalidFieldException.HPEffDateException() > End If > End If > '_sInHPEffDate = CheckMaxDate(_sInHPEffDate) > If sTmpHold > _sMagicTermDateWithDashes Then > _sInHPEffDate = _sMagicTermDate > End If > '/ Benefit Option Validation > If _sInBenefitOption = "" Then > Throw New > InvalidFieldException.MissingBenefitOptionException() > End If > > '/ Employer Group Validation > If _sInEmployerGroup = "" And _bValidateEmployerGroup Then > Throw New > InvalidFieldException.MissingEmployerGroupException() > End If > > '/ Set the Term Date to 12.31.2078 if the Term date is not a > valid date. 
> sTmpHold = AddDateDashes(_sInTermDate) > If IsDateValid(sTmpHold) Then > _sInTermDate = System.String.Format("{0:yyyyMMdd}", > CType(sTmpHold, Date)) > Else > _sInTermDate = _sMagicTermDate > End If > > ''/ Term Date Validation > ''/ If (msInTermDate < Format(Now(), "yyyymmdd")) Then > If (_sInTermDate < System.String.Format("{0:yyyyMMdd}", > CType(_sInceptionDate, Date))) Then > 'msOutErrorRec = msOutErrorRec & " Invalid Term Date Error: > " & msInTermDate > '/ changed per Kit on 4-17-2001 > _sInTermDate = System.String.Format("{0:yyyyMMdd}", > CType(_sInceptionDate, Date)) > Throw New InvalidFieldException.TermDateException() > End If > '_sInTermDate = CheckMaxDate(_sInTermDate) > If _sInTermDate > _sMagicTermDateWithDashes Then > _sInTermDate = _sMagicTermDate > End If > > '/ Employer Group Aniversary date validation > If Not IsDateValid(AddDateDashes(_sInEmployerGroupAnivDate)) > Then > _sInEmployerGroupAnivDate = "" > Else > _sInEmployerGroupAnivDate = > System.String.Format("{yyyymmdd}", > AddDateDashes(_sInEmployerGroupAnivDate)) > End If > '_sInEmployerGroupAnivDate = > CheckMaxDate(_sInEmployerGroupAnivDate) > If _sInEmployerGroupAnivDate > _sMagicTermDateWithDashes Then > _sInEmployerGroupAnivDate = _sMagicTermDate > End If > > '/ If the Head of House is blank and the element was NOT > supplied in the > '/ submitted positive enrollment file, use the left nine > characters > '/ of the Carrier Member ID. > If _sInHeadOfHouse = "" And _bValidateHeadofHouse = False Then > _sInHeadOfHouse = _sInCarrierMemId.Substring(0, 9) > End If > > '/ Head of House Validation - chars "A-Z,.-'0-9" > Select Case ValidateHeadHouse(_sInHeadOfHouse, > _sHeadOfHouseChars) > Case 0 'If the Head of House element was supplied and was > blank, reject the record. 
> Throw New > InvalidFieldException.MissingHeadofHouseException() > '/ If Head of House contains garbage chars reject the > record > Case 1 > Throw New > InvalidFieldException.BadFormatHeadofHouseException() > Case Else > End Select > > '/ Primary Status Validation > '/ If the Primary Status is blank and the Primary Status > element was NOT > '/ submitted as an element of the positive enrollment file, use > "P" > If _sInPrimaryStatus = "" And _bValidatePrimaryStatus = False > Then > _sInPrimaryStatus = "P" > '/ If the Primary Status element was supplied and was > blank, reject the record. > ElseIf _sInPrimaryStatus = "" And _bValidatePrimaryStatus = > True Then > Throw New > InvalidFieldException.MissingPrimaryStatusException() > Else '/ it was supplied, make sure it is a P or S > Select Case _sInPrimaryStatus.ToUpper > Case "P", "S" > Case Else > Throw New > InvalidFieldException.BadFormatPrimaryStatusException() > End Select > End If > > '/ Enroll Type Validation > '/ If Enroll Type is blank and it was not one of the supplied > elements in > '/ the health plans positive enrollment file, set Enroll Type > to "I". > If _sInEnrollType = "" And _bValidateEnrollType = False Then > _sInEnrollType = "I" > '/ If the Enroll Type element was supplied and was blank, > reject the record. 
> ElseIf _sInEnrollType = "" And _bValidateEnrollType = True Then > Throw New > InvalidFieldException.MissingEnrollTypeException() > Else '/ it was supplied, make sure it it a I,S,D,or C > Select Case _sInEnrollType.ToUpper > Case "I", "S", "D", "C" > Case Else > Throw New > InvalidFieldException.BadFormatEnrollTypeException() > End Select > End If > > '/ If Marital status is supplied and was blank reject > If _sInMaritalStatus = "" And _bValidateMaritalStatus = True > Then > Throw New > InvalidFieldException.MissingMaritalStatusException() > '/ assure that only "S" and "M" are passed > Else > Select Case _sInMaritalStatus.ToUpper > Case "S", "M", "" > Case Else > Throw New > InvalidFieldException.BadFormatMaritalStatusException() > End Select > End If > End Sub > > Protected Overridable Function ValidateLastName(ByVal sSuspect As > String, ByVal sGoodChars As String) As Integer > If sSuspect.Length = 0 Then > Return 0 > End If > If DoesBadCharExist(sSuspect, sGoodChars) = True Then > Return 1 > Else > Return 2 > End If > End Function > > Protected Overridable Function ValidateFirstName(ByVal sSuspect As > String, ByVal sGoodChars As String) As Integer > If sSuspect.Length = 0 Then > Return 0 > End If > If DoesBadCharExist(sSuspect, sGoodChars) = True Then > Return 1 > Else > Return 2 > End If > End Function > > Protected Overridable Function AddDateDashes(ByVal sSuspect As String) > As String > '/ add dashes to dates so that Isdate function willl work > properly > '/ 2000-12-26 rlt > Dim sCached As String > > sCached = sSuspect.Trim > If sCached.Length = 8 Then > Return sCached.Substring(0, 4) & "-" & sCached.Substring(4, > 2) & "-" & sCached.Substring(6, 2) > Else > Return sCached > End If > End Function > > Protected Overridable Function DoesBadCharExist(ByVal sSuspect As > String, ByVal sGoodChars As String) As Boolean > Dim iCount As Integer > For iCount = 0 To sSuspect.Length - 1 > If sGoodChars.IndexOf(sSuspect.ToUpper.Chars(iCount)) < 0 > Then > Return True 
> End If
> Next iCount
> Return False
> End Function
>

"hillcountry74" <shruth***@yahoo.com> wrote in news:1112110584.467762.175290@o13g2000cwo.googlegroups.com:

<lots o code>

Some good ideas so far. I've started to put the regex expression together for you, and could have it done in a few hours. If you want to send me one of these files (important info changed, of course), I could fine-tune the expression. macmanic(zero)(zero)atHotmail.com

Note to anyone else reading this thread: any ideas on the speed of regex as opposed to Substring/IndexOf? I can say for sure that I've parsed a 4 MB file with regex in a few hundred milliseconds.

MeltingPoint,
Thanks for your help. I've just emailed a sample file (3.4MB). MeltingPoint wrote: Show quoteHide quote > "hillcountry74" <shruth***@yahoo.com> wrote in > news:1112110584.467762.175290@o13g2000cwo.googlegroups.com: > > <lots o code> > > Some good ideas so far. I've started to put the regex expression together > for you, could have it done in a few hours. If you want to send me one of > these files, (important info changed of course) I could fine tune the > expression. macmanic(zero)(zero)atHotmail.com > > Note to anyone else reading this thread, Any ideas on the speed of regex as > opposed to Substring/IndexOf. I can say for sure that I've parsed a 4mb > file with regex in a few hundred milliseconds. "hillcountry74" <shruth***@yahoo.com> wrote in <lots o code>news:1112110584.467762.175290@o13g2000cwo.googlegroups.com: Some good ideas so far. I've started to put the regex expression together for you, could have it done in a few hours. If you want to send me one of these files, (important info changed of course) I could fine tune the expression. macmanic(zero)(zero)atHotmail.com Note to anyone else reading this thread, Any ideas on the speed of regex as opposed to Substring/IndexOf. I can say for sure that I've parsed a 4mb file with regex in a few hundred milliseconds. ++Just saw stefs comment. I'm not sure what difference it makes as to weather its fixed or not. RegEx still works and its alot easier on the eyes:) ((?<ActionCode>.) (?<CarrierID>\d{0,25}) (?<LastName>\w{0,60}\s*\b) (?<FirstName>\w{0,30}\s*\b) (?<MiddleName>\w{0,15}\s*\b) (?<Addr1>.{0,60}\s*\b) (?<Addr2>.{0,60}\s*\b) (?<City>.{0,30}\s*\b) (?<State>.{0,2}\s*\b) (?<Zip>.{0,10}\s*\b)) Actually the fact that its fixed makes it easier. And a note as to how close I was paying attention: sPreferredInputFile.Trim.Length >= 439 does not 'allude' to me that it is totally fixed. However, I don't know Stef, she probably knows more than me, considering I just started using RegEx a month ago. 
But the above Regex does match the following:

e8374837463784958473627495Sc9ott 8nglis Micheal 554 sdf sdf 667 rtert ertwert Hell FL90210

Which I think is what the record looks like (at least so far).

Let me know, both of you :)
MP

I surrender. I was having an aberration and thinking of Regex for simple pattern matching rather than its 'extraction' capability.

_sInHeadOfHouse = Trim(Mid(sPreferredInputFile, 431, 9))
....
_sInPrimaryStatus = Trim(Mid(sPreferredInputFile, 440, 1))
_sInEnrollType = Trim(Mid(sPreferredInputFile, 441, 1))
Try
_sInMaritalStatus = Trim(Mid(sPreferredInputFile, 442, 1))
....

This stuff here indicates to me that the record is, more than likely, fixed. Note the Try ... Catch ... End Try to catch if there are not 442 characters, but there is no matching construct for position 440 and 441. The earlier test is for a record length of 439 characters or more, so the record might be 439, 440, 441 or 442 characters. The catcher on position 442 implies that characters 440 and 441 are always present. I read between the lines and decided that 442 'should' always be present. Given that hillcountry74 hasn't provided all the information this was a 50/50 call, but for the purposes of the exercise it is largely irrelevant.

I have no problem with being proved wrong, but I don't think that your regex will work for parsing here.

In your example thus far you rely on there being exactly 25 digits for CarrierID. If there are fewer then your match attempt for LastName won't start at position 27. Remember that the start position for each component of the string is specifically defined. Also there is no indication that CarrierID is numeric, which means that it should use . instead of \d. To read the correct number of characters, the quantifier must be {25} rather than {0,25}, and this means that you have read any trailing spaces as well, which still have to be trimmed off when the matches are read out.

(?<LastName>\w{0,60}\s*\b) will only handle simple names - those with no imbedded spaces or punctuation characters like "van Allen", "O'Brien", "Mandeville-Brown". Also it is common for company names to be stored in a LastName field and other name fields left blank, like "Acme Inc.". \w will miss imbedded spaces, apostrophes, hyphens and periods.
Another factor is that you get idiots hitting the spacebar just as they are starting to type a name and never correcting it so you can get " Smith". The \w will report no match at all in this case. Use of the \b will only make things worse in such cases. In this case I think that the Mid or SubString methods are best for the actual parsing, however regex will certainly make the validation routine more compact and efficient because here you are operating on each individual string rather than trying to pick the character sequence from postion x to position y and therefore 2nd guessing what is actually there or not there as the case may be. BTW: I have a perfectly good name - there is no need to assume that it needs contracting or that the spelling needs changing. Show quoteHide quote "MeltingPoint" <n***@all.com> wrote in message news:Ca6dnTEa3_YVYNTfRVn-jg@rogers.com... > "hillcountry74" <shruth***@yahoo.com> wrote in > news:1112110584.467762.175290@o13g2000cwo.googlegroups.com: > > <lots o code> > > Some good ideas so far. I've started to put the regex expression > together for you, could have it done in a few hours. If you want to send > me one of these files, (important info changed of course) I could fine > tune the expression. macmanic(zero)(zero)atHotmail.com > > Note to anyone else reading this thread, Any ideas on the speed of regex > as opposed to Substring/IndexOf. I can say for sure that I've parsed a > 4mb file with regex in a few hundred milliseconds. > > ++Just saw stefs comment. I'm not sure what difference it makes as to > weather its fixed or not. RegEx still works and its alot easier on the > eyes:) > ((?<ActionCode>.) > (?<CarrierID>\d{0,25}) > (?<LastName>\w{0,60}\s*\b) > (?<FirstName>\w{0,30}\s*\b) > (?<MiddleName>\w{0,15}\s*\b) > (?<Addr1>.{0,60}\s*\b) > (?<Addr2>.{0,60}\s*\b) > (?<City>.{0,30}\s*\b) > (?<State>.{0,2}\s*\b) > (?<Zip>.{0,10}\s*\b)) > Actually the fact that its fixed makes it easier. 
> > And a note as to how close I was paying attention: > sPreferredInputFile.Trim.Length >= 439 > does not 'allude' to me that it is totally fixed. > > However, I don't know Stef, she probably knows more than me, considering > I just started using RegEx a month ago. But the above Regex does match > the following: > > e8374837463784958473627495Sc9ott 8nglis > Micheal 554 sdf sdf > 667 rtert ertwert Hell > FL90210 > > Which I think is what the record looks like (at least so far) > > Let me know, both of you :) > MP
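The suggestion above, that a regex could make the validation routine more compact, can be sketched roughly as follows. This is only an illustration, not the poster's code: the character class is an assumption taken from the code comment "Last name Validation - chars A-Z,.-'0-9" (the real contents of _sLastnameChars were never posted), and the module and field names are made up for the example.

```vb
Imports System
Imports System.Text.RegularExpressions

Module ValidationSketch
    ' Hypothetical stand-in for DoesBadCharExist + _sLastnameChars.
    ' Matches any single character that is NOT a letter, digit, comma,
    ' period, apostrophe or hyphen. Built once and reused for every record.
    Private ReadOnly LastNameBadChar As New Regex("[^A-Za-z0-9,.'-]", RegexOptions.Compiled)

    ' Same return codes as ValidateLastName in the quoted code:
    ' 0 = missing, 1 = contains a disallowed character, 2 = valid.
    Function ValidateLastName(ByVal suspect As String) As Integer
        If suspect.Length = 0 Then Return 0
        If LastNameBadChar.IsMatch(suspect) Then Return 1
        Return 2
    End Function

    Sub Main()
        Console.WriteLine(ValidateLastName(""))          ' prints 0
        Console.WriteLine(ValidateLastName("O'Brien"))   ' prints 2
        Console.WriteLine(ValidateLastName("Sm#th"))     ' prints 1
    End Sub
End Module
```

Note that a space is not in the assumed character class, so "van Allen" would be rejected; if embedded spaces are legal in the real data, a space would have to be added inside the brackets.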
"Stephany Young" <noone@localhost> wrote in news:O$CKyCNNFHA.1176@TK2MSFTNGP15.phx.gbl:

> I surrender. I was having an abberration and thinking of Regex for
> simple pattern matching rather than it's 'extraction' capability.
>
> _sInHeadOfHouse = Trim(Mid(sPreferredInputFile, 431, 9))
> ...
> _sInPrimaryStatus = Trim(Mid(sPreferredInputFile, 440, 1))
> _sInEnrollType = Trim(Mid(sPreferredInputFile, 441, 1))
> Try
> _sInMaritalStatus = Trim(Mid(sPreferredInputFile, 442, 1))
> ...
>
> This stuff here indicates to me that the record is, more than likely,
> fixed. Note the Try ... Catch ... End Try to catch if there are not
> 442 characters, but there is no matching construct for position 440
> and 441. The earlier test is for a record length of 439 characters or
> more, so the record might be 439, 440, 441 or 442 characters. The
> catcher on position 442 implies that characters 440 and 441 are always
> present. I read between the lines and decided that 442 'should' always
> be present. Given that hilcountry74 hasn't provided all the
> information this was a 50/50 call but for the purposes of the exercise
> is largely irrelevant.
>
> I have no problem with being proved wrong, but I don't think that your
> regex will work for parsing here.
>
> In your example thus far you rely on there being exactly 25 digits for
> CarrierID. If there are less then your match attempt for LastName
> won't start at position 27. Remember that the start position for each
> component of the string is specifically defined. Also there is no
> indication that CarrierID is numeric which means that it should use .
> instead of \d. To read the correct number of characters, the
> quantifier must be {25} rather than {0,25} and this means that you
> have read any trailing spaces as well which still have to be trimmed
> off when the matches are read out.
>
> (?<LastName>\w{0,60}\s*\b) will only handle simple names - those with
> no imbedded spaces or punctuation characters like "van Allen",
> "O'Brien", "Mandeville-Brown". Also it is common for company names to
> be stored in a LastName field and other name fields left blank like
> "Acme Inc.". \w will miss imbedded spaces, apostrophes, hyphens and
> periods. Another factor is that you get idiots hitting the spacebar
> just as they are starting to type a name and never correcting it so
> you can get " Smith". The \w will report no match at all in this case.
> Use of the \b will only make things worse in such cases.
>
> In this case I think that the Mid or SubString methods are best for
> the actual parsing, however regex will certainly make the validation
> routine more compact and efficient because here you are operating on
> each individual string rather than trying to pick the character
> sequence from postion x to position y and therefore 2nd guessing what
> is actually there or not there as the case may be.
>
> BTW: I have a perfectly good name - there is no need to assume that it
> needs contracting or that the spelling needs changing.

I knew I would catch it for that :) Force of habit from my personal life :)

OK, just checked it: imbedded spaces screw it up. And there's nothing I can think of readily. I've seen some funky reg exps - I'm sure it can be done, but not by me :) I tried just doing:

((?<ActionCode>.{1})" _
& "(?<CarrierID>.{25})" _
& "(?<LastName>.{60})" _
& "(?<FirstName>.{30})" _
& "(?<MiddleName>.{15})" _
& "(?<Addr1>.{60})" _
& "(?<Addr2>.{60})" _
& "(?<City>.{30)" _
& "(?<State>.{2})" _
& "(?<Zip>.{10}))"

....and my computer actually laughed at me!!

Back to the drawing board...

You have a typo in your "(?<City>.{30)" - a missing }
Anyway, this works a treat with the caveat that the target string has to be the expected length (442) or longer. On my machine 10000 takes 1 second give or take a few milliseconds and 100000 iterations takes 10 seconds give or take a few milliseconds. It is fair to say that, as writ and on my machine, as a parser it will handle approx 1000 records per second. So, I stand educated, you can do rudimentary parsing with Regex so long as the expression is very carefully constructed. Dim _s As String = "ACarrierID<16 spaces>" & _ "LastName<52 spaces>" & _ "FirstName<21 spaces>" & _ "MiddleName<5 spaces>" & _ "Addr1<55 spaces>" & _ "Addr2<55 spaces>" & _ "City<26 spaces>" & _ "StZip<7 spaces>" & _ "BenefitOption<47 spaces>" & _ "EmployerGroup OptionEfHPEffDatTermDate" & _ "SDOB<5 spaces>" & _ "SSN<6 spaces>" & _ "Phone<7 spaces>" & _ "EmployerHeadOfHouPM" Dim _exp As String = "(?<ActionCode>.{1})" & _ "(?<CarrierID>.{25})" & _ "(?<LastName>.{60})" & _ "(?<FirstName>.{30})" & _ "(?<MiddleName>.{15})" & _ "(?<Addr1>.{60})" & _ "(?<Addr2>.{60})" & _ "(?<City>.{30})" & _ "(?<State>.{2})" & _ "(?<Zip>.{10})" & _ "(?<BenefitOption>.{60})" & _ "(?<EmployerGroup>.{15})" & _ "(?<OptionEffDate>.{8})" & _ "(?<HPEffDate>.{8})" & _ "(?<TermDate>.{8})" & _ "(?<Sex>.{1})" & _ "(?<DOB>.{8})" & _ "(?<SSN>.{9})" & _ "(?<Phone>.{12})" & _ "(?<EmployerGroupAnivDate>.{8})" & _ "(?<HeadOfHouse>.{9})" & _ "(?<PrimaryStatus>.{1})" & _ "(?<MaritalStatus>.{1})" Dim r As Regex = New Regex(_exp) Dim m As Match = r.Match(_s) Dim _sInActionCode As String = m.Groups("ActionCode").ToString.Trim Dim _sInCarrierID As String = m.Groups("CarrierID").ToString.Trim Dim _sInLastName As String = m.Groups("LastName").ToString.Trim Dim _sInFirstName As String = m.Groups("FirstName").ToString.Trim Dim _sInMiddleName As String = m.Groups("MiddleName").ToString.Trim Dim _sInAddr1 As String = m.Groups("Addr1").ToString.Trim Dim _sInAddr2 As String = m.Groups("Addr2").ToString.Trim Dim _sInCity As String = 
m.Groups("City").ToString.Trim
Dim _sInState As String = m.Groups("State").ToString.Trim
Dim _sInZip As String = m.Groups("Zip").ToString.Trim
Dim _sInBenefitOption As String = m.Groups("BenefitOption").ToString.Trim
Dim _sInEmployerGroup As String = m.Groups("EmployerGroup").ToString.Trim
Dim _sInOptionEffDate As String = m.Groups("OptionEffDate").ToString.Trim
' Each variable reads the capture group of the same name
Dim _sInHPEffDate As String = m.Groups("HPEffDate").ToString.Trim
Dim _sInTermDate As String = m.Groups("TermDate").ToString.Trim
Dim _sInSex As String = m.Groups("Sex").ToString.Trim
Dim _sInDOB As String = m.Groups("DOB").ToString.Trim
Dim _sInSSN As String = m.Groups("SSN").ToString.Trim
Dim _sInPhone As String = m.Groups("Phone").ToString.Trim
Dim _sInEmployerGroupAnivDate As String = m.Groups("EmployerGroupAnivDate").ToString.Trim
Dim _sInHeadOfHouse As String = m.Groups("HeadOfHouse").ToString.Trim
Dim _sInPrimaryStatus As String = m.Groups("PrimaryStatus").ToString.Trim
Dim _sInMaritalStatus As String = m.Groups("MaritalStatus").ToString.Trim

' Print every extracted field
Console.WriteLine("_sInActionCode = " & _sInActionCode)
Console.WriteLine("_sInCarrierID = " & _sInCarrierID)
Console.WriteLine("_sInLastName = " & _sInLastName)
Console.WriteLine("_sInFirstName = " & _sInFirstName)
Console.WriteLine("_sInMiddleName = " & _sInMiddleName)
Console.WriteLine("_sInAddr1 = " & _sInAddr1)
Console.WriteLine("_sInAddr2 = " & _sInAddr2)
Console.WriteLine("_sInCity = " & _sInCity)
Console.WriteLine("_sInState = " & _sInState)
Console.WriteLine("_sInZip = " & _sInZip)
Console.WriteLine("_sInBenefitOption = " & _sInBenefitOption)
Console.WriteLine("_sInEmployerGroup = " & _sInEmployerGroup)
Console.WriteLine("_sInOptionEffDate = " & _sInOptionEffDate)
Console.WriteLine("_sInHPEffDate = " & _sInHPEffDate)
Console.WriteLine("_sInTermDate = " & _sInTermDate)
Console.WriteLine("_sInSex = " & _sInSex)
Console.WriteLine("_sInDOB = " & _sInDOB)
Console.WriteLine("_sInSSN = " & _sInSSN)
Console.WriteLine("_sInPhone = " & _sInPhone)
Console.WriteLine("_sInEmployerGroupAnivDate = " & _sInEmployerGroupAnivDate) Console.WriteLine("_sInHeadOfHouse = " & _sInHeadOfHouse) Console.WriteLine("_sInPrimaryStatus = " & _sInPrimaryStatus) Console.WriteLine("_sInMaritalStatus = " & _sInMaritalStatus) Show quoteHide quote "MeltingPoint" <n***@all.com> wrote in message news:YpmdndvMEa8bvNffRVn-sg@rogers.com... > "Stephany Young" <noone@localhost> wrote in > news:O$CKyCNNFHA.1176@TK2MSFTNGP15.phx.gbl: > > <snip> > > OK just checked it, imbedded spaces screw it up. And theres nothing I > can think of readily. I've seen some funky reg exp's - I'm sure it can > be done but not by me:) I tried just doing: > > ((?<ActionCode>.{1})" _ > & "(?<CarrierID>.{25})" _ > & "(?<LastName>.{60})" _ > & "(?<FirstName>.{30})" _ > & "(?<MiddleName>.{15})" _ > & "(?<Addr1>.{60})" _ > & "(?<Addr2>.{60})" _ > & "(?<City>.{30)" _ > & "(?<State>.{2})" _ > & "(?<Zip>.{10}))" > > ...and my computer actually laughed at me!! > > Back to the drawing board... Stephany,
Thanks for the code. I tried your sample, it doesn't seem to work. I'm assuming _s variable is the string to be parsed and need not necessarily have the fieldnames like Lastname etc, right?

How does the regex engine know to take 26 characters for extracting City and that it is not the first 26 chrs? Please explain. And excuse me for my ignorance. Never used reg exprs.

"hillcountry74" <shruth***@yahoo.com> wrote in news:1112201281.988499.58130@l41g2000cwc.googlegroups.com:

> Stephany,
>
> Thanks for the code. I tried your sample, it doesn't seem to work. I'm
> assuming _s variable is the string to be parsed and need not
> necessarily have the fieldnames like Lastname etc, right?
>
> How does the regex engine know to take 26 characters for extracting
> City and that it is not the first 26 chrs. Please explain. And excuse
> me for my ignorance. Never used reg exprs.
>

Imports System.Text
Imports System.IO
Imports System.Text.RegularExpressions

Module Module1
    Sub Main()
        Dim aStreamReader As TextReader
        aStreamReader = New StreamReader("C:\SAMPLE FILE.txt")
        Dim _s As String = aStreamReader.ReadToEnd
        aStreamReader.Close()
        Dim _exp As String = "((?<ActionCode>.{1})" & _
            "(?<CarrierID>.{25})" & _
            "(?<LastName>.{60})" & _
            "(?<FirstName>.{30})" & _
            "(?<MiddleName>.{15})" & _
            "(?<Addr1>.{60})" & _
            "(?<Addr2>.{60})" & _
            "(?<City>.{30})" & _
            "(?<State>.{2})" & _
            "(?<Zip>.{10})" & _
            "(?<BenefitOption>.{60})" & _
            "(?<EmployerGroup>.{15})" & _
            "(?<OptionEffDate>.{8})" & _
            "(?<HPEffDate>.{8})" & _
            "(?<TermDate>.{8})" & _
            "(?<Sex>.{1})" & _
            "(?<DOB>.{8})" & _
            "(?<SSN>.{9})" & _
            "(?<Phone>.{12})" & _
            "(?<EmployerGroupAnivDate>.{8})" & _
            "(?<HeadOfHouse>.{9})" & _
            "(?<PrimaryStatus>.{1})" & _
            "(?<MaritalStatus>.{1}))"
        Dim r As Regex = New Regex(_exp)
        Dim g As MatchCollection = r.Matches(_s)
        Dim m As Match
        Dim _sInActionCode As String
        Dim _sInCarrierID As String
        Dim _sInLastName As String
        Dim _sInFirstName As String
        Dim _sInMiddleName As String
        Dim _sInAddr1 As String
        Dim _sInAddr2 As
String Dim _sInCity As String Dim _sInState As String Dim _sInZip As String Dim _sInBenefitOption As String Dim _sInEmployerGroup As String Dim _sInOptionEffDate As String Dim _sInHPEffDate As String Dim _sInTermDate As String Dim _sInSex As String Dim _sInDOB As String Dim _sInSSN As String Dim _sInPhone As String Dim _sInEmployerGroupAnivDate As String Dim _sInHeadOfHouse As String Dim _sInPrimaryStatus As String Dim _sInMaritalStatus As String Dim d As New DateTime Dim dt As Double d = DateTime.Now For i As Int32 = 0 To g.Count - 1 m = g.Item(i) _sInActionCode = m.Groups("ActionCode").ToString.Trim _sInCarrierID = m.Groups("CarrierID").ToString.Trim _sInLastName = m.Groups("LastName").ToString.Trim _sInFirstName = m.Groups("FirstName").ToString.Trim _sInMiddleName = m.Groups("MiddleName").ToString.Trim _sInAddr1 = m.Groups("Addr1").ToString.Trim _sInAddr2 = m.Groups("Addr2").ToString.Trim _sInCity = m.Groups("City").ToString.Trim _sInState = m.Groups("State").ToString.Trim _sInZip = m.Groups("Zip").ToString.Trim _sInBenefitOption = m.Groups("BenefitOption").ToString.Trim _sInEmployerGroup = m.Groups("EmployerGroup").ToString.Trim _sInOptionEffDate = m.Groups("OptionEffDate").ToString.Trim _sInHPEffDate = m.Groups("OptionEffDate").ToString.Trim _sInTermDate = m.Groups("HPEffDate").ToString.Trim _sInSex = m.Groups("TermDate").ToString.Trim _sInDOB = m.Groups("DOB").ToString.Trim _sInSSN = m.Groups("SSN").ToString.Trim _sInPhone = m.Groups("Phone").ToString.Trim _sInEmployerGroupAnivDate = m.Groups ("EmployerGroupAnivDate").ToString.Trim() _sInHeadOfHouse = m.Groups("HeadOfHouse").ToString.Trim _sInPrimaryStatus = m.Groups("PrimaryStatus").ToString.Trim _sInMaritalStatus = m.Groups("MaritalStatus").ToString.Trim 'Console.WriteLine() Console.WriteLine(i) 'Console.WriteLine() 'Console.WriteLine("_sInActionCode = " & _sInActionCode) 'Console.WriteLine("_sInCarrierID = " & _sInCarrierID) 'Console.WriteLine("_sInLastName = " & _sInLastName) 
'Console.WriteLine("_sInFirstName = " & _sInFirstName) 'Console.WriteLine("_sInMiddleName = " & _sInMiddleName) 'Console.WriteLine("_sInAddr1 = " & _sInAddr1) 'Console.WriteLine("_sInAddr2 = " & _sInAddr2) 'Console.WriteLine("_sInCity = " & _sInCity) 'Console.WriteLine("_sInState = " & _sInState) 'Console.WriteLine("_sInZip = " & _sInZip) 'Console.WriteLine("_sInBenefitOption = " & _sInBenefitOption) 'Console.WriteLine("_sInEmployerGroup = " & _sInEmployerGroup) 'Console.WriteLine("_sInOptionEffDate = " & _sInOptionEffDate) 'Console.WriteLine("_sInHPEffDate = " & _sInHPEffDate) 'Console.WriteLine("_sInTermDate = " & _sInTermDate) 'Console.WriteLine("_sInSex = " & _sInSex) 'Console.WriteLine("_sInDOB = " & _sInDOB) 'Console.WriteLine("_sInSSN = " & _sInSSN) 'Console.WriteLine("_sInPhone = " & _sInPhone) 'Console.WriteLine("_sInEmployerGroupAnivDate = " & _sInEmployerGroupAnivDate) 'Console.WriteLine("_sInHeadOfHouse = " & _sInHeadOfHouse) 'Console.WriteLine("_sInPrimaryStatus = " & _sInPrimaryStatus) 'Console.WriteLine("_sInMaritalStatus = " & _sInMaritalStatus) Next Dim dt2 = DateTime.Now.Subtract(d).TotalSeconds Console.WriteLine(dt2) Console.ReadLine() End Sub End Module Sorry the code is a little messy. But it works. Parsed 8064 Records in 2 seconds flat. Simulate some other work by outputing everything to a console window an it takes 34 seconds. To answer your question RegEx uses a position marker *simular* to that of reading a file where the position is incremented relative to the amount read (for comparison sakes). So just telling it how much to read is good enough. Thankyou Stephany for writing that all out:) A note about your sample file: I hope fields were left blank, and things like HeadOfHouse is a number, otherwise this isn't working. 
Sample: _sInActionCode = _sInCarrierID = 00000050101 _sInLastName = SMITH _sInFirstName = VICKI _sInMiddleName = _sInAddr1 = C/O SUE EDDY - MISD BENEFITS _sInAddr2 = 405 EAST DAVIS _sInCity = MESQUITE _sInState = TX _sInZip = 75149 _sInBenefitOption = 001 _sInEmployerGroup = 2002MISD _sInOptionEffDate = 20050301 _sInHPEffDate = 20050301 _sInTermDate = 20040401 _sInSex = 20050331 _sInDOB = 19510125 _sInSSN = 000010009 _sInPhone = _sInEmployerGroupAnivDate = _sInHeadOfHouse = 464088770 _sInPrimaryStatus = P _sInMaritalStatus = I Let me know, MP Thanks a lot MP. Really appreciate your help.
Can you please paste the regular expression for this? Can't find it in the code. Also, on the headofhouse, it could be alphanumeric. And yes, some fields would be blank. There could be files of size 400MB. In such a case, reading till endoffile might not work. Instead, if it is changed to reading one line at a time, do you think the speed will reduce? Thanks again.
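For reference, the line-at-a-time variant asked about here can be sketched as below. It is a rough illustration under stated assumptions, not a measured answer to the speed question: the pattern is a shortened, hypothetical three-field layout (the full one is in the thread), and CountRecords is a made-up name. Because only one line is held at a time, memory use does not grow with file size.

```vb
Imports System
Imports System.IO
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module LineAtATimeSketch
    ' Shortened, hypothetical layout: 1-char action code,
    ' 25-char carrier ID, 60-char last name (86 characters in total).
    Private ReadOnly RecordPattern As New Regex( _
        "^(?<ActionCode>.{1})(?<CarrierID>.{25})(?<LastName>.{60})", _
        RegexOptions.Compiled)

    ' Reads one record per line and returns how many lines matched the layout.
    ' Lines that are too short simply fail to match and are skipped.
    Function CountRecords(ByVal reader As TextReader) As Integer
        Dim matched As Integer = 0
        Dim line As String = reader.ReadLine()
        Do While line IsNot Nothing
            If RecordPattern.Match(line).Success Then
                matched += 1
            End If
            line = reader.ReadLine()
        Loop
        Return matched
    End Function

    Sub Main()
        ' In-memory stand-in for the input file: two good records and one short line.
        Dim good As String = "A" & "00000050101".PadRight(25) & "SMITH".PadRight(60)
        Dim data As String = good & vbCrLf & "too short" & vbCrLf & good & vbLf
        Using reader As New StringReader(data)
            Console.WriteLine(CountRecords(reader))   ' prints 2
        End Using
    End Sub
End Module
```

For a real file, the StringReader would be replaced by New StreamReader(path); ReadLine strips the LF or CR/LF terminator, so the pattern never sees it.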
Show quote
Hide quote
"hillcountry74" <shruth***@yahoo.com> wrote in This is the Regular Expression:news:1112223769.480509.272010@g14g2000cwa.googlegroups.com: > Thanks a lot MP. Really appreciate your help. > > Can you please paste the regular expression for this? Can't find it in > the code. > > Also, on the headofhouse, it could be alphanumeric. And yes, some > fields would be blank. > > There could be files of size 400MB. In such a case, reading till > endoffile might not work. Instead, if it is changed to reading one line > at a time, do you think the speed will reduce? > > Thanks again. > Dim _exp As String = "((?<ActionCode>.{1})" & _ "(?<CarrierID>.{25})" & _ "(?<LastName>.{60})" & _ "(?<FirstName>.{30})" & _ "(?<MiddleName>.{15})" & _ "(?<Addr1>.{60})" & _ "(?<Addr2>.{60})" & _ "(?<City>.{30})" & _ "(?<State>.{2})" & _ "(?<Zip>.{10})" & _ "(?<BenefitOption>.{60})" & _ "(?<EmployerGroup>.{15})" & _ "(?<OptionEffDate>.{8})" & _ "(?<HPEffDate>.{8})" & _ "(?<TermDate>.{8})" & _ "(?<Sex>.{1})" & _ "(?<DOB>.{8})" & _ "(?<SSN>.{9})" & _ "(?<Phone>.{12})" & _ "(?<EmployerGroupAnivDate>.{8})" & _ "(?<HeadOfHouse>.{9})" & _ "(?<PrimaryStatus>.{1})" & _ "(?<MaritalStatus>.{1}))" A pretty good definition can be found on MSDN. Search For RegEx or Regular Expressions. :) I'll try to "simulate"(wink wink:) a 400mb file and check performance. Reading one line at a time is out of the question for this experiment, as it would require a couple of million reads(guessing), reading in 442bytes * nRecords would be better if not the best way to do it. BUT this and ReadToEnd both REQUIRE every record to be 442 bytes (or whatever it is) Off by one byte, and kiss you're records goodbye. MP I'm just a little concerned that you might have missed a critical point her
MeltingPoint.

Each record in the file is terminated by a LF or a CR/LF pair. This is shown by the use of the ReadLine method in the original code fragment.

A line is defined as a sequence of characters followed by a line feed or a carriage return immediately followed by a line feed. The string that is returned does not contain the terminating carriage return or line feed. The returned value is a null reference (Nothing in Visual Basic) if the end of the input stream is reached.

If you use the ReadToEnd method then you have to identify what the record delimiter is and split the input into 'records' based on that delimiter before you can apply the RegEx anyway. Unless, of course, the RegEx is preceded by a '^' (together with the Multiline option) to indicate that each match starts at the beginning of a line.

If you don't handle this then each record, subsequent to the first, will be off by 1 or 2 characters, compounding.
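The point about record delimiters can be sketched roughly as follows. This is only an illustration under assumptions: the layout is cut down to three fields (State, Zip and Sex, with the widths from the pattern above), and the module name, function name and sample data are made up. The pattern is anchored to the start of each line under RegexOptions.Multiline, and an optional carriage return is allowed before the end of the line, so every match starts on a record boundary instead of drifting by one or two characters.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module DelimiterSketch

    ' Hypothetical three-field layout; the widths are borrowed from the
    ' thread's pattern, everything else here is for illustration only.
    Private Const Pattern As String = _
        "^(?<State>.{2})(?<Zip>.{10})(?<Sex>.{1})\r?$"

    ' Returns one match per line that is exactly 13 characters long.
    Public Function ParseAll(ByVal text As String) As MatchCollection
        ' Multiline makes ^ and $ match at every line break, not only at
        ' the start and end of the whole string.
        Return Regex.Matches(text, Pattern, RegexOptions.Multiline)
    End Function

    Sub Main()
        ' Two 13-character records terminated by CR/LF (made-up data).
        Dim text As String = "TX75149     F" & vbCrLf & _
                             "CA90210     M" & vbCrLf

        For Each m As Match In ParseAll(text)
            Console.WriteLine(m.Groups("State").Value & "|" & _
                              m.Groups("Zip").Value.Trim() & "|" & _
                              m.Groups("Sex").Value)
        Next
    End Sub

End Module
```

Without the anchors, a pattern made only of fixed-width groups would happily swallow the carriage return as the first character of the next record, which is the compounding offset described above.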
I figured it out. Either way, if it is a fixed record then the CR would be included in the record size. So the above would be 443 * nRecords. No harm, no foul. The point is that the file can be evenly divided by the number of bytes in a record plus the delimiter (which is the first thing I asked him 10 posts ago, and was told there was no delimiter).

MP

What was said was that the fields were not delimited.
The fact that there is a record delimiter is a given because of the use of the ReadLine method. Remember that the code in VB6 works, and the 'ReadLine' method is a straight conversion of the 'Line Input' statement, which does, ostensibly, the same thing.

Anyway, the detectives have been at work. On my workstation:

- Parsing a file of 100000 records of 422 characters each, in a line-by-line read using regex, takes approx 73 seconds.
- Parsing the same file in a line-by-line read using the Trim and Mid functions takes approx 15 seconds.
- Parsing the same file in a line-by-line read using the String.Substring and String.Trim methods takes approx 17 seconds.
- The VB6 equivalent takes approx 40 seconds.

As I said in my first post, it is highly likely that the '15 second' difference was due to one of the other methods that is executed on a per-record basis, rather than the reading and parsing of the file, and these results bear that out.

Although it has been an interesting exercise, I don't think that regex is the way to go in this case.

The speed when dealing with IO devices (disks, networks, etc.) is largely
subjective because it depends on things like disk rpm, network bandwidth, network usage, processor type and speed, memory size and a lot of other factors, so it is really not worth losing any sleep over.

Can we digress back to your original post and address your perception of 'slowness'?

You said that your VB.NET version takes 15 seconds longer than your VB6 version. Now, that is 15 seconds longer in relation to what?

- If you take a specific file and run it through the VB6 version then how long does it take?
- If you run that same file through the VB.NET version then how long does that take?
- How many records were in the file?

If you run that same file through the VB.NET version again almost immediately, is the run time any different from the first time?

Then we come to some usage scenario questions:

- At what time of day does the VB6 version run?
- Is it run by a user during the course of the business day?
- Is it run as a 'batch process' at an 'off peak' time?
- At what time of day were you running the VB.NET version?
- Was the VB.NET version run on the same hardware as the VB6 version?

Do you see what I'm driving at? The question really is: are we comparing apples with apples, and is a '15 second' difference really relevant?

If the VB6 version takes, for example, more than 2 minutes to process, say, 10000 records (approx 4MB), then I would suggest that an additional 15 seconds is insignificant.

If, however, the VB6 version takes, for example, less than 10 seconds to process, say, 10000 records (approx 4MB), then, obviously, an additional 15 seconds is highly significant.

You say that files could be up to 400MB, which indicates somewhere around 1000000 records.

- Is this file size a regular occurrence or does this size occur only occasionally?
- How long does the VB6 version take to process a file of this size and how long does the VB.NET version take?
- If it takes longer, is the time difference proportional to the number of records?

For example, if it takes 15 seconds longer to process 10000 records, does it take 1500 seconds longer to process 1000000 records (100 times the records, ergo 100 times longer)?

If, for instance, it always takes 15 seconds longer, regardless of the number of records, then that would indicate that it has nothing to do with your processing code at all; rather, the 'problem' would lie in the general program overhead under .NET.

It would be interesting to hear your comments and/or findings on any/all of these factors, remembering, of course, that the factors I have thrown into the ring are really only scratching the surface.

Stephany, MP
I'm still reading the thread. I guess it's the time difference, so I'm not around when you guys are discussing.

Stephany, to answer some of your questions:

I've tested the same file in both VB6 and VB.NET. This specific file has 4564 records; VB6 processes it in an average of 35 secs and VB.NET takes an average of 45 secs. If I process the same file again in VB.NET, it's about the same speed, +/- 1 sec.

On the usage:
1. Yes, it is run by a user during the business day, and sometimes when the file is very big, like 400MB, it runs through to the following day. This is for the VB6 version.
2. No, it is not run as a batch. Basically, the user selects a file, processes it, and then if there are additional files, continues to process them one at a time.
3. I've been testing the .NET version throughout the day to check whether the time of day makes a difference. But I've noticed in the VB6 version that at times (no specific time of the day) it processes really fast, and then for the same file it slows down and then picks up speed again. Note that no other application is running. Not sure what causes this. On the other hand, the .NET version always processes at about the same speed.

Well, the 400MB files have more than 350K records.

That's exactly it; I was also wondering whether I was comparing apples to apples here or not.

The 400MB file size is regular. Basically, this file is sent by our client; we process the files, convert them to our format and then run a backend job to update the database with this info. We are in the healthcare industry.

I've not tested the VB.NET version with a 400MB file. I just found out from the user who runs the VB6 version that it takes about 7 1/2 hrs. So, I don't know whether there will be a significant difference or not. To begin with, I started testing a smaller file. Since this is 15 secs slower, I decided to debug and try to optimize if necessary before testing a bigger file.

We are re-writing this in VB.NET as most of our other applications are already in .NET and this is one of the older apps.

I'm planning to test the 400MB file sometime today. Will keep you guys posted.

Thanks.

HillCountry,
I told you before that you should test this clean. In your sample I have seen a dataset, generic VB.NET stuff and more that is not possible in VB6, and (from a quick look) you have probably not used the most optimal methods to load bulk data in that part.

Therefore, when you ask whether the VB.NET IO functions are slower than VB6, you should in my opinion test the most common VB6 IO functions against the most common VB.NET IO functions for writing files.

What you are doing now is, in my opinion, comparing apples with fishes.

Just my thought,

Cor

I have just sent an email to the email address that 'shows' for you.
Please let me know either way if you do or don't get it.

In your output, note that you've got some fields out of whack.
> _sInHPEffDate = m.Groups("OptionEffDate").ToString.Trim
> _sInTermDate = m.Groups("HPEffDate").ToString.Trim
> _sInSex = m.Groups("TermDate").ToString.Trim

This shows in that the display of _sInSex is not 1 character.

I don't understand your last comment:

> A note about your sample file: I hope fields were left blank, and things
> like HeadOfHouse is a number, otherwise this isn't working.

What do you mean by 'I hope fields were left blank'? If you mean, for example, ActionCode being left blank if it is not supplied, i.e. a space character as a placeholder for it, then I would assume yes, because otherwise the original parsing routine would never work in the first place. When I refer to a value being 'missing' I am really saying that the character positions that it would normally occupy are filled with spaces.

I don't think that you can assume (unless you are privy to something that I'm not) that HeadOfHouse is numeric. From an earlier post:

Head of House Validation - chars "A-Z,.-'0-9"

so this indicates any combination of characters in the list. The only other thing that can be implied about it is that if it is 'missing' then it is assigned the first 9 characters of CarrierId, and there is no indication that CarrierId should be numeric. Anyway, checking that is the role of validation rather than parsing.

"MeltingPoint" <n***@all.com> wrote in message news:fOCdnd23SO3QjNbfRVn-vA@rogers.com...
> "hillcountry74" <shruth***@yahoo.com> wrote in
> news:1112201281.988499.58130@l41g2000cwc.googlegroups.com:
>
>> Stephany,
>>
>> Thanks for the code. I tried your sample, it doesn't seem to work. I'm
>> assuming _s variable is the string to be parsed and need not
>> necessarily have the fieldnames like Lastname etc, right?
>>
>> How does the regex engine know to take 26 characters for extracting
>> City and that it is not the first 26 chrs. Please explain. And excuse
>> me for my ignorance. Never used reg exprs.
>>
>
> Imports System.Text
> Imports System.IO
> Imports System.Text.RegularExpressions
> Module Module1
>
>     Sub Main()
>
>         Dim aStreamReader As TextReader
>         aStreamReader = New StreamReader("C:\SAMPLE FILE.txt")
>         Dim _s As String = aStreamReader.ReadToEnd
>         aStreamReader.Close()
>
>         Dim _exp As String = "((?<ActionCode>.{1})" & _
>             "(?<CarrierID>.{25})" & _
>             "(?<LastName>.{60})" & _
>             "(?<FirstName>.{30})" & _
>             "(?<MiddleName>.{15})" & _
>             "(?<Addr1>.{60})" & _
>             "(?<Addr2>.{60})" & _
>             "(?<City>.{30})" & _
>             "(?<State>.{2})" & _
>             "(?<Zip>.{10})" & _
>             "(?<BenefitOption>.{60})" & _
>             "(?<EmployerGroup>.{15})" & _
>             "(?<OptionEffDate>.{8})" & _
>             "(?<HPEffDate>.{8})" & _
>             "(?<TermDate>.{8})" & _
>             "(?<Sex>.{1})" & _
>             "(?<DOB>.{8})" & _
>             "(?<SSN>.{9})" & _
>             "(?<Phone>.{12})" & _
>             "(?<EmployerGroupAnivDate>.{8})" & _
>             "(?<HeadOfHouse>.{9})" & _
>             "(?<PrimaryStatus>.{1})" & _
>             "(?<MaritalStatus>.{1}))"
>
>         Dim r As Regex = New Regex(_exp)
>
>         Dim g As MatchCollection = r.Matches(_s)
>         Dim m As Match
>
>         Dim _sInActionCode As String
>         Dim _sInCarrierID As String
>         Dim _sInLastName As String
>         Dim _sInFirstName As String
>         Dim _sInMiddleName As String
>         Dim _sInAddr1 As String
>         Dim _sInAddr2 As String
>         Dim _sInCity As String
>         Dim _sInState As String
>         Dim _sInZip As String
>         Dim _sInBenefitOption As String
>         Dim _sInEmployerGroup As String
>         Dim _sInOptionEffDate As String
>         Dim _sInHPEffDate As String
>         Dim _sInTermDate As String
>         Dim _sInSex As String
>         Dim _sInDOB As String
>         Dim _sInSSN As String
>         Dim _sInPhone As String
>         Dim _sInEmployerGroupAnivDate As String
>         Dim _sInHeadOfHouse As String
>         Dim _sInPrimaryStatus As String
>         Dim _sInMaritalStatus As String
>         Dim d As New DateTime
>         Dim dt As Double
>
>         d = DateTime.Now
>
>         For i As Int32 = 0 To g.Count - 1
>             m = g.Item(i)
>
>             _sInActionCode = m.Groups("ActionCode").ToString.Trim
>             _sInCarrierID = m.Groups("CarrierID").ToString.Trim
>             _sInLastName = m.Groups("LastName").ToString.Trim
>             _sInFirstName = m.Groups("FirstName").ToString.Trim
>             _sInMiddleName = m.Groups("MiddleName").ToString.Trim
>             _sInAddr1 = m.Groups("Addr1").ToString.Trim
>             _sInAddr2 = m.Groups("Addr2").ToString.Trim
>             _sInCity = m.Groups("City").ToString.Trim
>             _sInState = m.Groups("State").ToString.Trim
>             _sInZip = m.Groups("Zip").ToString.Trim
>             _sInBenefitOption = m.Groups("BenefitOption").ToString.Trim
>             _sInEmployerGroup = m.Groups("EmployerGroup").ToString.Trim
>             _sInOptionEffDate = m.Groups("OptionEffDate").ToString.Trim
>             _sInHPEffDate = m.Groups("OptionEffDate").ToString.Trim
>             _sInTermDate = m.Groups("HPEffDate").ToString.Trim
>             _sInSex = m.Groups("TermDate").ToString.Trim
>             _sInDOB = m.Groups("DOB").ToString.Trim
>             _sInSSN = m.Groups("SSN").ToString.Trim
>             _sInPhone = m.Groups("Phone").ToString.Trim
>             _sInEmployerGroupAnivDate = m.Groups("EmployerGroupAnivDate").ToString.Trim()
>             _sInHeadOfHouse = m.Groups("HeadOfHouse").ToString.Trim
>             _sInPrimaryStatus = m.Groups("PrimaryStatus").ToString.Trim
>             _sInMaritalStatus = m.Groups("MaritalStatus").ToString.Trim
>             'Console.WriteLine()
>             Console.WriteLine(i)
>             'Console.WriteLine()
>             'Console.WriteLine("_sInActionCode = " & _sInActionCode)
>             'Console.WriteLine("_sInCarrierID = " & _sInCarrierID)
>             'Console.WriteLine("_sInLastName = " & _sInLastName)
>             'Console.WriteLine("_sInFirstName = " & _sInFirstName)
>             'Console.WriteLine("_sInMiddleName = " & _sInMiddleName)
>             'Console.WriteLine("_sInAddr1 = " & _sInAddr1)
>             'Console.WriteLine("_sInAddr2 = " & _sInAddr2)
>             'Console.WriteLine("_sInCity = " & _sInCity)
>             'Console.WriteLine("_sInState = " & _sInState)
>             'Console.WriteLine("_sInZip = " & _sInZip)
>             'Console.WriteLine("_sInBenefitOption = " & _sInBenefitOption)
>             'Console.WriteLine("_sInEmployerGroup = " & _sInEmployerGroup)
>             'Console.WriteLine("_sInOptionEffDate = " & _sInOptionEffDate)
>             'Console.WriteLine("_sInHPEffDate = " & _sInHPEffDate)
>             'Console.WriteLine("_sInTermDate = " & _sInTermDate)
>             'Console.WriteLine("_sInSex = " & _sInSex)
>             'Console.WriteLine("_sInDOB = " & _sInDOB)
>             'Console.WriteLine("_sInSSN = " & _sInSSN)
>             'Console.WriteLine("_sInPhone = " & _sInPhone)
>             'Console.WriteLine("_sInEmployerGroupAnivDate = " & _sInEmployerGroupAnivDate)
>             'Console.WriteLine("_sInHeadOfHouse = " & _sInHeadOfHouse)
>             'Console.WriteLine("_sInPrimaryStatus = " & _sInPrimaryStatus)
>             'Console.WriteLine("_sInMaritalStatus = " & _sInMaritalStatus)
>         Next
>         Dim dt2 = DateTime.Now.Subtract(d).TotalSeconds
>
>         Console.WriteLine(dt2)
>         Console.ReadLine()
>     End Sub
>
> End Module
>
> Sorry the code is a little messy. But it works. Parsed 8064 records in 2
> seconds flat. Simulate some other work by outputting everything to a
> console window and it takes 34 seconds.
>
> To answer your question, RegEx uses a position marker *similar* to that
> of reading a file, where the position is incremented relative to the
> amount read (for comparison's sake). So just telling it how much to read
> is good enough.
>
> Thank you Stephany for writing that all out :)
>
> A note about your sample file: I hope fields were left blank, and things
> like HeadOfHouse is a number, otherwise this isn't working.
> Sample:
> _sInActionCode =
> _sInCarrierID = 00000050101
> _sInLastName = SMITH
> _sInFirstName = VICKI
> _sInMiddleName =
> _sInAddr1 = C/O SUE EDDY - MISD BENEFITS
> _sInAddr2 = 405 EAST DAVIS
> _sInCity = MESQUITE
> _sInState = TX
> _sInZip = 75149
> _sInBenefitOption = 001
> _sInEmployerGroup = 2002MISD
> _sInOptionEffDate = 20050301
> _sInHPEffDate = 20050301
> _sInTermDate = 20040401
> _sInSex = 20050331
> _sInDOB = 19510125
> _sInSSN = 000010009
> _sInPhone =
> _sInEmployerGroupAnivDate =
> _sInHeadOfHouse = 464088770
> _sInPrimaryStatus = P
> _sInMaritalStatus = I
>
> Let me know,
> MP

"Stephany Young" <noone@localhost> wrote in news:#TjZp1XNFHA.2132@TK2MSFTNGP14.phx.gbl:
> In your output, note that you've got some fields out of whack.
>
>> _sInHPEffDate = m.Groups("OptionEffDate").ToString.Trim
>> _sInTermDate = m.Groups("HPEffDate").ToString.Trim
>> _sInSex = m.Groups("TermDate").ToString.Trim
>
> This shows in that the display of _sInSex is not 1 character.

Sorry Stephany, post order is getting screwed up. The above is copied from one of your posts (and thank you for it again), but thanks for pointing it out; I was wondering why sex was "45738495". As for the rest of your comment, I was sent some sample data from hillcountry74, and thought I was replying under his thread. Sorry for the confusion. By the way, do you mind if I ask what field you're in?

Cheers,
MP

I'm an IT Consultant, with close to 30 years' experience in the industry.
Since 1994 I have specialised in VB-related software and have been using VB.NET and C#.NET since their first 'retail' release. I still have a few applications that I support in VB4, VB5 and VB6, but all new development is in VB.NET or C#.NET.

MP,
Posting this msg for the 2nd time.

Thanks a lot for the code. I really appreciate your help and time.

Can you please post the regular expr for this as I can't find it in the code?

As there could be files of size 400MB, reading till endoffile might not work. Instead, if it is changed to reading one line at a time, will it slow down the parsing?

Also, on the headofhouse, it can be alphanumeric. And yes, some fields could be blank.

Thanks.

MeltingPoint wrote:
> "hillcountry74" <shruth***@yahoo.com> wrote in
> news:1112201281.988499.58130@l41g2000cwc.googlegroups.com:
>
> > Stephany,
> >
> > Thanks for the code. I tried your sample, it doesn't seem to work. I'm
> > assuming _s variable is the string to be parsed and need not
> > necessarily have the fieldnames like Lastname etc, right?
> >
> > How does the regex engine know to take 26 characters for extracting
> > City and that it is not the first 26 chrs. Please explain. And excuse
> > me for my ignorance. Never used reg exprs.
>
> Imports System.Text
> Imports System.IO
> Imports System.Text.RegularExpressions
> Module Module1
>
>     Sub Main()
>
>         Dim aStreamReader As TextReader
>         aStreamReader = New StreamReader("C:\SAMPLE FILE.txt")
>         Dim _s As String = aStreamReader.ReadToEnd
>         aStreamReader.Close()
>
>         Dim _exp As String = "((?<ActionCode>.{1})" & _
>             "(?<CarrierID>.{25})" & _
>             "(?<LastName>.{60})" & _
>             "(?<FirstName>.{30})" & _
>             "(?<MiddleName>.{15})" & _
>             "(?<Addr1>.{60})" & _
>             "(?<Addr2>.{60})" & _
>             "(?<City>.{30})" & _
>             "(?<State>.{2})" & _
>             "(?<Zip>.{10})" & _
>             "(?<BenefitOption>.{60})" & _
>             "(?<EmployerGroup>.{15})" & _
>             "(?<OptionEffDate>.{8})" & _
>             "(?<HPEffDate>.{8})" & _
>             "(?<TermDate>.{8})" & _
>             "(?<Sex>.{1})" & _
>             "(?<DOB>.{8})" & _
>             "(?<SSN>.{9})" & _
>             "(?<Phone>.{12})" & _
>             "(?<EmployerGroupAnivDate>.{8})" & _
>             "(?<HeadOfHouse>.{9})" & _
>             "(?<PrimaryStatus>.{1})" & _
>             "(?<MaritalStatus>.{1}))"
>
>         Dim r As Regex = New Regex(_exp)
>
>         Dim g As MatchCollection = r.Matches(_s)
>         Dim m As Match
>
>         Dim _sInActionCode As String
>         Dim _sInCarrierID As String
>         Dim _sInLastName As String
>         Dim _sInFirstName As String
>         Dim _sInMiddleName As String
>         Dim _sInAddr1 As String
>         Dim _sInAddr2 As String
>         Dim _sInCity As String
>         Dim _sInState As String
>         Dim _sInZip As String
>         Dim _sInBenefitOption As String
>         Dim _sInEmployerGroup As String
>         Dim _sInOptionEffDate As String
>         Dim _sInHPEffDate As String
>         Dim _sInTermDate As String
>         Dim _sInSex As String
>         Dim _sInDOB As String
>         Dim _sInSSN As String
>         Dim _sInPhone As String
>         Dim _sInEmployerGroupAnivDate As String
>         Dim _sInHeadOfHouse As String
>         Dim _sInPrimaryStatus As String
>         Dim _sInMaritalStatus As String
>         Dim d As New DateTime
>         Dim dt As Double
>
>         d = DateTime.Now
>
>         For i As Int32 = 0 To g.Count - 1
>             m = g.Item(i)
>
>             _sInActionCode = m.Groups("ActionCode").ToString.Trim
>             _sInCarrierID = m.Groups("CarrierID").ToString.Trim
>             _sInLastName = m.Groups("LastName").ToString.Trim
>             _sInFirstName = m.Groups("FirstName").ToString.Trim
>             _sInMiddleName = m.Groups("MiddleName").ToString.Trim
>             _sInAddr1 = m.Groups("Addr1").ToString.Trim
>             _sInAddr2 = m.Groups("Addr2").ToString.Trim
>             _sInCity = m.Groups("City").ToString.Trim
>             _sInState = m.Groups("State").ToString.Trim
>             _sInZip = m.Groups("Zip").ToString.Trim
>             _sInBenefitOption = m.Groups("BenefitOption").ToString.Trim
>             _sInEmployerGroup = m.Groups("EmployerGroup").ToString.Trim
>             _sInOptionEffDate = m.Groups("OptionEffDate").ToString.Trim
>             _sInHPEffDate = m.Groups("OptionEffDate").ToString.Trim
>             _sInTermDate = m.Groups("HPEffDate").ToString.Trim
>             _sInSex = m.Groups("TermDate").ToString.Trim
>             _sInDOB = m.Groups("DOB").ToString.Trim
>             _sInSSN = m.Groups("SSN").ToString.Trim
>             _sInPhone = m.Groups("Phone").ToString.Trim
>             _sInEmployerGroupAnivDate = m.Groups("EmployerGroupAnivDate").ToString.Trim()
>             _sInHeadOfHouse = m.Groups("HeadOfHouse").ToString.Trim
>             _sInPrimaryStatus = m.Groups("PrimaryStatus").ToString.Trim
>             _sInMaritalStatus = m.Groups("MaritalStatus").ToString.Trim
>
>             'Console.WriteLine()
>             Console.WriteLine(i)
>             'Console.WriteLine()
>             'Console.WriteLine("_sInActionCode = " & _sInActionCode)
>             'Console.WriteLine("_sInCarrierID = " & _sInCarrierID)
>             'Console.WriteLine("_sInLastName = " & _sInLastName)
>             'Console.WriteLine("_sInFirstName = " & _sInFirstName)
>             'Console.WriteLine("_sInMiddleName = " & _sInMiddleName)
>             'Console.WriteLine("_sInAddr1 = " & _sInAddr1)
>             'Console.WriteLine("_sInAddr2 = " & _sInAddr2)
>             'Console.WriteLine("_sInCity = " & _sInCity)
>             'Console.WriteLine("_sInState = " & _sInState)
>             'Console.WriteLine("_sInZip = " & _sInZip)
>             'Console.WriteLine("_sInBenefitOption = " & _sInBenefitOption)
>             'Console.WriteLine("_sInEmployerGroup = " & _sInEmployerGroup)
>             'Console.WriteLine("_sInOptionEffDate = " & _sInOptionEffDate)
>             'Console.WriteLine("_sInHPEffDate = " & _sInHPEffDate)
>             'Console.WriteLine("_sInTermDate = " & _sInTermDate)
>             'Console.WriteLine("_sInSex = " & _sInSex)
>             'Console.WriteLine("_sInDOB = " & _sInDOB)
>             'Console.WriteLine("_sInSSN = " & _sInSSN)
>             'Console.WriteLine("_sInPhone = " & _sInPhone)
>             'Console.WriteLine("_sInEmployerGroupAnivDate = " & _sInEmployerGroupAnivDate)
>             'Console.WriteLine("_sInHeadOfHouse = " & _sInHeadOfHouse)
>             'Console.WriteLine("_sInPrimaryStatus = " & _sInPrimaryStatus)
>             'Console.WriteLine("_sInMaritalStatus = " & _sInMaritalStatus)
>         Next
>         Dim dt2 = DateTime.Now.Subtract(d).TotalSeconds
>
>         Console.WriteLine(dt2)
>         Console.ReadLine()
>     End Sub
>
> End Module
>
> Sorry the code is a little messy. But it works. Parsed 8064 Records in 2
> seconds flat. Simulate some other work by outputting everything to a
> console window and it takes 34 seconds.
>
> To answer your question, RegEx uses a position marker *similar* to that
> of reading a file, where the position is incremented relative to the
> amount read (for comparison's sake). So just telling it how much to read
> is good enough.
>
> Thank you Stephany for writing that all out :)
>
> A note about your sample file: I hope fields were left blank, and things
> like HeadOfHouse is a number, otherwise this isn't working.
> Sample:
> _sInActionCode =
> _sInCarrierID = 00000050101
> _sInLastName = SMITH
> _sInFirstName = VICKI
> _sInMiddleName =
> _sInAddr1 = C/O SUE EDDY - MISD BENEFITS
> _sInAddr2 = 405 EAST DAVIS
> _sInCity = MESQUITE
> _sInState = TX
> _sInZip = 75149
> _sInBenefitOption = 001
> _sInEmployerGroup = 2002MISD
> _sInOptionEffDate = 20050301
> _sInHPEffDate = 20050301
> _sInTermDate = 20040401
> _sInSex = 20050331
> _sInDOB = 19510125
> _sInSSN = 000010009
> _sInPhone =
> _sInEmployerGroupAnivDate =
> _sInHeadOfHouse = 464088770
> _sInPrimaryStatus = P
> _sInMaritalStatus = I
>
> Let me know,
> MP
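On the question of reading one line at a time instead of calling ReadToEnd: a minimal sketch of that approach, assuming each record sits on its own line. The three-field layout, `DemoPattern` and `ParseLines` are made up for illustration and are not the real record layout; a `StringReader` stands in for the `StreamReader` so the sketch runs without a file. Only one line is held in memory at a time, so file size should not matter much for memory use.

```vbnet
Imports System
Imports System.IO
Imports System.Text.RegularExpressions

Module LineAtATimeSketch

    ' Hypothetical three-field layout (2 + 10 + 8 = 20 characters), not the real record.
    Private ReadOnly DemoPattern As New Regex( _
        "^(?<State>.{2})(?<Zip>.{10})(?<DOB>.{8})", RegexOptions.Compiled)

    ' Reads one record per line and returns how many lines matched the layout.
    Function ParseLines(ByVal reader As TextReader) As Integer
        Dim matched As Integer = 0
        Dim line As String = reader.ReadLine()
        Do While Not line Is Nothing
            Dim m As Match = DemoPattern.Match(line)
            If m.Success Then
                matched += 1
                Console.WriteLine("{0} | {1} | {2}", _
                    m.Groups("State").Value.Trim(), _
                    m.Groups("Zip").Value.Trim(), _
                    m.Groups("DOB").Value.Trim())
            End If
            line = reader.ReadLine()
        Loop
        Return matched
    End Function

    Sub Main()
        ' Two made-up lines: one full-length record and one that is too short to match.
        Dim sample As String = "TX75149" & New String(" "c, 5) & "19510125" & vbCrLf & _
            "too short" & vbCrLf
        Dim reader As New StringReader(sample)
        Console.WriteLine("Matched lines: {0}", ParseLines(reader))
        reader.Close()
    End Sub

End Module
```

With a real file, `New StreamReader(path)` can be passed to `ParseLines` in place of the `StringReader`. Whether this is faster or slower than block reads would have to be measured on the actual data.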
"hillcountry74" <shruth***@yahoo.com> wrote in
news:1112224455.933751.78520@l41g2000cwc.googlegroups.com:

> MP,
> Posting this msg for the 2nd time.
>
> Thanks a lot for the code. I really appreciate your help and time.
>
> Can you please post the regular expr for this as I can't find it in the
> code?
>
> As there could be files of size 400MB, reading till endoffile might not
> work. Instead, if it is changed to reading one line at a time, will it
> slow down the parsing?
>
> Also, on the headofhouse, it can be alphanumeric. And yes, some fields
> could be blank.
>
> Thanks.

I just thought of something. How can you be using ReadLine if there are no delimiters? The answer is: there is a delimiter, a carriage return at the end of every record. However, I don't think this helps the regex thing. But 'just so ya know', a delimiter can be anything, not just a comma! :)

I'm just cleaning up the code so you can read in chunks at a time (1.5 gigs of RAM wasn't even enough to read in 400 MB of text); will post back soon.

By the by, are you using VB.NET?

MP

Yes, you are correct, _s is a string to simulate a record read from your input file, and my use of fieldnames etc. was just to create some placeholders. Where it says e.g. <7 spaces>, you need to replace that bit (including the angle brackets) with that number of space characters. Because of the way newsgroup readers wrap text etc., it was difficult to show the actual values.

The numbers in the curly brackets, e.g. {8}, in the regex expression tell it how many character positions each element takes up. If you add them all up you will find that they come to 442, which is why your input record must be a minimum of 442 characters long. If it is shorter then it won't work. If it is longer then only the first 442 characters are utilised.

I have been assuming that your input records are, in fact, fixed-length fields in a fixed-length record. Is this actually the case, or are there some records that are shorter? It would be nice if you could show a sample record (doctored to hide sensitive information of course).

In addition, what did you find about my other comments regarding the stray 'unicode' character etc.?

"hillcountry74" <shruth***@yahoo.com> wrote in message
news:1112201281.988499.58130@l41g2000cwc.googlegroups.com...
> Stephany,
>
> Thanks for the code. I tried your sample, it doesn't seem to work. I'm
> assuming _s variable is the string to be parsed and need not
> necessarily have the fieldnames like Lastname etc, right?
>
> How does the regex engine know to take 26 characters for extracting
> City and that it is not the first 26 chrs. Please explain. And excuse
> me for my ignorance. Never used reg exprs.

"hillcountry74" <shruth***@yahoo.com> wrote in
news:1112110584.467762.175290@o13g2000cwo.googlegroups.com:

Some good ideas so far. I've started to put the regex expression together for you, could have it done in a few hours. If you want to send me one of these files (important info changed of course), I could fine-tune the expression.

macmanic(zero)(zero)atHotmail.com

Note to anyone else reading this thread: any ideas on the speed of regex as opposed to Substring/IndexOf? I can say for sure that I've parsed a 4 MB file with regex in a few hundred milliseconds.

++Just saw Stef's comment. I'm not sure what difference it makes as to whether it's fixed or not. RegEx still works and it's a lot easier on the eyes :)

((?<ActionCode>.)
(?<CarrierID>\d{0,25})
(?<LastName>\w{0,60}\s*\b)
(?<FirstName>\w{0,30}\s*\b)
(?<MiddleName>\w{0,15}\s*\b)
(?<Addr1>.{0,60}\s*\b)
(?<Addr2>.{0,60}\s*\b)
(?<City>.{0,30}\s*\b)
(?<State>.{0,2}\s*)
(?<Zip>.{0,10}\s*\b))

Actually the fact that it's fixed makes it easier. And a note as to how close I was paying attention: sPreferredInputFile.Trim.Length >= 439 does not 'allude' to me that it is totally fixed. However, I don't know Stef, she probably knows more than me, considering I just started using RegEx a month ago. But the above Regex does match the following:

e8374837463784958473627495Sc9ott 8nglis Micheal 554 sdf sdf 667 rtert ertwert Hell FL90210

Which I think is what the record looks like (at least so far).

Let me know, both of you :)
MP

Sorry about all the posts (xnews acting up)
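For comparison with the regex route, here is a minimal sketch of the Substring way of cutting a fixed-width record. The widths are copied, in order, from the `.{n}` pattern MeltingPoint posted in this thread (they total 441; other layouts mentioned in the thread may differ). `Widths`, `TotalWidth`, `SliceRecord` and the made-up test record are illustrative names and data, not anyone's actual code. It makes no claim about which approach is faster.

```vbnet
Imports System

Module SubstringSketch

    ' Widths copied, in order, from the fixed-width pattern quoted in this thread.
    Private ReadOnly Widths() As Integer = { _
        1, 25, 60, 30, 15, 60, 60, 30, 2, 10, 60, 15, _
        8, 8, 8, 1, 8, 9, 12, 8, 9, 1, 1}

    ' Adds up the widths to get the minimum usable record length.
    Function TotalWidth() As Integer
        Dim total As Integer = 0
        For i As Integer = 0 To Widths.Length - 1
            total += Widths(i)
        Next
        Return total
    End Function

    ' Cuts one record into trimmed fields by walking the width table.
    Function SliceRecord(ByVal record As String) As String()
        If record.Length < TotalWidth() Then
            Throw New ArgumentException("Record is shorter than the layout.")
        End If
        Dim fields(Widths.Length - 1) As String
        Dim offset As Integer = 0
        For i As Integer = 0 To Widths.Length - 1
            fields(i) = record.Substring(offset, Widths(i)).Trim()
            offset += Widths(i)
        Next
        Return fields
    End Function

    Sub Main()
        ' Made-up record: action code, carrier ID, last name, first name, rest blank.
        Dim record As String = "A" & "00000050101".PadRight(25) & _
            "SMITH".PadRight(60) & "VICKI".PadRight(30)
        record = record.PadRight(TotalWidth())

        Dim fields() As String = SliceRecord(record)
        Console.WriteLine("Total width = {0}", TotalWidth())
        Console.WriteLine("CarrierID   = {0}", fields(1))
        Console.WriteLine("LastName    = {0}", fields(2))
        Console.WriteLine("FirstName   = {0}", fields(3))
    End Sub

End Module
```

Keeping the layout in one width table means a change to a field length is made in a single place, which is the same convenience the named groups give the regex version.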
<lots o code>

Some good ideas so far. I've started to put the regex expression together for you, could have it done in a few hours. If you want to send me one of these files (important info changed of course), I could fine-tune the expression.

macmanic(zero)(zero)atHotmail.com

Note to anyone else reading this thread: any ideas on the speed of regex as opposed to Substring/IndexOf? I can say for sure that I've parsed a 4 MB file with regex in a few hundred milliseconds.

++Just saw Stef's comment. I'm not sure what difference it makes as to whether it's fixed or not. RegEx still works and it's a lot easier on the eyes :)

((?<ActionCode>.)
(?<CarrierID>\d{0,25})
(?<LastName>\w{0,60}\s*\b)
(?<FirstName>\w{0,30}\s*\b)
(?<MiddleName>\w{0,15}\s*\b)
(?<Addr1>.{0,60}\s*\b)
(?<Addr2>.{0,60}\s*\b)
(?<City>.{0,30}\s*\b)
(?<State>.{0,2}\s*)
(?<Zip>.{0,10}\s*\b))

Actually the fact that it's fixed makes it easier. And a note as to how close I was paying attention: sPreferredInputFile.Trim.Length >= 439 does not 'allude' to me that it is totally fixed. However, I don't know Stef, she probably knows more than me, considering I just started using RegEx a month ago. But the above Regex does match the following:

e8374837463784958473627495Tomlin Nilmot Micheal 554 Some Street 667 Some Other Street Hell FL90210

Which I think is what the record looks like (at least so far).

Let me know, both of you :)
MP

"hillcountry74" <shruth***@yahoo.com> wrote in
news:1112049500.118960.294240@f14g2000cwb.googlegroups.com:

Inside the IDE - Not being displayed to console.
Processed 83265 records in 5.6875 seconds. At 2 Records per pass.
Processed 83265 records in 4.28125 seconds. At 20 Records per pass.
Processed 83265 records in 4.046875 seconds. At 50 Records per pass.
Processed 83265 records in 4.046875 seconds. At 75 Records per pass.
Processed 83265 records in 4.765625 seconds. At 100 Records per pass.
Breaking Point Reached.

Compiled Application
Processed 83265 records in 3.53125 seconds. At 75 Records per pass.
Processed 83265 records in 3.625 seconds. At 100 Records per pass.
Processed 83265 records in 3.59375 seconds. At 200 Records per pass.
Processed 83265 records in 3.609375 seconds. At 500 Records per pass.
Processed 83265 records in 3.625 seconds. At 1000 Records per pass.
Processed 83265 records in 3.609375 seconds. At 10000 Records per pass.
Processed 83265 records in 3.59375 seconds. At 50000 Records per pass.

You be the judge.
Here's the source code.
Let me know if you need help with the verify routines.

Imports System.Text
Imports System.IO
Imports System.Text.RegularExpressions

Module Module1

    Sub Main()
        'File path and number of records to parse per pass
        ReadAndParse("C:\SAMPLE FILE.txt", 1000)
    End Sub

#Region " Expression Definition "
    Dim _exp As String = "((?<ActionCode>.{1})" & _
        "(?<CarrierID>.{25})" & _
        "(?<LastName>.{60})" & _
        "(?<FirstName>.{30})" & _
        "(?<MiddleName>.{15})" & _
        "(?<Addr1>.{60})" & _
        "(?<Addr2>.{60})" & _
        "(?<City>.{30})" & _
        "(?<State>.{2})" & _
        "(?<Zip>.{10})" & _
        "(?<BenefitOption>.{60})" & _
        "(?<EmployerGroup>.{15})" & _
        "(?<OptionEffDate>.{8})" & _
        "(?<HPEffDate>.{8})" & _
        "(?<TermDate>.{8})" & _
        "(?<Sex>.{1})" & _
        "(?<DOB>.{8})" & _
        "(?<SSN>.{9})" & _
        "(?<Phone>.{12})" & _
        "(?<EmployerGroupAnivDate>.{8})" & _
        "(?<HeadOfHouse>.{9})" & _
        "(?<PrimaryStatus>.{1})" & _
        "(?<MaritalStatus>.{1}))"
#End Region

#Region " Label Definitions "
    Dim _sInActionCode As String
    Dim _sInCarrierID As String
    Dim _sInLastName As String
    Dim _sInFirstName As String
    Dim _sInMiddleName As String
    Dim _sInAddr1 As String
    Dim _sInAddr2 As String
    Dim _sInCity As String
    Dim _sInState As String
    Dim _sInZip As String
    Dim _sInBenefitOption As String
    Dim _sInEmployerGroup As String
    Dim _sInOptionEffDate As String
    Dim _sInHPEffDate As String
    Dim _sInTermDate As String
    Dim _sInSex As String
    Dim _sInDOB As String
    Dim _sInSSN As String
    Dim _sInPhone As String
    Dim _sInEmployerGroupAnivDate As String
    Dim _sInHeadOfHouse As String
    Dim _sInPrimaryStatus As String
    Dim _sInMaritalStatus As String
#End Region

#Region " Timing "
    Dim startTime As New DateTime
    Dim finishTime As Double
#End Region

    Sub ReadAndParse(ByVal inFilePath As String, ByVal numRecordsPerBlock As Int32)
        'Record length on disk: 441 field characters plus, presumably, a 2-character CR/LF
        Const RECORD_SIZE As Int32 = 443
        Dim inputFile As New FileInfo(inFilePath)
        Dim inputFileLen As Int64 = inputFile.Length
        Dim iterations As Int32
        Dim bytesPerIteration As Int32
        Dim totalRecords As Int32
        Dim moreRecords As Boolean

        'Verify Length
        If inputFileLen Mod RECORD_SIZE <> 0 Then
            Throw New ApplicationException("File Length Error")
        End If

        'Bytes(records) per loop
        bytesPerIteration = numRecordsPerBlock * RECORD_SIZE
        'Figure out how many times to loop
        iterations = CInt(inputFileLen \ bytesPerIteration)
        'Check to see if we got lucky: is anything left after the whole blocks?
        moreRecords = (CLng(iterations) * bytesPerIteration <> inputFileLen)
        'reset total records
        totalRecords = 0

        'Get input stream (ReadBlock counts characters, so this assumes one byte per character)
        Dim inStream As New StreamReader(inputFile.FullName)
        Dim inputBlock As String
        'Array upper bound is inclusive, hence the - 1
        Dim buf(bytesPerIteration - 1) As Char

        'Set up regex
        Dim regExp As New Regex(_exp, RegexOptions.Compiled) ' I think this speeds it up
        Dim mc As MatchCollection
        Dim record As Match

        'Set up and loop
        startTime = Now()
        For i As Int32 = 1 To iterations
            inStream.ReadBlock(buf, 0, bytesPerIteration)
            inputBlock = New String(buf)

            'Parse it
            mc = regExp.Matches(inputBlock)
            For j As Int32 = 0 To mc.Count - 1
                record = mc.Item(j)
                'Verify record proc here
                totalRecords += 1
                ExtractFields(record)
                'REMOVE
                DisplayToConsole(record)
                'END REMOVE
            Next
        Next

        'One last time through
        If moreRecords Then
            inputBlock = inStream.ReadToEnd() 'Finish off reading
            inStream.Close()
            mc = regExp.Matches(inputBlock)
            For j As Int32 = 0 To mc.Count - 1
                record = mc.Item(j)
                'Verify record proc here
                totalRecords += 1
                ExtractFields(record)
                'REMOVE
                DisplayToConsole(record)
                'END REMOVE
            Next
        Else
            inStream.Close()
        End If

        finishTime = DateTime.Now.Subtract(startTime).TotalSeconds
        Console.WriteLine()
        Console.WriteLine("Processed {0} records in {1} seconds.", totalRecords, finishTime)
        Console.ReadLine()
    End Sub

    'Copies the captured groups of one record into the module-level fields
    Sub ExtractFields(ByVal record As Match)
        _sInActionCode = record.Groups("ActionCode").ToString.Trim
        _sInCarrierID = record.Groups("CarrierID").ToString.Trim
        _sInLastName = record.Groups("LastName").ToString.Trim
        _sInFirstName = record.Groups("FirstName").ToString.Trim
        _sInMiddleName = record.Groups("MiddleName").ToString.Trim
        _sInAddr1 = record.Groups("Addr1").ToString.Trim
        _sInAddr2 = record.Groups("Addr2").ToString.Trim
        _sInCity = record.Groups("City").ToString.Trim
        _sInState = record.Groups("State").ToString.Trim
        _sInZip = record.Groups("Zip").ToString.Trim
        _sInBenefitOption = record.Groups("BenefitOption").ToString.Trim
        _sInEmployerGroup = record.Groups("EmployerGroup").ToString.Trim
        _sInOptionEffDate = record.Groups("OptionEffDate").ToString.Trim
        _sInHPEffDate = record.Groups("HPEffDate").ToString.Trim
        _sInTermDate = record.Groups("TermDate").ToString.Trim
        _sInSex = record.Groups("Sex").ToString.Trim
        _sInDOB = record.Groups("DOB").ToString.Trim
        _sInSSN = record.Groups("SSN").ToString.Trim
        _sInPhone = record.Groups("Phone").ToString.Trim
        _sInEmployerGroupAnivDate = record.Groups("EmployerGroupAnivDate").ToString.Trim()
        _sInHeadOfHouse = record.Groups("HeadOfHouse").ToString.Trim
        _sInPrimaryStatus = record.Groups("PrimaryStatus").ToString.Trim
        _sInMaritalStatus = record.Groups("MaritalStatus").ToString.Trim
    End Sub

    Sub DisplayToConsole(ByVal record As Match)
        ExtractFields(record)

        Console.WriteLine()
        Console.WriteLine("_sInActionCode = " & _sInActionCode)
        Console.WriteLine("_sInCarrierID = " & _sInCarrierID)
        Console.WriteLine("_sInLastName = " & _sInLastName)
        Console.WriteLine("_sInFirstName = " & _sInFirstName)
        Console.WriteLine("_sInMiddleName = " & _sInMiddleName)
        Console.WriteLine("_sInAddr1 = " & _sInAddr1)
        Console.WriteLine("_sInAddr2 = " & _sInAddr2)
        Console.WriteLine("_sInCity = " & _sInCity)
        Console.WriteLine("_sInState = " & _sInState)
        Console.WriteLine("_sInZip = " & _sInZip)
        Console.WriteLine("_sInBenefitOption = " & _sInBenefitOption)
        Console.WriteLine("_sInEmployerGroup = " & _sInEmployerGroup)
        Console.WriteLine("_sInOptionEffDate = " & _sInOptionEffDate)
        Console.WriteLine("_sInHPEffDate = " & _sInHPEffDate)
        Console.WriteLine("_sInTermDate = " & _sInTermDate)
        Console.WriteLine("_sInSex = " & _sInSex)
        Console.WriteLine("_sInDOB = " & _sInDOB)
        Console.WriteLine("_sInSSN = " & _sInSSN)
        Console.WriteLine("_sInPhone = " & _sInPhone)
        Console.WriteLine("_sInEmployerGroupAnivDate = " & _sInEmployerGroupAnivDate)
        Console.WriteLine("_sInHeadOfHouse = " & _sInHeadOfHouse)
        Console.WriteLine("_sInPrimaryStatus = " & _sInPrimaryStatus)
        Console.WriteLine("_sInMaritalStatus = " & _sInMaritalStatus)
    End Sub

End Module

Using your code, I am seeing similar results give or take a few
milliseconds. I noticed that the best results were achieved using about 40 records per block. Maybe RegEx uses some sort of optimisation based on around about 16KB.

We obviously have differing amounts of RAM because I can do a ReadToEnd on a 200 MB+ file quite happily. My machine spits its dummy at just over 260 MB. That level, of course, will vary depending on whatever else is running at the time.

Using the ReadLine method on an 83265 record file, and using different combinations of Mid, Trim, String.Substring and String.Trim, I am seeing results of between 1.5 and 2 seconds to parse the entire file, depending on the combination. The difference between running it compiled for debug configuration in the IDE and release configuration is insignificant (less than 100 milliseconds).

Unfortunately you haven't provided any comparative results for your machine.

The evidence I see is that RegEx is actually a poor performer compared to more conventional string parsing in this particular case.

I am still of the opinion that the 'perceived slowness' is in one of the other functions that is called on a per-record basis rather than in the file IO/record parsing area per se.

"MeltingPoint" <n***@all.com> wrote in message
news:PfednS7FVvZH-dbfRVn-1Q@rogers.com...
> "hillcountry74" <shruth***@yahoo.com> wrote in
> news:1112049500.118960.294240@f14g2000cwb.googlegroups.com:
>
> Inside the IDE - Not being displayed to console.
> Processed 83265 records in 5.6875 seconds. At 2 Records per pass.
> Processed 83265 records in 4.28125 seconds. At 20 Records per pass.
> Processed 83265 records in 4.046875 seconds. At 50 Records per pass.
> Processed 83265 records in 4.046875 seconds. At 75 Records per pass.
> Processed 83265 records in 4.765625 seconds. At 100 Records per pass.
> Breaking Point Reached.
>
> Compiled Application
> Processed 83265 records in 3.53125 seconds. At 75 Records per pass.
> Processed 83265 records in 3.625 seconds. At 100 Records per pass.
> Processed 83265 records in 3.59375 seconds. At 200 Records per pass.
> Processed 83265 records in 3.609375 seconds. At 500 Records per pass.
> Processed 83265 records in 3.625 seconds. At 1000 Records per pass.
> Processed 83265 records in 3.609375 seconds. At 10000 Records per pass.
> Processed 83265 records in 3.59375 seconds. At 50000 Records per pass.
>
> You be the judge.
> Heres the source code.
> Let me know if you need help with the verify routines.
>
> <lots o code>

"Stephany Young" <noone@localhost> wrote in
news:#8CTfEbNFHA.3296@TK2MSFTNGP15.phx.gbl:

So...... are we saying that the core of his parsing code was the fastest to begin with, give or take a trim? I won't argue that. I started with the impression that that many reads was causing a significant overhead, and looked for a solution that required fewer reads. As far as comparative results, I think we have them: if your best time was 1.5 and mine 3.5 then the results are in :) For a box to box test, just post the code and I'll run it here and let you know the results (and hillcountry if he's still reading this thread :)

Night,
MP

Hi, MeltingPoint.

I attempted to send you an email about 18 hours ago. The address I used is one I interpreted from an earlier post in this thread. Did you get it, or did I misinterpret the address?

"MeltingPoint" <n***@all.com> wrote in message
news:0J6dnR_gUoZzCNbfRVn-rA@rogers.com...
> "Stephany Young" <noone@localhost> wrote in
> news:#8CTfEbNFHA.3296@TK2MSFTNGP15.phx.gbl:
>
> So...... are we saying that the core of his parsing code was the fastest
> to begin with, give or take a trim. I won't argue that. I started with the
> impression that that many reads was causing a significant overhead, and
> looked for a solution that required less reads. As far as comparitive
> results, I think we have them, if your best time was 1.5 and mine 3.5 then
> the results are in:) For a box to box test, just post the code and I'll
> run it here and let you know the results (and hillcountry if he's still
> reading this thread:)
>
> Night,
> MP
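For anyone wanting to sanity-check the block arithmetic behind the "records per pass" timings, here is a small sketch. It assumes the test file is exactly 83265 records of 443 bytes each, which is inferred from the record count and record size quoted in the thread rather than stated outright; the variable names are illustrative.

```vbnet
Imports System

Module BlockMathSketch

    Sub Main()
        ' Figures taken from the thread: 443 bytes per record, 83265 records, 1000 records per pass.
        Const RecordSize As Integer = 443
        Dim fileLength As Long = 83265L * RecordSize
        Dim recordsPerBlock As Integer = 1000

        ' Size of one pass in bytes, and how many whole passes fit in the file.
        Dim bytesPerBlock As Integer = recordsPerBlock * RecordSize
        Dim fullBlocks As Integer = CInt(fileLength \ bytesPerBlock)

        ' Whatever is left after the whole passes must be handled by a final read.
        Dim leftoverBytes As Long = fileLength - CLng(fullBlocks) * bytesPerBlock
        Dim leftoverRecords As Long = leftoverBytes \ RecordSize

        Console.WriteLine("File length:      {0} bytes", fileLength)
        Console.WriteLine("Full blocks:      {0}", fullBlocks)
        Console.WriteLine("Leftover records: {0}", leftoverRecords)
        Console.WriteLine("Needs final pass: {0}", leftoverBytes > 0)
    End Sub

End Module
```

Under those assumptions the file is 36886395 bytes, which gives 83 full blocks of 1000 records and 265 records left for the final pass. The leftover test has to compare the file length against whole blocks (blocks times bytes per block), not against blocks times the record size.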