Parse text into words?

I need a very efficient way to parse large amounts of text (GBs) on word boundaries. Words will then be added to an array as long as they haven't already been added. Splitting on a space is a bit too basic since punctuation will remain. Maybe regex?

Thanks for any insights.

Jim

Why can't you split on the space and replace the punctuation (since there will only be a limited number of punctuation types) with nothing? This seems to be the most efficient and simple way to do it.

    Dim x As String = veryLargeString

    ' Replace each punctuation mark that is followed by a space with a single space
    x = x.Replace(", ", " ")
    x = x.Replace(". ", " ")
    x = x.Replace(": ", " ")
    x = x.Replace("; ", " ")

    ' Split the cleaned text on spaces
    Dim y As String() = x.Split(" "c)

Scott,
In the past I have suggested this as a kind of seventh alternative (more for fun). It works, but it is slow with huge strings, even slower than Regex. (We tested this once in this newsgroup; maybe you remember it now that I mention it.)

Cor

Jim,
If I understand you correctly, the combination of the VB function InStr and a SortedList will be the quickest way to achieve what you want.

You go through your text in a loop; each time a word is found, you update the starting point for InStr and set the word you found as the key of the key/value pair in the SortedList.

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vblr7/html/vafctinstr.asp

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfsystemcollectionssortedlistclasstopic.asp

One thing you can be sure of with Regex: it will probably take at least 50 times longer than the approach above.

I hope this helps,

Cor

Hi Cor,
Thanks for the tip. I was always under the impression that doing string parsing in a loop was very inefficient, and that regex was the "enlightened" way.

My first hunch would have been to:

1) replace punctuation with spaces
2) split on spaces
3) step through the array one by one, doing a binary search on a sorted array.

Maybe I should go down this brute force route.

Thanks,

Jim

Jim,
Is it essential that ALL words are added to your array? If not, you could probably optimise this by only processing the first few GB, maybe checking how many words have been added for each GB, or every 10000 words, or whatever.

My bet is that you will quite quickly find that you are adding very few words, and that these will be highly specialized ones; therefore you only need to read the first few GB.

hth

guy

I need a list of unique words among all documents. Since many of the documents will contain technical terms, it's likely that a new term will pop up now and then.
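
For what it's worth, here is a minimal sketch of the unique-word pass discussed in this thread. It is not anyone's posted solution and it makes several assumptions: a "word" is taken to be any run of letters or digits, words are compared case-insensitively, and the text is read one line at a time so that GBs of input never have to sit in a single string. It also swaps in a HashSet(Of String) where the thread mentions a SortedList or a sorted array with binary search; HashSet arrived in .NET 3.5 and StringBuilder.Clear in .NET 4, so this needs a newer framework than the one current when the thread was written. The names UniqueWords and CollectUniqueWords are made up for illustration.

```vb
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Text

Module UniqueWords

    ' Reads text one line at a time and collects each distinct word.
    ' Assumption: a "word" is a run of letters or digits; anything else
    ' (spaces, punctuation, line breaks) ends the current word.
    Function CollectUniqueWords(reader As TextReader) As HashSet(Of String)
        ' Case-insensitive set, so "Text" and "TEXT" count as one word.
        Dim words As New HashSet(Of String)(StringComparer.OrdinalIgnoreCase)
        Dim current As New StringBuilder()
        Dim line As String = reader.ReadLine()

        While line IsNot Nothing
            For Each c As Char In line
                If Char.IsLetterOrDigit(c) Then
                    current.Append(c)
                ElseIf current.Length > 0 Then
                    ' Add is a no-op when the word is already in the set.
                    words.Add(current.ToString())
                    current.Clear()
                End If
            Next

            ' The end of a line also ends the current word.
            If current.Length > 0 Then
                words.Add(current.ToString())
                current.Clear()
            End If

            line = reader.ReadLine()
        End While

        Return words
    End Function

    Sub Main()
        ' Small in-memory sample; for real files pass a StreamReader instead.
        Using reader As New StringReader("Parse text, into words. Parse: TEXT; again!")
            Dim words As HashSet(Of String) = CollectUniqueWords(reader)
            Console.WriteLine(words.Count) ' prints 5
        End Using
    End Sub

End Module
```

For a file on disk, the same function can be given a New StreamReader(path), once per document, merging the results. Note that with this definition of a word, apostrophes and hyphens split words ("haven't" becomes "haven" and "t"), so the character test would need adjusting if that matters for technical terms.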