
Parse text into words?

Author
22 Jun 2006 9:10 PM
jim_adams
I need a very efficient way to parse large amounts of text (GBs) on
word boundaries.  Words will then be added to an array as long as they
haven't already been added.  Splitting on a space is a bit too basic
since punctuation will remain.  Maybe regex?

Thanks for any insights.

Jim

Author
22 Jun 2006 11:08 PM
Scott M.
Why can't you split on the space and strip out the punctuation (since there
are only a limited number of punctuation types)?  This seems to be the
simplest and most efficient way to do it.

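Dim x As String = veryLargeString

' Replace each punctuation mark (plus its trailing space) with a single space
x = x.Replace(", ", " ")
x = x.Replace(". ", " ")
x = x.Replace(": ", " ")
x = x.Replace("; ", " ")

' Split on spaces to get the individual words
Dim y As String() = x.Split(" "c)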
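
A possible variant of the replace-then-split idea, offered only as a sketch: String.Split accepts an array of separator characters, so the punctuation marks can be treated as delimiters directly instead of running several Replace passes, each of which copies the whole string. The helper name SplitWords and the choice of separators are assumptions for illustration, not something from the thread.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module SplitSketch
    ' Hypothetical helper: splits on whitespace and a fixed set of punctuation marks.
    Function SplitWords(ByVal text As String) As String()
        ' Assumed separator set; extend it to match the punctuation in the real data.
        Dim separators As Char() = {" "c, ","c, "."c, ":"c, ";"c, _
                                    ControlChars.Tab, ControlChars.Cr, ControlChars.Lf}
        ' RemoveEmptyEntries drops the empty strings produced by adjacent separators.
        Return text.Split(separators, StringSplitOptions.RemoveEmptyEntries)
    End Function

    Sub Main()
        For Each word As String In SplitWords("one, two: three; four. five")
            Console.WriteLine(word)
        Next
    End Sub
End Module
```

Unlike the Replace calls above, which only touch punctuation followed by a space, this splits on the listed characters wherever they occur, so a token such as "3.14" would be cut in two.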

Author
23 Jun 2006 5:11 AM
Cor Ligthert [MVP]
Scott,

In the past I have suggested this as a kind of seventh alternative (more for
fun).

It works, but it is slow with huge strings, even slower than Regex.

(We tested this once in this newsgroup; maybe you will remember it now that
I mention it.)

Cor

Author
23 Jun 2006 5:08 AM
Cor Ligthert [MVP]
Jim,

If I understand you correctly, the combination of the VB InStr function and
a SortedList will be the quickest way to achieve what you want.

You then go through your text in a loop: each time InStr finds a match, you
update the starting point for the next InStr call and store the word you
found as the key of a dictionary entry in the SortedList.
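
A minimal sketch of how such a loop might look, assuming words are separated by single spaces and that punctuation has already been dealt with. The helper name CollectWords is hypothetical, and this is one reading of the suggestion rather than Cor's actual code.

```vbnet
Imports System
Imports System.Collections
Imports Microsoft.VisualBasic

Module InStrSketch
    ' Hypothetical helper: collects unique space-delimited words as SortedList keys.
    Function CollectWords(ByVal text As String) As SortedList
        Dim words As New SortedList()
        Dim start As Integer = 1                      ' InStr positions are 1-based
        While start <= text.Length
            Dim pos As Integer = InStr(start, text, " ")
            If pos = 0 Then pos = text.Length + 1     ' no more spaces: take the rest
            If pos > start Then                       ' skip empty runs between spaces
                Dim word As String = Mid(text, start, pos - start)
                ' Only the key matters here, so the value is left empty.
                If Not words.ContainsKey(word) Then words.Add(word, Nothing)
            End If
            start = pos + 1                           ' resume just after the space
        End While
        Return words
    End Function

    Sub Main()
        Dim words As SortedList = CollectWords("the quick fox and the lazy dog")
        For Each key As String In words.Keys
            Console.WriteLine(key)
        Next
    End Sub
End Module
```

The SortedList keeps its keys in order, so ContainsKey can use a binary search, and the text is scanned once without building an intermediate array of all tokens.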

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vblr7/html/vafctinstr.asp

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfsystemcollectionssortedlistclasstopic.asp

With Regex you can be sure of one thing: it will probably take at least 50
times longer than the approach above.

I hope this helps,

Cor


Author
23 Jun 2006 3:38 PM
jim_adams
Hi Cor,

Thanks for the tip.  I was always under the impression that doing
string parsing in a loop was very inefficient, and that regex was the
"enlightened" way.

My first hunch would have been to:

1) replace punctuation with spaces
2) split on spaces
3) step through the array one by one, doing a BinarySearch against a sorted
array of the words found so far.
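
The three steps above could be sketched roughly as follows. The helper name UniqueWords, the punctuation set, and the use of List(Of String) with an ordinal comparer are assumptions made for illustration.

```vbnet
Imports System
Imports System.Collections.Generic

Module BinarySearchSketch
    ' Hypothetical helper: returns the unique words of text in ordinal sort order.
    Function UniqueWords(ByVal text As String) As List(Of String)
        ' 1) Replace punctuation with spaces (assumed punctuation set).
        For Each mark As Char In ",.:;"
            text = text.Replace(mark, " "c)
        Next
        ' 2) Split on spaces, dropping empty entries from doubled spaces.
        Dim tokens As String() = text.Split(New Char() {" "c}, _
                                            StringSplitOptions.RemoveEmptyEntries)
        ' 3) Binary-search the sorted list; insert only words not yet present.
        Dim unique As New List(Of String)()
        For Each token As String In tokens
            Dim index As Integer = unique.BinarySearch(token, StringComparer.Ordinal)
            ' A negative result is the bitwise complement of the insertion point.
            If index < 0 Then unique.Insert(Not index, token)
        Next
        Return unique
    End Function

    Sub Main()
        For Each word As String In UniqueWords("red, green. red; blue: green")
            Console.WriteLine(word)
        Next
    End Sub
End Module
```

Each lookup is logarithmic, but every Insert shifts the elements after it, so the cost of adding a new word grows with the size of the list. A hash-based collection such as Dictionary(Of String, Boolean) avoids the shifting if sorted order is not needed during the scan.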

Maybe I should go down this brute force route.

Thanks,

Jim


Author
23 Jun 2006 1:43 PM
guy
Jim,
is it essential that ALL words are added into your array? if not you could
probably optimise this by only doing the first few GB, maybe check to see how
many words have been added for each GB or 10000 words or whatever.

my bet is that you will quite quickly find that you are adding very few
words, and these will be highly specialized ones; therefore you only need to
read the first few GB
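
One way to check this on real data is to count how many previously unseen words each batch contributes. The sketch below is only illustrative: the helper name NewWordsPerBatch and its batch-size parameter are hypothetical, and it assumes the words have already been extracted.

```vbnet
Imports System
Imports System.Collections.Generic

Module GrowthSketch
    ' Hypothetical helper: for each batch of batchSize words, reports how many
    ' of them had not been seen in any earlier word.
    Function NewWordsPerBatch(ByVal words As IEnumerable(Of String), _
                              ByVal batchSize As Integer) As List(Of Integer)
        Dim seen As New Dictionary(Of String, Boolean)()
        Dim counts As New List(Of Integer)()
        Dim inBatch As Integer = 0
        Dim added As Integer = 0
        For Each word As String In words
            If Not seen.ContainsKey(word) Then
                seen.Add(word, True)
                added += 1
            End If
            inBatch += 1
            If inBatch = batchSize Then
                counts.Add(added)                   ' close the current batch
                inBatch = 0
                added = 0
            End If
        Next
        If inBatch > 0 Then counts.Add(added)       ' final partial batch
        Return counts
    End Function

    Sub Main()
        Dim sample As String() = {"a", "b", "a", "c", "a", "b"}
        ' Prints 2, then 1: two new words in the first batch, one in the second.
        For Each count As Integer In NewWordsPerBatch(sample, 3)
            Console.WriteLine(count)
        Next
    End Sub
End Module
```

If the counts fall towards zero and stay there, stopping early loses little; if new terms keep appearing, the whole corpus has to be read.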

hth

guy

Author
23 Jun 2006 3:33 PM
jim_adams
I need a list of unique words among all documents.  Since many of the
documents will contain technical terms, now and then it's likely that a
new term will pop up.
