
Re: Parse text into words?

Author
23 Jun 2006 1:30 PM
Larry Lard
Travers Naran wrote:
> jim_ad***@hotmail.com wrote:
> > I need a very efficient way to parse large amounts of text (GBs) on
> > word boundaries.  Words will then be added to an array as long as they
> > haven't already been added.  Splitting on a space is a bit too basic
> > since punctuation will remain.  Maybe regex?
>
> You've got a few choices.  The Regex split can do what you want; just
> split on [ ,.!?;:].  You could also define a regex for your words and
> use Matches().

Regex is overkill for this problem, and for gigabytes of text we need
to think about performance slightly earlier than we normally would.

>
> The other option is to write a lexical analyzer (lexer).  There might
> be some .Net equivalents of the old reliable Lex and Flex.  Not sure if
> they'd be faster in this case, and they seem like massive overkill to me.
>
> Or if you're really insane, you can hand-write a lexical analyzer. :-)

No _lexical_ analysis is involved here - all we are doing is parsing.
This seems to me to be the simplest approach:

- Get the text into a Char array
- Proceed through this array one Char at a time, maintaining an
initially-empty 'current word'
- When a character is read:
- - if it is a letter character, append it to the 'current word'
- - if it is not a letter character, the 'current word' is complete:
process it, and reset the 'current word' to the empty string

Done.
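
For illustration only, the steps above could be sketched roughly as follows. This is Python rather than .NET, and the names iter_words and unique_words are made up for the example. Two details go beyond the steps as listed: empty 'current words' (from consecutive non-letters) are skipped, and a word still in progress at the end of the input is flushed. "Process it" is assumed here to mean "add it to the collection if it is not already there", per the original question.

```python
from typing import Iterable, Iterator, List


def iter_words(chars: Iterable[str]) -> Iterator[str]:
    """Yield each maximal run of letter characters from a stream of characters."""
    current = []  # the 'current word', initially empty
    for ch in chars:
        if ch.isalpha():
            # Letter character: append it to the current word
            current.append(ch)
        elif current:
            # Non-letter: the current word is complete (empty words are skipped)
            yield "".join(current)
            current = []
    if current:
        # Assumption: a word that runs up to the end of the input still counts
        yield "".join(current)


def unique_words(chars: Iterable[str]) -> List[str]:
    """Collect each distinct word once, in first-seen order."""
    seen = set()   # constant-time "already added?" check
    ordered = []
    for word in iter_words(chars):
        if word not in seen:
            seen.add(word)
            ordered.append(word)
    return ordered
```

Because iter_words accepts any iterable of characters, gigabytes of text could be fed to it in chunks from a file reader instead of one giant array. Note that an apostrophe counts as a non-letter here, so "don't" comes out as "don" and "t".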

--
Larry Lard
Replies to group please

Author
23 Jun 2006 4:21 PM
Travers Naran
Larry Lard wrote:
> Travers Naran wrote:
> > You've got a few choices.  The Regex split can do what you want; just
> > split on [ ,.!?;:].  You could also define a regex for your words and
> > use Matches().
>
> Regex is overkill for this problem, and for gigabytes of text we need
> to think about performance slightly earlier than we normally would.

Have you tested the performance yet?  Because a pre-compiled regex can
be surprisingly fast.
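
To make the two regex options concrete, here is a rough sketch in Python (not .NET, so re.compile only stands in for a pre-compiled Regex and says nothing about .NET timings). The function names are invented for the example, the "word" pattern of ASCII letters is an assumption, and the split pattern adds a "+" to the character class so that runs of separators collapse into one.

```python
import re

# Compile each pattern once and reuse it for every chunk of text.
# Assumption: a "word" is a run of ASCII letters.
WORD_RE = re.compile(r"[A-Za-z]+")

# The separator class from the post, with "+" added to merge adjacent separators.
SPLIT_RE = re.compile(r"[ ,.!?;:]+")


def words_by_matching(text):
    """Define what a word looks like and collect every match."""
    return WORD_RE.findall(text)


def words_by_splitting(text):
    """Split on separators; drop the empty strings left at the edges."""
    return [piece for piece in SPLIT_RE.split(text) if piece]
```

The two are not equivalent on all input: splitting keeps anything that is not in the separator list (digits, quotes, brackets), while matching keeps only what the word pattern allows.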

> > Or if you're really insane, you can hand-write a lexical analyzer. :-)
>
> No _lexical_ analysis is involved here - all we are doing is parsing.
> This seems to me to be the simplest approach:
>
> - Get the text into a Char array
> - Proceed through this array one Char at a time, maintaining an
> initially-empty 'current word'
> - When a character is read:
> - - if it is a letter character, append it to the 'current word'
> - - if it is not a letter character, the 'current word' is complete:
> process it, and reset the 'current word' to the empty string

Um, that IS lexical analysis.

Author
23 Jun 2006 11:45 PM
Larry Lard
Travers Naran wrote:
> Larry Lard wrote:
> > Travers Naran wrote:
> > > You've got a few choices.  The Regex split can do what you want; just
> > > split on [ ,.!?;:].  You could also define a regex for your words and
> > > use Matches().
> >
> > Regex is overkill for this problem, and for gigabytes of text we need
> > to think about performance slightly earlier than we normally would.
>
> Have you tested the performance yet?  Because a pre-compiled regex can
> be surprisingly fast.

Sure, but is it going to be faster than the below?

>
> > > Or if you're really insane, you can hand-write a lexical analyzer. :-)
> >
> > No _lexical_ analysis is involved here - all we are doing is parsing.
> > This seems to me to be the simplest approach:
> >
> > - Get the text into a Char array
> > - Proceed through this array one Char at a time, maintaining an
> > initially-empty 'current word'
> > - When a character is read:
> > - - if it is a letter character, append it to the 'current word'
> > - - if it is not a letter character, the 'current word' is complete:
> > process it, and reset the 'current word' to the empty string
>
> Um, that IS lexical analysis.

My mistake.

--
Larry Lard
Replies to group please