Re: Parse text into words?

Travers Naran wrote:
> jim_ad***@hotmail.com wrote:
> > I need a very efficient way to parse large amounts of text (GBs) on
> > word boundaries. Words will then be added to an array as long as they
> > haven't already been added. Splitting on a space is a bit too basic
> > since punctuation will remain. Maybe regex?
>
> You've got a few choices. The Regex split can do what you want; just
> split on [ ,.!?;:]. You could also define a regex for your words and
> use Matches().

Regex is overkill for this problem, and for gigabytes of text we need
to think about performance slightly earlier than we normally would.

> The other option is to write a lexical analyzer (lexer). There might
> be some .Net equivalents of the old reliable Lex and Flex. Not sure if
> they'd be faster in this case, and seem like massive overkill to me.
>
> Or if you're really insane, you can hand-write a lexical analyzer. :-)

No _lexical_ analysis is involved here - all we are doing is parsing.
This seems to me to be the simplest approach:

- Get the text into a Char array
- Proceed through this array one Char at a time, maintaining an
  initially-empty 'current word'
- When a character is read:
- - if it is a letter character, append it to the 'current word'
- - if it is not a letter character, the 'current word' is complete:
    process it, and reset the 'current word' to the empty string

Done.

-- 
Larry Lard
Replies to group please

Larry Lard wrote:
> Travers Naran wrote:
> > You've got a few choices. The Regex split can do what you want; just
> > split on [ ,.!?;:]. You could also define a regex for your words and
> > use Matches().
>
> Regex is overkill for this problem, and for gigabytes of text we need
> to think about performance slightly earlier than we normally would.

Have you tested the performance yet? Because a pre-compiled regex can
be surprisingly fast.

> > Or if you're really insane, you can hand-write a lexical analyzer. :-)
>
> No _lexical_ analysis is involved here - all we are doing is parsing.
> This seems to me to be the simplest approach:
>
> - Get the text into a Char array
> - Proceed through this array one Char at a time, maintaining an
>   initially-empty 'current word'
> - When a character is read:
> - - if it is a letter character, append it to the 'current word'
> - - if it is not a letter character, the 'current word' is complete:
>     process it, and reset the 'current word' to the empty string

Um, that IS lexical analysis.

Travers Naran wrote:
> Larry Lard wrote:
> > Travers Naran wrote:
> > > You've got a few choices. The Regex split can do what you want; just
> > > split on [ ,.!?;:]. You could also define a regex for your words and
> > > use Matches().
> >
> > Regex is overkill for this problem, and for gigabytes of text we need
> > to think about performance slightly earlier than we normally would.
>
> Have you tested the performance yet? Because a pre-compiled regex can
> be surprisingly fast.

Sure, but is it going to be faster than the below?

> > > Or if you're really insane, you can hand-write a lexical analyzer. :-)
> >
> > No _lexical_ analysis is involved here - all we are doing is parsing.
> > This seems to me to be the simplest approach:
> >
> > - Get the text into a Char array
> > - Proceed through this array one Char at a time, maintaining an
> >   initially-empty 'current word'
> > - When a character is read:
> > - - if it is a letter character, append it to the 'current word'
> > - - if it is not a letter character, the 'current word' is complete:
> >     process it, and reset the 'current word' to the empty string
>
> Um, that IS lexical analysis.

My mistake.

-- 
Larry Lard
Replies to group please
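
For anyone who wants to compare the two approaches argued over in this thread, here is a rough sketch of both. It is written in Python rather than .NET purely for illustration, so it is not the code either poster had in mind and says nothing about .NET performance. All function names (scan_words, regex_words, unique_words) are made up for this example. Two details go beyond the steps as posted: empty 'current words' are skipped when several non-letters occur in a row, and the last word is flushed when the text ends on a letter.

```python
import re


def scan_words(text):
    """Hand-written scanner: walk the text one character at a time."""
    current = []  # the 'current word', initially empty
    for ch in text:
        if ch.isalpha():
            # Letter character: append it to the current word.
            current.append(ch)
        elif current:
            # Non-letter: the current word is complete, so emit it and reset.
            # (Assumption: empty words between adjacent non-letters are skipped.)
            yield "".join(current)
            current = []
    if current:
        # Assumption: flush a word that runs up to the end of the text.
        yield "".join(current)


# Pre-compiled pattern for the regex variant: word characters minus digits
# and underscore, which is close to (not exactly) "letters".
WORD_RE = re.compile(r"[^\W\d_]+")


def regex_words(text):
    """Regex variant: yield every run of letters matched by WORD_RE."""
    return (m.group() for m in WORD_RE.finditer(text))


def unique_words(words):
    """Keep each word once, in order of first appearance."""
    seen = set()  # set lookup avoids rescanning the result list
    result = []
    for word in words:
        if word not in seen:
            seen.add(word)
            result.append(word)
    return result


if __name__ == "__main__":
    sample = "Splitting on a space is a bit too basic, since punctuation will remain."
    print(unique_words(scan_words(sample)))
    print(unique_words(regex_words(sample)))
```

On plain ASCII text the two generators should produce the same words, so timing them against each other on a representative file is the honest way to settle the "is regex fast enough" question. For gigabytes of input, the text would need to be fed through in chunks rather than loaded whole, which this sketch does not attempt.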