Regular expression rejecting invalid files

I am using a regular expression to read records from a text file. But when reading files with invalid formats, it takes ages before the program rejects the file. So I want to optimise the expression to reject invalid files faster.

The valid files are well-formed and look something like this:

Once upon a time
CODE NUMBER:123
There was a little lamb
CODE NUMBER:2134

Each record is terminated by a form feed, and the regular expression is something like this:

.Pattern = "(.*)\r\nCODE NUMBER:(\d+)\r\n\f"

Any ideas on how to speed up file rejection?

Regards
Bertrand

Since you have newlines embedded in your regex, you are obviously reading the entire file in before comparing it to your pattern, rather than reading a record at a time. Then your regex has to look through the whole thing to see if your pattern is there. Since binary files can be large, you're probably taking a hit on the file I/O, and then another hit on the regex.

I would probably try reading a dozen bytes from the beginning of each file, and make sure each of the characters I got was alphanumeric, whitespace, or one of some small set of punctuation characters. If it was, I'd go to the full I/O and regex; if not, I'd assume I had a binary file and go on to the next one.

Thanks, that is one way to do it. It does not seem to be I/O, though. I think that perhaps my expression is too "loose" in the sense that it does not include file start/end symbols. If I could only make the grammar more strict, then I would presume that files would be rejected sooner. Any ideas whether this is possible?

Regards
Bertrand

"spamsickle@gmail.com" wrote:
> Since you have newlines embedded in your regex, you are obviously
> reading the entire file in before comparing it to your pattern, rather
> than reading a record at a time. Then your regex has to look through
> the whole thing to see if your pattern is there, Since binary files
> can be large, you're probably taking a hit on the file I/O, and then
> another hit on the regex.
>
> I would probably try reading a dozen bytes from the beginning of each
> file, and make sure each of the characters I got was alphanumeric,
> whitespace, or some small set of punctuation. If it was, I'd go to the
> full I/O and regex; if not, I'd assume I had a binary file and go on to
> the next one.

Since you're reading the entire file into a string before executing your regex, the start of file is the start of string. The way your regex is coded, the regex has to go all the way through the file before it can reject it (that (.*) at the beginning). Is it really necessary to capture everything that comes before CODE NUMBER? If it is, you might try something like "^([a-zA-Z ]{5}.*)" in place of your (.*). Without knowing what your "Once upon a time"s really look like, it's kind of hard to say.
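
The anchoring idea from the last reply can be sketched in Python (the thread's own code appears to be VBScript-style, so this is a translation, not the poster's actual program). It assumes each record's text sits on a single line, that the whole file is already in a string, and that the five-character `[a-zA-Z ]` prefix from the suggestion above really fits the data. The names `RECORD` and `parse_records` are made up for illustration. Instead of letting the engine search for a match anywhere in the file, each record must start exactly where the previous one ended, so a bad file fails at the first position that does not look like a record.

```python
import re

# Hypothetical record pattern: text line, CODE NUMBER line, then a form feed.
# The [a-zA-Z ]{5} prefix is the guess from the thread; adjust it to the real data.
RECORD = re.compile(r"([a-zA-Z ]{5}.*)\r\nCODE NUMBER:(\d+)\r\n\f")

def parse_records(text):
    """Return a list of (text, code) tuples, or None if the file is invalid."""
    records = []
    pos = 0
    while pos < len(text):
        # match() only tries the pattern at `pos`; it never scans ahead
        # looking for a later place where the pattern might fit.
        m = RECORD.match(text, pos)
        if m is None:
            return None  # reject as soon as one record fails to match
        records.append((m.group(1), int(m.group(2))))
        pos = m.end()
    return records
```

With this approach a binary file is typically rejected after inspecting its first character, because nothing is allowed to sit between records. An empty string yields an empty list, which the caller may or may not want to treat as valid.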
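
The earlier suggestion of checking the first dozen bytes before doing the full read could look roughly like the following Python sketch. The set of allowed punctuation, the sample size of 12, and the function names `is_plain_text` and `looks_like_text` are all assumptions chosen for illustration; the thread does not specify them.

```python
import string

# Assumed whitelist: ASCII letters, digits, whitespace (including form feed)
# and a small, arbitrary set of punctuation characters.
ALLOWED = set(
    (string.ascii_letters + string.digits + string.whitespace + ".,:;!?'\"-").encode("ascii")
)

def is_plain_text(sample):
    """True if every byte in `sample` is in the allowed set."""
    return all(byte in ALLOWED for byte in sample)

def looks_like_text(path, sample_size=12):
    """Read only the first `sample_size` bytes of the file and check them."""
    with open(path, "rb") as f:
        head = f.read(sample_size)
    return is_plain_text(head)
```

Only files that pass this cheap check would go on to the full read and regex. Note that an empty file passes, since there are no bytes to reject.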