writing a search text program

Hi,
I want to write a program to search for a given word in a large text file (it could have any format), like a book, and highlight the matching words. A regular search would be too slow because the file might be several hundred pages long. I was thinking about some sort of indexing, but I wasn't sure. Any suggestions?
--
Best regards,
Edward

Edward,
It seems that you want to make a program that nobody has yet been able to make, and you ask here how to do that?
Cor

"Edward" <Edw***@discussions.microsoft.com> wrote in message news:F5235BD1-65CA-4D7F-8BC0-E5DBA7B87B8C@microsoft.com...
> I was thinking about some sort of indexing but I wasn't sure. Any suggestions?

The Indexing Service is one option; otherwise you have to make your own. See here:

Programming with Visual Basic
http://msdn.microsoft.com/en-us/library/ms692251(VS.85).aspx

I am not sure if the samples are for dotnet, or for classic VB. Here is an ASP.Net version:

How to use an ASP.NET application to query an Indexing Service catalog by using Visual Basic .NET
http://support.microsoft.com/kb/820105

Thanks for all the great responses. There are several websites that offer different types of word search, for example BibleGateway.com, where you type a word and it almost immediately brings up, as hyperlinks, all the verses that include that word. I'm trying to write this type of application, which is certainly doable, but I don't know what would be the best approach to achieve the best speed?
--
Best regards,
Edward

"Edward" <Edw***@discussions.microsoft.com> wrote in message news:26437CDC-9EBC-4756-818B-A47A71003E9F@microsoft.com...
> I don't know what would be the best approach to achieve the best speed?

Start here (copy and paste the URL):
http://en.wikipedia.org/wiki/Index_(search_engine)
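The indexing idea raised above can be sketched in a few lines. This is a minimal Python sketch rather than anything posted by the participants. It assumes plain text that fits in memory, treats a "word" as a run of letters, digits, or apostrophes, and records character offsets rather than byte offsets. The names `build_index` and `lookup` are made up for illustration.

```python
import re
from collections import defaultdict

# Assumption: a "word" is a run of ASCII letters, digits, or apostrophes.
WORD_RE = re.compile(r"[A-Za-z0-9']+")


def build_index(text):
    """Scan the text once and map each lowercased word to its start offsets."""
    index = defaultdict(list)
    for match in WORD_RE.finditer(text):
        index[match.group().lower()].append(match.start())
    return dict(index)


def lookup(index, word):
    """Return the character offsets where the word occurs (case-insensitive)."""
    return index.get(word.lower(), [])
```

The text is scanned once when the index is built. After that, each query is a single dictionary lookup instead of another pass over the whole file, which is why a site can answer almost immediately.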
"Edward" <Edw***@discussions.microsoft.com> wrote in message news:26437CDC-9EBC-4756-818B-A47A71003E9F@microsoft.com...
> I'm trying to write this type of application, which is certainly doable,
> but I don't know what would be the best approach to achieve the best speed?

When you see an app like the one you describe, remember that the search has actually taken place before a user asks. Indexes are created off-line, and they are what present the data quickly.
Remember this: index once for speed, and benefit for the rest of time. If the data is dynamic (not an already written text), then you have to weigh the indexing time against the search time. Never an easy thing to do.
LS

"Edward" <Edw***@discussions.microsoft.com> wrote in message news:F5235BD1-65CA-4D7F-8BC0-E5DBA7B87B8C@microsoft.com...
> I was thinking about some sort of indexing but I wasn't sure. Any suggestions?

Far from impossible, this has been done before. Somewhere, I have a 16 KB standalone (no dependencies) executable I picked up years ago, and it was the fastest bulk search I'd seen until I discovered the speed of the Get statement.

My world scripture collection tops 150 MB of raw ASCII text - that's 25 million words, or the equivalent of 50,000 pages (given an average page size of 500 words). In 1998 it took several hours for the fastest search algorithms to scan the collection. Although the three-arm bottleneck created when the Windows PageFile shares the same disk platters with user data and with the operating system plus applications still has a profound effect on speed, faster CPUs, the SATA II protocol, and the introduction of the Honeywell memory access system ("DDR") to the general public have sped things up a great deal.

In any case, Windows search has always been one of the fastest, and it now incorporates an index generated by the Search service. I presume that if .NET offers access to Windows Search, this would save you the trouble of writing your own index - although I'd suggest the Search service needs to be house-trained so that, like the disc defragmenter, it only runs when asked or as part of a user-initiated maintenance procedure.

This leaves us with the question of how to invoke the Windows Search API (the one that utilises the Windows search index, and preferably in a .NET namespace) to return file locations and file access points (the byte number for the start of the search string). As these come in, your program could assemble context statements with the search string highlighted within.

Good luck...
--
Timothy Casey - Email: 5th-prime-num***@timothycasey.info
Software: http://software-1011.com; Scientific IQ Test, Web Menus, Security
http://web-design-1011.com http://speed-reading-comprehension.com
Science & Geology: http://geologist-1011.com; http://geologist-1011.net

"Timothy Casey" <1*@timothycasey.info> wrote in message news:1427221F-D17B-40C5-A4E6-513AF12ECC28@microsoft.com...
> Far from impossible, this has been done before.

it could have any format

"Cor Ligthert[MVP]" <Notmyfirstn***@planet.nl> wrote in message news:e58tjKQ5JHA.2656@TK2MSFTNGP05.phx.gbl...
> it could have any format

You make a good point, Cor. Sometimes key text is buried in images, non-standard binaries, internal file compression, and encryption. This can be really frustrating. However, your point alludes to the most interesting part of the question.

The beginning for pulling text out of these other formats in a generic way falls to Natural Language Processing, or NLP, because language has a mathematical signature that corresponds to the myriad of rules that apply to spelling and grammar. In spite of all the spectacular claims, no one has NLP - not yet. The foundation of NLP is contextualisation, which has been the focus of languages such as XML. However, as the folks at Brown University soon discovered, there are also issues of core structure versus extensible features of language that vary from node type to node type in the structural hierarchy of communication. Did I mention that language is not compatible with well-formed hierarchies, due to the frequency of two-way ambiguity in word meanings (and often function)? Thus context is drawn from structure, which itself could be any one of a number of possibilities that cannot always be resolved from structure. Consider the meaning of the word "green" in the following examples:

1. The green recruit
2. The green passenger
3. The green corporation
4. The green thumb

In each case the meaning of green depends on the definition of the applicable noun.

Nobody's clear on a system, and when you compare the effectiveness of .NET as a language unto itself - it emerges that there may well be some errors in the conventional academic perception of linguistic structure. Linguists hold the verb, for example, as an equal classification to the noun when considering parts of speech - but in the Microsoft class system, a verb is merely a sub-part of the noun. The Microsoft system works very well, so perhaps the engineering proves they got something right in this department...?

In any case, we have a long way to go, even if the data and analyses being accumulated are fascinating.
--
Timothy Casey

Timothy,
Your reply made me think of this; you see it often (mostly) in my replies.
http://en.wikipedia.org/wiki/Dunglish
Be aware that the people mentioned there are probably all fluent speakers of at least English, French, and German besides Dutch.
Cor

All that is quite correct, but it's not relevant to the problem, and the task as stated is definitely, as you state, far from impossible. Although the OP used the term "any format", he also used the terms "large text file" and "given word". So he is not considering language information represented in anything other than plain text, and an indexer does not need to comprehend the file in order to find a match between a 'given word' and some portion of a text file. The respondent is choosing to interpret the question in a way that enables him to avoid addressing the real issue.

"James Hahn" <jh***@yahoo.com> wrote in message news:uA1g4Ix5JHA.6004@TK2MSFTNGP02.phx.gbl...
> an indexer does not need to comprehend the file in order to find a match
> between a 'given word' and some portion of a text file.

Which brings us back to Windows Search and the attached index provided by the Search Service. Two questions: does anyone know

1. The .NET Namespace necessary to tap into Windows Search?
2. The range of house-training options available to the Windows Registry for the search service?

I too would like to know.

Thanks in Advance...
--
Timothy Casey

Timothy Casey wrote:
> 1. The .NET Namespace necessary to tap into Windows Search?

In general, this requires using the Content Indexing Service and COM API. Let's see what the .NET wrappers are......

There is a bunch of stuff. I found this example:

How to use an ASP.NET application to query an Indexing Service catalog by using Visual Basic .NET
http://support.microsoft.com/kb/820105

> 2. The range of house-training options available to the Windows Registry
> for the search service?

Not sure what that means. Don't you ideally want to eliminate any direct Windows Registry usage?
--
"Mike" <unkn***@unknown.tv> wrote in message news:e8cU8%2315JHA.1432@TK2MSFTNGP02.phx.gbl...
> How to use an ASP.NET application to query an Indexing
> Service catalog by using Visual Basic .NET
>
> http://support.microsoft.com/kb/820105

Thanks

>> 2. The range of house-training options available to the Windows Registry
>> for the search service?
>
> Not sure what that means. Don't you ideally want to eliminate any
> direct Windows Registry usage?

The Search service often comes on when it's not wanted. Plug in a USB (thumb) drive for a 27-second backup, and the system tells you that the drive is in use for the next 30 minutes because the Search service has decided to index the drive for the Nth time. This is not good when you are in a hurry, have somewhere else to go, and were not planning to wait 30 minutes for Windows to release the thumb drive. Other bad habits include hogging resources needed on demand by other program launches (which sometimes leads to a freeze).

A search program with an easy means of regulating indexing and giving the user more control would ultimately be a better product. If this can be done through a namespace, I'm all ears - otherwise it falls to registry settings, does it not?
--
Timothy Casey
"Mike" <unkn***@unknown.tv> wrote in message news:e8cU8%2315JHA.1432@TK2MSFTNGP02.phx.gbl...
> How to use an ASP.NET application to query an Indexing
> Service catalog by using Visual Basic .NET
>
> http://support.microsoft.com/kb/820105

How would this be applied to a desktop application in VB2005?
Also, is there a way to get the program to initiate the building of the catalogue without the user having to know that...?
--
Timothy Casey

Timothy Casey wrote:
> How would this be applied to a desktop application in VB2005?
> Also, is there a way to get the program to initiate the building of the
> catalogue without the user having to know that...?

Hi Timothy,

First, a small side note. Check your system date, or the mail client you are using, and make sure the local date (or Zulu/GMT, what have you) is set properly. Your mail is showing up post-dated, and that skews the threading and date sort order of incoming mail. It's very annoying, and I also recommend fixing it because some AVS filtering systems look for incorrect or post-dated mail as a marker of spammers. Just a side note.

I didn't think the complete example would be useful; the point was rather to show how to access the indexing COM API.
--
"Mike" <unkn***@unknown.tv> wrote in message news:u5OIDqE6JHA.1432@TK2MSFTNGP02.phx.gbl...
[SNIP]
> First, a small side note. Check your system date or the mail writer
> system you are using to properly set the localized date or ZULU/GMT what
> have you date.
[SNIP]

Thanks for the heads-up.

Just on that side note, I've checked the date and time in both the system and the CMOS clock: precisely six minutes and twenty seconds ahead of local time. This is not the first time either. I'd suggest the possibility that, for some reason, the server is not adjusting the time of my post by the time difference from my location: it seems that at the last update, Windows Vista Business took it upon itself to relocate my computer to Canada. Having dragged the computer (and operating system) kicking and screaming back to Australia - how does this come up...?
--
Timothy Casey

Off topic: Please check your computer date and time zone; your posts seem to be ahead by a few hours...

"Nobody" <nob***@nobody.com> wrote in message news:%23sb5Cu35JHA.1712@TK2MSFTNGP03.phx.gbl...
> Off topic: Please check your computer date and time zone, your posts seem
> to be ahead by few hours...

Sorry, the message was meant for Timothy Casey.
Thank you

Timothy Casey wrote:
> 2. The range of house-training options available to the Windows Registry
> for the search service?

Timothy, have you tried turning off the indexing on the USB drive?
--
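Returning to the suggestion made earlier in the thread, that a program could assemble context statements with the search string highlighted once match positions are known, here is one way that step might look. It is a minimal Python sketch under stated assumptions: the text is already in memory, positions are character offsets rather than the byte numbers mentioned in the thread, and the `**` marker, the 30-character context width, and the function names are arbitrary choices for illustration. It does not use the Windows Search API.

```python
import re


def find_offsets(text, word):
    """Return start offsets of whole-word, case-insensitive matches."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [m.start() for m in pattern.finditer(text)]


def highlight_matches(text, word, offsets, context=30, marker="**"):
    """Build one snippet per offset, with the match wrapped in marker."""
    snippets = []
    for pos in offsets:
        match_end = pos + len(word)
        start = max(0, pos - context)             # clamp at start of text
        end = min(len(text), match_end + context)  # clamp at end of text
        snippets.append(
            text[start:pos] + marker + text[pos:match_end] + marker
            + text[match_end:end]
        )
    return snippets
```

The offsets could equally come from a prebuilt index instead of `find_offsets`; the snippet builder only needs the positions and the original text.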