[STRING] extract a word and text around it

I need to extract a word, plus a little of the text that precedes and follows it (about 30 + 30 chars), from a long textual document, like the description that Google returns when it has found a given word.

For example, from:

"Sylvia Brunner, a marine mammals researcher at the museum in Fairbanks, identified the decomposing carcass and oversaw its recovery on Wednesday. The "bloated, black thing on the beach" was about 12 feet from the river's edge, she said."

I have to find the word 'carcass' and return:

"identified the decomposing carcass and oversaw its recovery on"

---

Which is the *fast* method in VB.NET?

In VB6 I would have used InStr (with the Binary option, because it is faster) to find the position of the word, then Mid to extract the preceding text, then Mid to extract the following text, and then built up my phrase this way: text1 & word & text2.

Any suggestions for VB.NET? New methods, StringBuilder, regular expressions... or what else?

--------

Thanks (examples are obviously very appreciated ;-) )

teo
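The VB6 approach described above (InStr, then Mid twice) maps fairly directly onto String.IndexOf and String.Substring in .NET. The following is only a minimal sketch under that assumption: the helper name ExtractSnippet, its parameters, and the module name are hypothetical and not from the thread. It keeps the three pieces (text1, word, text2) separate so they can be joined or formatted individually.

```vbnet
Imports System

Module SnippetDemo
    ' Hypothetical helper (not from the thread): returns the text before the word,
    ' the word itself, and the text after it, with up to `radius` characters on
    ' each side. Returns Nothing when the word does not occur in `source`.
    Function ExtractSnippet(ByVal source As String, ByVal word As String, ByVal radius As Integer) As String()
        ' Ordinal comparison is the closest analogue of VB6's binary InStr.
        Dim pos As Integer = source.IndexOf(word, StringComparison.Ordinal)
        If pos < 0 Then Return Nothing

        ' Clamp both windows so they never run past the ends of the document.
        Dim beforeStart As Integer = Math.Max(0, pos - radius)
        Dim afterStart As Integer = pos + word.Length
        Dim afterLen As Integer = Math.Min(radius, source.Length - afterStart)

        Dim before As String = source.Substring(beforeStart, pos - beforeStart)
        Dim after As String = source.Substring(afterStart, afterLen)
        Return New String() {before, word, after}
    End Function

    Sub Main()
        Dim source As String = "Sylvia Brunner, a marine mammals researcher at the museum in Fairbanks, identified the decomposing carcass and oversaw its recovery on Wednesday. The ""bloated, black thing on the beach"" was about 12 feet from the river's edge, she said."

        Dim parts As String() = ExtractSnippet(source, "carcass", 30)
        If parts IsNot Nothing Then
            ' text1 & word & text2, as in the VB6 version.
            Console.WriteLine(parts(0) & parts(1) & parts(2))
        End If
    End Sub
End Module
```

Passing StringComparison.OrdinalIgnoreCase instead would make the search case-insensitive; whether either choice is actually the fastest option for a given document set is something only a benchmark on real data can settle.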
| Which is the *fast* method in VbNet?

It sounds like you have the methods identified; you simply want someone else to test them for you. Why not test them yourself, as you probably already have the situation (program) and data to test them with?

I would probably use a regular expression, as regex feels like the "correct" solution (not necessarily the fastest method). The trick is going to be ensuring that it is an efficient expression and not a poorly performing one, for example by using a lazy match instead of a greedy match on the 30 characters before and after...

If I have time later I will see what regex I can come up with...

--
Hope this helps
Jay B. Harlow [MVP - Outlook]
.NET Application Architect, Enthusiast, & Evangelist
T.S. Bradley - http://www.tsbradley.net

Use the same method as you would in VB6. Use the IndexOf method to find the string and the Substring method to extract the part of the text.

However, I don't see the reason for getting the preceding text and following text, just to put them together, when the string that you want already exists in the text.

On Sat, 17 Jun 2006 23:12:22 +0200, Göran Andersson <gu***@guffa.com>
wrote:

>Use the same method as you would in VB6. Use the IndexOf method to find
>the string and the Substring method to extract the part of the text.
>
>However, I don't see the reason for getting the preceding text and
>following text, just to put them together, when the string that you want
>already exists in the text.

The fact is that I'm building a search engine, and I need to format the searched word as bold, so I'm compelled to have two chunks of text, so that I can format my final string like this: plain Text1 + bold word + plain Text2.

Because I have to extract the full text from a column of a DB (and then extract only a part of it, as described above), do you know if SQL syntax is able to perform such an extraction?

Or am I compelled to extract the string using the VB methods after having read the full text through a DataReader?

| The fact is that I'm building a searching engine, and I need to
| format the searched word as Bold,

Rather than search for the text each time, have you considered "indexing" each document?

Then when you need to do a search, you simply check the index; the index would return where in the text the word was found.

--
Hope this helps
Jay B. Harlow [MVP - Outlook]
.NET Application Architect, Enthusiast, & Evangelist
T.S. Bradley - http://www.tsbradley.net

On Sat, 17 Jun 2006 22:42:07 -0500, "Jay B. Harlow [MVP - Outlook]"
<Jay_Harlow_***@tsbradley.net> wrote:

>| The fact is that I'm building a searching engine, and I need to
>| format the searched word as Bold,
>Rather than search for the text each time, have you considered "indexing"
>each document.
>
>Then when you need to do a search, you simply check the index, the index
>would return where in the text the word was found.

I know that there is such an option, but I didn't consider it as a solution because it seemed to me that it would require a lot of work up front; also, searching for a given word in the resulting huge list of indexed words would, I think, take a lot of time, maybe as much time as searching the text for the given word every time.

I'm only guessing; I have no benchmark...

Maybe I'll implement such a solution when I've finished the method I've started to develop now.

----------

Another question:

the RegExp sample we discuss above returns 30 + 30, regardless of where the 30 on the left start. I'll try to explain what I mean.

I'd like the chunk of text on the left to start where the sentence containing the given word starts (so as to have the very first letter capitalized), the way Google displays its results; that is, if you search for 'Lewinsky', Google returns:

To maintain the *Lewinsky* Story's original feel, we will leave much of this ...
These were gifts the president had originally given to Ms. Lewinsky himself. ...

In this case, the letter 'T' is at position -16, so I give up the preceding 14 chars, start straight at -16, and increase the chunk of text on the right to 44 (= 30 + 14).

If the "T" isn't within the 30 chars on the left, no problem, I accept the old 30+30 solution.

Is it possible?

-----

Basically, we need to keep track of the . (dot) char, which signals that a sentence (within the 30 chars on the left) is about to start.

If a single RegExp doesn't work, we could maybe double the first RegExp (60 + 60) and then, with a second RegExp, find the dot char and simply extract the following 60-char chunk of text on the right.

What about this?

Teo,
Here is a regex:

    ' Requires Imports System.Text.RegularExpressions at the top of the file.
    Dim input As String = "Sylvia Brunner, a marine mammals researcher at the museum in Fairbanks, identified the decomposing carcass and oversaw its recovery on Wednesday. The ""bloated, black thing on the beach"" was about 12 feet from the river's edge, she said."

    ' Between 1 and 30 characters before the word, and between 1 and 30 after it.
    Dim pattern As String = ".{1,30}?carcass.{1,30}"

    Dim match As Match = Regex.Match(input, pattern, RegexOptions.Multiline)

    If match.Success Then
        Debug.WriteLine(match.Value)
    End If

--
Hope this helps
Jay B. Harlow [MVP - Outlook]
.NET Application Architect, Enthusiast, & Evangelist
T.S. Bradley - http://www.tsbradley.net

Thanks;
I really didn't want you to do the work for me; I only wanted to know the names of the functions that are advisable to use in this case...
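The follow-up question earlier in the thread, about starting the left-hand chunk at a sentence boundary and giving the dropped characters to the right-hand chunk, is not answered in the replies. One possible way to do it without a second regular expression is sketched below. The helper name ExtractSentenceAwareSnippet and the module name are hypothetical, and treating ". " (dot followed by a space) as the sentence boundary is an assumption made only for this sketch.

```vbnet
Imports System

Module SentenceSnippetDemo
    ' Hypothetical helper (not from the thread). Returns {before, word, after}.
    ' If a ". " occurs inside the left window, the left chunk starts right after
    ' it and the right chunk grows by the number of characters given up.
    ' Otherwise it falls back to the plain radius + radius behaviour.
    Function ExtractSentenceAwareSnippet(ByVal source As String, ByVal word As String, ByVal radius As Integer) As String()
        Dim pos As Integer = source.IndexOf(word, StringComparison.Ordinal)
        If pos < 0 Then Return Nothing

        ' Plain left window: up to `radius` characters before the word.
        Dim beforeStart As Integer = Math.Max(0, pos - radius)
        Dim before As String = source.Substring(beforeStart, pos - beforeStart)

        ' Assumption: ". " marks the end of the previous sentence.
        Dim extra As Integer = 0
        Dim dot As Integer = before.LastIndexOf(". ", StringComparison.Ordinal)
        If dot >= 0 Then
            extra = dot + 2                  ' characters dropped from the left
            before = before.Substring(extra)
        End If

        ' Right window: radius plus whatever was dropped on the left,
        ' clamped to the end of the document.
        Dim afterStart As Integer = pos + word.Length
        Dim afterLen As Integer = Math.Min(radius + extra, source.Length - afterStart)
        Dim after As String = source.Substring(afterStart, afterLen)

        Return New String() {before, word, after}
    End Function

    Sub Main()
        Dim source As String = "Aaaa bbbb. Cc dd WORD eeee ffff gggg"
        Dim parts As String() = ExtractSentenceAwareSnippet(source, "WORD", 10)
        If parts IsNot Nothing Then
            ' The word is kept separate so it can be formatted on its own.
            Console.WriteLine(parts(0) & "[" & parts(1) & "]" & parts(2))
        End If
    End Sub
End Module
```

A bare ". " test will also fire on abbreviations such as the "Ms. Lewinsky" in the Google example above, so a real search engine would probably need a more careful boundary rule; the sketch only shows how the 30 + 14 = 44 bookkeeping could work.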