A question about a failing regular expression

Author
10 Jun 2009 5:55 PM
Anthony P.
Hello Everyone,

My application needs to parse some HTML. As is usual in HTML parsing,
I just need the data between two HTML tags. So here is my regular
expression:

Dim myRegex2 = New Regex("<td headers=""re2 e1"" align=""right"" valign=""bottom"">" & _
                         "((.|\n)*?)<sup>", RegexOptions.IgnoreCase)

Now, this is supposed to get the text between the <td headers=...> tag
and the <sup> tag. But instead it returns the entire tag, including all
of the attributes.  What am I doing wrong?

Thanks!

Author
10 Jun 2009 6:01 PM
Anthony Papillion
Sorry, I forgot to add that I am also doing the required myMatch =
myRegex2.Match(sContent) after the expression, thereby performing the
match against the string sContent.
Author
11 Jun 2009 2:24 PM
eBob.com
The expression is returning what you have asked for.  Maybe not what you are
interested in, but what you have asked for.

You need to look at what the author of my favorite reference (Balena) calls
"zero width positive/negative look-ahead/behind assertions".  These are
"grouping constructs".  Maybe you could use a "noncapturing group" - I don't
think I've used that construct.

(I'd like to be more specific but I am at the wrong computer at the moment.)
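
For illustration only, here is a minimal sketch of how the look-behind/look-ahead idea could be applied to the pattern from the original post. The sample input and the module name are made up for the example, and [\s\S]*? is swapped in for (.|\n)*? as one common way to match across newlines; it is not necessarily what was meant above.

```vb
Imports System
Imports System.Text.RegularExpressions

Module LookAroundSketch
    Sub Main()
        ' Hypothetical sample input; the real page content is not shown in the thread.
        Dim sContent As String = _
            "<td headers=""re2 e1"" align=""right"" valign=""bottom"">" & _
            "42.5<sup>a</sup></td>"

        ' (?<=...) is a zero-width positive look-behind: the opening tag must
        ' come right before the match but is not included in it.
        ' (?=...) is a zero-width positive look-ahead: <sup> must come right
        ' after the match but is not included in it.
        ' [\s\S]*? lazily matches any characters, newlines included.
        Dim pattern As String = _
            "(?<=<td headers=""re2 e1"" align=""right"" valign=""bottom"">)" & _
            "[\s\S]*?(?=<sup>)"

        Dim m As Match = Regex.Match(sContent, pattern, RegexOptions.IgnoreCase)
        If m.Success Then
            ' The whole match is now only the text between the two tags.
            Console.WriteLine(m.Value)
        End If
    End Sub
End Module
```

With this sample input, m.Value should hold just the cell text (42.5), because the assertions check for the surrounding tags without consuming them.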

ALSO ... do yourself a favor and get a FREE product named Expresso from
Ultrapico.  It is WONDERFUL for developing regular expressions.

Regular expressions are very useful but not very intuitive.  Ask if you have
further questions.

Good Luck,  Bob


Author
12 Jun 2009 7:47 PM
Branco
Anthony P. wrote:
<snip>
> My application needs to parse some HTML. As is usual in HTML parsing,
> I just need the data between two HTML tags. So here is my regular
> expression:
>
> Dim myRegex2 = New Regex("<td headers=""re2 e1"" align=""right"" valign=""bottom"">" & _
>                          "((.|\n)*?)<sup>", RegexOptions.IgnoreCase)
>
> Now, this is supposed to get the text between the <td headers=...> tag
> and the <sup> tag. But instead it returns the entire tag, including all
> of the attributes.  What am I doing wrong?
<snip>

You have probably figured it out by now, but it seems you need to
retrieve the grouped text from the Match's Groups property. The Groups
collection is 0-based, but item 0 is the full matched text, so you
need to retrieve Groups(1):

  <example>
  Dim M As Match = MyRegex2.Match(sContent)
  Do While M.Success
    ' Groups(0) is the whole match; Groups(1) is the first parenthesized group
    Dim Text As String = M.Groups(1).Value
    ' ... do something with Text ...
    M = M.NextMatch
  Loop
  </example>
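
To make the difference between the two group indexes concrete, here is a small self-contained sketch that runs the original pattern against made-up input and prints both Groups(0) and Groups(1). The sample HTML and the module name are assumptions for the example; only the pattern comes from the original post.

```vb
Imports System
Imports System.Text.RegularExpressions

Module GroupsSketch
    Sub Main()
        ' Hypothetical sample input with two matching cells.
        Dim sContent As String = _
            "<td headers=""re2 e1"" align=""right"" valign=""bottom"">10<sup>a</sup></td>" & _
            "<td headers=""re2 e1"" align=""right"" valign=""bottom"">20<sup>b</sup></td>"

        ' Same pattern as in the original post.
        Dim myRegex2 As New Regex( _
            "<td headers=""re2 e1"" align=""right"" valign=""bottom"">" & _
            "((.|\n)*?)<sup>", RegexOptions.IgnoreCase)

        Dim m As Match = myRegex2.Match(sContent)
        Do While m.Success
            ' Groups(0): everything the pattern matched, tags included.
            Console.WriteLine("Whole match: " & m.Groups(0).Value)
            ' Groups(1): only the outer parenthesized part between the tags.
            Console.WriteLine("Group 1:     " & m.Groups(1).Value)
            m = m.NextMatch()
        Loop
    End Sub
End Module
```

Note that the inner parentheses in (.|\n) form a second capturing group, Groups(2), which only holds the last character matched by the repetition; Groups(1) is the one that carries the text between the tags.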

HTH

Regards,

Branco
Author
12 Jun 2009 10:35 PM
Anthony Papillion
<snip>
> You probably figured it out at this point, but it seems you need to
> retrieve the grouped text from the Match's Groups property (the groups
> collection is 0 based, but the 0th item is the full matched text, thus
> you need to retrieve group(1):
<snip>

Hi Branco,

No, I hadn't figured it out yet, and I thank you for your help.  I saw
something about the match's groups the other day, but it didn't click
that that was what I needed. Thank you, sir!

Anthony