A Question About Regular Expressions and Capture

Author
13 Jun 2006 12:57 AM
eBob.com
I am using regular expressions, and a particular feature called "capture" (I
think), to suck some information out of some HTML.  I could never have come
up with this myself, but Balena has an example which is very similar to this.
The guts of the program are ...

Dim i As Integer
Dim rgx As Regex

Dim Pattern As String = "<td class=td1 width=""35%"">(<b>){0,1}(?<variable>(\w| )+)</td>" + _
    "\s*.*\s*<td class=td2 width=""65%"">(<b>){0,1}(?<value>.+)(</b>){0,1}</td>"

' Extra parentheses around the "value" group don't help
Dim Pattern2 As String = "<td class=td1 width=""35%"">(<b>){0,1}(?<variable>(\w| )+)</td>" + _
    "\s*.*\s*<td class=td2 width=""65%"">(<b>){0,1}((?<value>.+))(</b>){0,1}</td>"

rgx = New Regex(Pattern)
tbxPattern.Text = Pattern

Dim m As Match, g As Group

For Each m In rgx.Matches(tbxInput.Text)
    g = m.Groups("variable")
    lstbxKeys.Items.Add(g.Value)
    g = m.Groups("value")
    lstbxValues.Items.Add(g.Value)
Next

The data looks like this (below).  It works fine for all cases except the
first (the "Celular" data) where the value is picked up as
"123-abc-5678</b>".  I want, and I think it should be, "123-abc-5678".   I
can't understand why the "</b>" is included in the value.   Doesn't my
pattern clearly show that the value is a string of one or more characters,
terminated by, optionally, "</b>" followed by "</td>"?   Is there a
straightforward way to tell it to not include the "</b>" in the value?  Note
that the "</b>" is not always present so the pattern has to say that it is
optional.

Thanks,  Bob


   <tr height=24>
     <td class=td1 width="35%"><b>Celular</td>
     <td width=1><img src="../img/p.gif" width=1 height=1></td>
          <td class=td2 width="65%"><b>123-abc-5678</b></td>
         </tr>



        <tr height=24>
     <td class=td1 width="35%">Edad</td>
     <td width=1><img src="../img/p.gif" width=1 height=1></td>
     <td class=td2 width="65%">24 Años</td>
    </tr>

        <tr height=24>
     <td class=td1 width="35%">Altura</td>
     <td width=1><img src="../img/p.gif" width=1 height=1></td>
     <td class=td2 width="65%">1.70 mts.</td>

Author
13 Jun 2006 9:25 AM
Larry Lard
eBob.com wrote:
> The data looks like this (below).  It works fine for all cases except the
> first (the "Celular" data) where the value is picked up as
> "123-abc-5678</b>".  I want, and I think it should be, "123-abc-5678".   I
> can't understand why the "</b>" is included in the value.   Doesn't my
> pattern clearly show that the value is a string of one or more characters,
> terminated by, optionally, "</b>" followed by "</td>"?

Yes, but remember that regex quantifiers are 'greedy' by default - each one
takes as many characters as it can while still letting the rest of the
pattern match. Thus when given a choice between:

value: 123-abc-5678</b>
optional </b>: no

and

value: 123-abc-5678
optional </b>: yes

since the 'value' match happens first, and it can legitimately capture
everything including the </b>, it does so.
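The two candidate splits above can be reproduced on a cut-down version of the problem. This is only a sketch: the module name GreedyDemo, the helper CaptureValue, and the shortened pattern (just the tail of the original, with the td attributes dropped) are made up for illustration and are not part of the original program.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GreedyDemo
    ' Hypothetical helper: returns what the named group "value" captures
    ' when the given pattern is applied to the given input, or Nothing
    ' if the pattern does not match at all.
    Function CaptureValue(ByVal pattern As String, ByVal input As String) As String
        Dim m As Match = Regex.Match(input, pattern)
        If m.Success Then Return m.Groups("value").Value
        Return Nothing
    End Function

    Sub Main()
        ' Shortened form of the second half of the original pattern.
        Dim pattern As String = "<td>(<b>){0,1}(?<value>.+)(</b>){0,1}</td>"

        ' The greedy .+ first runs to the end of the line, then gives back
        ' only as much as the rest of the pattern needs. Giving back "</td>"
        ' is enough, because the "</b>" group is allowed to match zero times.
        Console.WriteLine(CaptureValue(pattern, "<td><b>123-abc-5678</b></td>"))

        ' With no "</b>" in the cell there is nothing extra to swallow.
        Console.WriteLine(CaptureValue(pattern, "<td>1.70 mts.</td>"))
    End Sub
End Module
```

The first call prints 123-abc-5678</b> and the second prints 1.70 mts., which matches the behaviour described in the question.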

>  Is there a
> straightforward way to tell it to not include the "</b>" in the value?

How about, instead of value capturing one or more of any character with


.+

you instead capture one or more characters that aren't < with

[^<]+
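Dropped into the original pattern, that change might look like the sketch below. The module name NegatedClassDemo and the helper ExtractPairs are hypothetical, and the input is two rows copied from the sample data in the question; the only change to the pattern itself is [^<]+ in place of .+ inside the "value" group.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module NegatedClassDemo
    ' Original pattern with the "value" group changed from .+ to [^<]+
    Private Const PairPattern As String = _
        "<td class=td1 width=""35%"">(<b>){0,1}(?<variable>(\w| )+)</td>" & _
        "\s*.*\s*<td class=td2 width=""65%"">(<b>){0,1}(?<value>[^<]+)(</b>){0,1}</td>"

    ' Hypothetical helper: collects every variable/value pair found in html.
    Function ExtractPairs(ByVal html As String) As List(Of KeyValuePair(Of String, String))
        Dim pairs As New List(Of KeyValuePair(Of String, String))
        For Each m As Match In Regex.Matches(html, PairPattern)
            pairs.Add(New KeyValuePair(Of String, String)( _
                m.Groups("variable").Value, m.Groups("value").Value))
        Next
        Return pairs
    End Function

    Sub Main()
        ' Two rows taken from the sample data in the question.
        Dim html As String = String.Join(vbLf, New String() { _
            "<tr height=24>", _
            " <td class=td1 width=""35%""><b>Celular</td>", _
            " <td width=1><img src=""../img/p.gif"" width=1 height=1></td>", _
            " <td class=td2 width=""65%""><b>123-abc-5678</b></td>", _
            "</tr>", _
            "<tr height=24>", _
            " <td class=td1 width=""35%"">Altura</td>", _
            " <td width=1><img src=""../img/p.gif"" width=1 height=1></td>", _
            " <td class=td2 width=""65%"">1.70 mts.</td>", _
            "</tr>"})

        For Each pair As KeyValuePair(Of String, String) In ExtractPairs(html)
            Console.WriteLine(pair.Key & " = " & pair.Value)
        Next
    End Sub
End Module
```

Because [^<] cannot consume the "<" that starts "</b>", the value stops right before the closing tag. The trade-off is that a cell whose value contains any nested tag would no longer match at all.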

Also, there are flags you can put in to make expressions non-greedy,
but I don't think that will work in this situation.
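For reference, .NET expresses non-greedy matching with a "?" placed after the quantifier (for example .+?) rather than with an option flag, as far as I know. On a cut-down version of the sample row, a lazy quantifier does appear to produce the desired capture, because the optional "</b>" and the "</td>" follow the value directly. The sketch below uses a hypothetical module LazyDemo and helper CaptureLazy with a shortened pattern; it is not the original program, and it has not been checked against the full page.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LazyDemo
    ' Hypothetical helper: captures the "value" group using a lazy .+?
    ' in a shortened form of the original pattern.
    Function CaptureLazy(ByVal cell As String) As String
        Dim pattern As String = "<td>(<b>){0,1}(?<value>.+?)(</b>){0,1}</td>"
        Dim m As Match = Regex.Match(cell, pattern)
        If m.Success Then Return m.Groups("value").Value
        Return Nothing
    End Function

    Sub Main()
        ' The lazy .+? grows one character at a time and stops as soon as
        ' the optional "</b>" plus "</td>" can match what follows.
        Console.WriteLine(CaptureLazy("<td><b>123-abc-5678</b></td>"))
        Console.WriteLine(CaptureLazy("<td>1.70 mts.</td>"))
    End Sub
End Module
```

This prints 123-abc-5678 and then 1.70 mts. A lazy quantifier still accepts "<" characters when it has to, so on less regular markup it can run further than intended, which is one reason to prefer the negated character class.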

BUT

I would *urge* you  to stop trying to parse HTML with regex, and
instead run (don't walk) to
<http://smourier.blogspot.com/2005/05/net-html-agility-pack-how-to-use.html>,
and from there download HtmlAgilityPack, which is an absolutely
invaluable library that converts (even malformed) HTML into a nice XML
document tree. It makes HTML parsing a hundred times easier
than trying to use regex.

--
Larry Lard
Replies to group please
Author
13 Jun 2006 11:36 AM
eBob.com
Thank you very much Larry.  It finally occurred to me that there had to be
some way to take advantage of the fact that the string I am after does not
contain "<", but the only solution I could think of was very ugly.  Your
suggestion is much, much better.  And thank you for making me aware of the
HtmlAgilityPack; I will be looking into it.

Thanks,  Bob
