
Regular expression to match person's name

Author
12 Aug 2006 5:17 PM
Johnny Williams
I'm struggling to create a regular expression for use with VB .Net which
matches a person's name in a string of words.

For example in "physicist Albert Einstein was born in Germany and"
I want to match "Albert Einstein"

In "scientist Sir Isaac Newton wrote the Philosophiae Naturalis Principia
Mathematica"
I want to match "Sir Isaac Newton"

In all cases the names are capitalised and the first word in the string
starts with a lower case character and the first word after the name starts
with a lower case character.

A regex which matches from the first uppercase character to the first
lowercase character preceded by a space would work, but all my attempts have
so far failed!

Thanks for any help.

Author
12 Aug 2006 6:47 PM
Fabio
"Johnny Williams" <REMOVEjohnwilliams_***@NOThotmail.com> wrote in message

> I'm struggling to create a regular expression for use with VB .Net which
> matches a person's name in a string
> of words.
>
> For example in "physicist Albert Einstein was born in Germany and"
> I want to match "Albert Einstein"
>
> In "scientist Sir Isaac Newton wrote the Philosophiae Naturalis Principia
> Mathematica"
> I want to match "Sir Isaac Newton"
>
> In all cases the names are capitalised and the first word in the string
> starts with a lower case character and the first word after the name
> starts
> with a lower case character.
>
> A regex which matches from the first uppercase character to the first
> lowercase character preceded by a space would work, but all my attempts
> have so far failed!

(?<name>[A-Z][a-z]+)

The real problem is: how do you distinguish Albert from Germany as a valid
name?


--

Free .Net Reporting Tool - http://www.neodatatype.net
Author
12 Aug 2006 7:45 PM
Johnny Williams
"Fabio" <znt.fa***@virgilio.it> wrote in message
news:eU5eI$jvGHA.356@TK2MSFTNGP04.phx.gbl...
>
> (?<name>[A-Z][a-z]+)
>
> The real problem is: how do you distinguish Albert from Germany as a valid
> name?

Thanks for your reply Fabio.  Your regex is the standard one for matching
capitalised words, and as such will match Albert, Einstein and Germany as
separate words, and as you say there is no way of determining which of these
is a valid name.  I would like to avoid this problem by extracting the whole
name in a single match.

The regex in the following VB code nearly does the job:

===== Start Module1.vb ======

'Visual Basic Console Application

Imports System.Text.RegularExpressions

Module Module1

    Sub Main()

        Dim re As Regex = New Regex("([A-Z].*?)([ ][a-z])", RegexOptions.Compiled)
        Dim reMatch As Match
        Dim input, name As String

        input = "physicist Albert Einstein was born in Germany and"

        reMatch = re.Match(input)

        If reMatch.Success Then
            name = reMatch.Groups(1).Value
            Debug.WriteLine("|" + name + "|")
        End If

    End Sub

End Module

======= End Module1.vb ========

The above regex correctly matches "Albert Einstein" in the input string as a
single match, and also "Sir Isaac Newton" in my other test string.

However, it's not quite correct because I also want it to match when there
is nothing following the name, e.g. in "physicist Albert Einstein".

Any ideas?  Thanks.
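
One possible adjustment, offered only as a sketch: swap the trailing group for a lookahead that accepts either a space followed by a lowercase letter or the end of the string. The lazy match then stops in the same place as before, but also succeeds when nothing follows the name. The module name `LookaheadSketch` and the second test string are made up for illustration.

```vb
Imports System
Imports System.Text.RegularExpressions

Module LookaheadSketch

    Sub Main()
        ' Lazy match from the first uppercase letter up to, but not including,
        ' either " <lowercase letter>" or the end of the string.
        Dim re As New Regex("[A-Z].*?(?=[ ][a-z]|$)")

        Dim sentences() As String = { _
            "physicist Albert Einstein was born in Germany and", _
            "physicist Albert Einstein"}

        For Each sentence As String In sentences
            Dim m As Match = re.Match(sentence)
            If m.Success Then
                Console.WriteLine("|" & m.Value & "|")
            End If
        Next
    End Sub

End Module
```

Both sentences should print |Albert Einstein|. Because the lookahead consumes nothing, the whole match is the name and no capture group is needed.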
Author
12 Aug 2006 9:31 PM
Fabio
"Johnny Williams" <REMOVEjohnwilliams_***@NOThotmail.com> wrote in
message news:eblb5m$d5g$1@news.freedom2surf.net...


> The regex in the following VB code nearly does the job:

>        Dim re As Regex = New Regex("([A-Z].*?)([ ][a-z])",

I don't think this would work.
It also catches "A&&2373%% xyz", which surely isn't a valid name.

> However, it's not quite correct because I also want it to match when there
> is nothing following the name, e.g. in "physicist Albert Einstein".
>
> Any ideas?  Thanks.

Mine does this, and yours does too, but yours fails if there is something
after the name.
I don't understand why the one I suggested doesn't work for you.


--

Free .Net Reporting Tool - http://www.neodatatype.net
Author
12 Aug 2006 11:12 PM
Johnny Williams
"Fabio" <znt.fa***@virgilio.it> wrote in message
news:uokj7alvGHA.4444@TK2MSFTNGP05.phx.gbl...
> "Johnny Williams" <REMOVEjohnwilliams_***@NOThotmail.com> wrote in
> message news:eblb5m$d5g$1@news.freedom2surf.net...
>
>
>> The regex in the following VB code nearly does the job:
>
>>        Dim re As Regex = New Regex("([A-Z].*?)([ ][a-z])",
>
> I don't think this would work.
> It also catches "A&&2373%% xyz", which surely isn't a valid name.
>

There won't be funny characters like that in the names, so that isn't an
issue.

>> However, it's not quite correct because I also want it to match when
>> there is nothing following the name, e.g. in "physicist Albert Einstein".
>>
>> Any ideas?  Thanks.
>
> Mine does this, and yours does too, but yours fails if there is something
> after the name.
> I don't understand why the one I suggested doesn't work for you.

This is your regex:

(?<name>[A-Z][a-z]+)

As I said, that's the standard regex to match all capitalised words in a
string, and match them as separate strings.  I'd like a regex which matches
the names as a single string.

To restate the problem with examples:

1. "xxx xxxxx Firstname Lastname xxx" must match "Firstname Lastname" as a
single string.
2. "xx xxx Firstname Middlename Lastname xx xxx" must match "Firstname
Middlename Lastname" as a single string.
3. "xxxx Firstname Lastname" (nothing after Lastname) must match "Firstname
Lastname" as a single string.

That is the scope of the problem; nothing more, nothing less.

Thanks for your help!
Author
12 Aug 2006 7:32 PM
Spam Catcher
"Johnny Williams" <REMOVEjohnwilliams_***@NOThotmail.com> wrote in
news:ebl2ec$76l$1@news.freedom2surf.net:

> A regex which matches from the first uppercase character to the first
> lowercase character preceded by a space would work, but all my
> attempts have so far failed!

What about New York, New Hampshire, New Orleans, Abu Dhabi, Big Island,
etc? Is that a person's name? :-)

Name matching is quite hard to do ... might be easier to preload a list
of known names to match ... or some sort of fulltext search engine.
Author
12 Aug 2006 7:51 PM
Johnny Williams
"Spam Catcher" <spamhoneypot@rogers.com> wrote in message
news:Xns981D9E1791AA4usenethoneypotrogers@127.0.0.1...
> "Johnny Williams" <REMOVEjohnwilliams_***@NOThotmail.com> wrote in
> news:ebl2ec$76l$1@news.freedom2surf.net:
>
>> A regex which matches from the first uppercase character to the first
>> lowercase character preceded by a space would work, but all my
>> attempts have so far failed!
>
> What about New York, New Hampshire, New Orleans, Abu Dhabi, Big Island,
> etc? Is that a person's name? :-)
>
> Name matching is quite hard to do ... might be easier to preload a list
> of known names to match ... or some sort of fulltext search engine.

Yes, in my case they are 'names'.    I'm not trying to determine whether
they are real names or known names.

For my purposes, a 'name' within a string is a sequence of one or more
capitalised words.  Put simply, everything from the first uppercase letter
in the string up to (but not including) the first space that is followed by
a lowercase letter is a name.
Author
13 Aug 2006 3:45 AM
Jay B. Harlow [MVP - Outlook]
Johnny,
You should be able to take the expression to find one Word, and modify it to
find a Word followed by one or more Words separated (preceded really) by
whitespace...

Something like:

        Dim pattern As String = "(\b\p{Lu}\p{Ll}+)(\s+\p{Lu}\p{Ll}+)*\b"
        Static parser As New Regex(pattern, RegexOptions.Compiled)

        Dim inputs() As String = { _
            "physicist Albert Einstein was born in Germany and", _
            "scientist Sir Isaac Newton wrote the Philosophiae Naturalis Principia Mathematica"}

        For Each input As String In inputs
            For Each match As Match In parser.Matches(input)
                Debug.WriteLine(match.Value)
            Next
        Next

Produces:
    Albert Einstein
    Germany
    Sir Isaac Newton
    Philosophiae Naturalis Principia Mathematica

FWIW: \p{Lu} matches any upper case letter, not just A-Z, while \p{Ll}
matches any lower case letter, not just a-z. That includes, for example,
accented & umlauted letters and letters in other alphabets...

http://msdn2.microsoft.com/en-us/library/system.globalization.unicodecategory.aspx

The \b ensures the phrases start on & end on a "word boundary" (Albert will
match, but Bert in alBert will not).
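
To make the difference between the Unicode categories and the ASCII ranges concrete, here is a small sketch that runs both variants over a sentence containing accented letters. The module name `UnicodeCategorySketch` and the sample sentence are illustrative assumptions, and the source file needs to be saved in an encoding that preserves the accented characters (UTF-8, for instance).

```vb
Imports System
Imports System.Text.RegularExpressions

Module UnicodeCategorySketch

    Sub Main()
        Dim sentence As String = "composer Antonín Dvořák wrote nine symphonies"

        ' ASCII-only classes: the accented letters fall outside [a-z].
        Dim asciiOnly As New Regex("\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")

        ' Unicode categories: Lu = uppercase letter, Ll = lowercase letter.
        Dim unicodeAware As New Regex("\b\p{Lu}\p{Ll}+(?:\s+\p{Lu}\p{Ll}+)*\b")

        Console.WriteLine(asciiOnly.IsMatch(sentence))
        Console.WriteLine(unicodeAware.Match(sentence).Value)
    End Sub

End Module
```

The ASCII-only pattern should report False here: each word breaks off at an accented letter, and since .NET treats that letter as a word character there is no \b at that point. The \p{Lu}/\p{Ll} pattern should return the full two-word name.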

--
Hope this helps
Jay B. Harlow [MVP - Outlook]
.NET Application Architect, Enthusiast, & Evangelist
T.S. Bradley - http://www.tsbradley.net


Author
13 Aug 2006 3:23 PM
Johnny Williams
"Jay B. Harlow [MVP - Outlook]" <Jay_Harlow_***@tsbradley.net> wrote in
message news:%23BpvyrovGHA.1808@TK2MSFTNGP06.phx.gbl...
> Johnny,
> You should be able to take the expression to find one Word, and modify it
> to
> find a Word followed by one or more Words separated (preceded really) by
> whitespace...
>
> Something like:
>
>        Dim pattern As String = "(\b\p{Lu}\p{Ll}+)(\s+\p{Lu}\p{Ll}+)*\b"
>        Static parser As New Regex(pattern, RegexOptions.Compiled)
>
>        Dim inputs() As String = {"physicist Albert Einstein was born in
> Germany and", _
>            "scientist Sir Isaac Newton wrote the Philosophiae Naturalis
> Principia Mathematica"}
>
>        For Each input As String In inputs
>            For Each match As Match In parser.Matches(input)
>                Debug.WriteLine(match.Value)
>            Next
>        Next
>
> Produces:
>    Albert Einstein
>    Germany
>    Sir Isaac Newton
>    Philosophiae Naturalis Principia Mathematica
>
> FWIW: \p{Lu} matches any upper case letter; not just A-Z; while \p{Ll}
> matches any lower case letter, not just a-z; For example accented &
> umlauted
> letters or letters in other alphabets...
>
> http://msdn2.microsoft.com/en-us/library/system.globalization.unicodecategory.aspx
>
> The \b ensures the phrases start on & end on a "word boundary" (Albert
> will
> match, but Bert in alBert will not).
>
> --
> Hope this helps
> Jay B. Harlow [MVP - Outlook]
> .NET Application Architect, Enthusiast, & Evangelist
> T.S. Bradley - http://www.tsbradley.net

Brilliant!  That works nicely.

Thanks Jay.
Author
14 Aug 2006 2:02 PM
Jim Wooley
A couple other exceptions to watch for:

"The Albert Einstein Center at the University..." (match only Albert Einstein?)
"Shawn O'Malley stepped up to the plate..." (Match Shawn O'Malley?)
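
For the second case, one way to sketch a fix is to let a capitalised word carry an optional apostrophe prefix such as O'. The `word` fragment and the module name `ApostropheSketch` below are illustrative assumptions, not something proposed in the thread. The sketch does nothing for the first case: "The Albert Einstein Center" is still a single run of capitalised words, so a pattern of this kind would return it whole.

```vb
Imports System
Imports System.Text.RegularExpressions

Module ApostropheSketch

    Sub Main()
        ' One capitalised word, optionally with an O'M-style prefix.
        Dim word As String = "\p{Lu}(?:'\p{Lu})?\p{Ll}+"

        ' One word followed by any number of whitespace-separated words.
        Dim re As New Regex("\b" & word & "(?:\s+" & word & ")*\b")

        Dim m As Match = re.Match("Shawn O'Malley stepped up to the plate")
        Console.WriteLine(m.Value)
    End Sub

End Module
```

This should print Shawn O'Malley as one match. Without the optional prefix, the apostrophe would end the run after "Shawn" and leave "Malley" as a separate match.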

Jim Wooley
http://devauthority.com/blogs/jwooley/default.aspx

Author
13 Aug 2006 11:01 PM
Lars Graeve
Hi,

try this:

Dim pattern As String = "^(\S+\s)(([A-Z]+\S*\s)+)((\S*\s*)*)$"

Label1.Text = Regex.Replace(TextBox1.Text, pattern, "$2")



Author
15 Aug 2006 7:27 AM
C-Services Holland b.v.
Johnny Williams wrote:

> I'm struggling to create a regular expression for use with VB .Net which
> matches a person's name in a string
> of words.
>
> For example in "physicist Albert Einstein was born in Germany and"
> I want to match "Albert Einstein"
>
> In "scientist Sir Isaac Newton wrote the Philosophiae Naturalis Principia
> Mathematica"
> I want to match "Sir Isaac Newton"
>
> In all cases the names are capitalised and the first word in the string
> starts with a lower case character and the first word after the name starts
> with a lower case character.
>
> A regex which matches from the first uppercase character to the first
> lowercase character preceded by a space would work, but all my attempts have
> so far failed!
>
> Thanks for any help.
>
>

Have you considered names like Ferdinand von Zeppelin or mine, Rinze van
Huizen? The complete name *includes* the von or van part, so saying a
name always consists of consecutive capitalised words is wrong in this case.
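
If such names did need to be covered, one possible sketch is to allow a single lowercase particle in front of each following capitalised word. The particle list (von, van, de), the module name `ParticleSketch` and the sample sentence are assumptions for illustration only; real data would need a longer list.

```vb
Imports System
Imports System.Text.RegularExpressions

Module ParticleSketch

    Sub Main()
        ' Hypothetical particle list; extend it for the names you expect.
        Dim particle As String = "(?:von|van|de)"
        Dim word As String = "\p{Lu}\p{Ll}+"

        ' A capitalised word, then more capitalised words, each optionally
        ' preceded by one lowercase particle.
        Dim re As New Regex("\b" & word & "(?:\s+(?:" & particle & "\s+)?" & word & ")*\b")

        Dim m As Match = re.Match("engineer Ferdinand von Zeppelin built airships")
        Console.WriteLine(m.Value)
    End Sub

End Module
```

This should print Ferdinand von Zeppelin. A particle is only accepted when a capitalised word follows it, so an ordinary lowercase word after the name still ends the match.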


--
Rinze van Huizen
C-Services Holland b.v
Author
15 Aug 2006 8:13 PM
Johnny Williams
"C-Services Holland b.v." <c**@REMOVEcsh4u.nl> wrote in message
news:ieydnXlFwq6G63zZRVnytQ@zeelandnet.nl...
> Johnny Williams wrote:
>
>> I'm struggling to create a regular expression for use with VB .Net which
>> matches a person's name in a string
>> of words.
>>
>> For example in "physicist Albert Einstein was born in Germany and"
>> I want to match "Albert Einstein"
>>
>> In "scientist Sir Isaac Newton wrote the Philosophiae Naturalis Principia
>> Mathematica"
>> I want to match "Sir Isaac Newton"
>>
>> In all cases the names are capitalised and the first word in the string
>> starts with a lower case character and the first word after the name
>> starts
>> with a lower case character.
>>
>> A regex which matches from the first uppercase character to the first
>> lowercase character preceded by a space would work, but all my attempts
>> have so far failed!
>>
>> Thanks for any help.
>>
>>
>
> Have you considered names like Ferdinand von Zeppelin or mine, Rinze van
> Huizen? The complete name *includes* the von or van part, so saying a
> name always consists of consecutive capitalised words is wrong in this case.
>
>
> --
> Rinze van Huizen
> C-Services Holland b.v

Hi Rinze, no I hadn't considered names like yours.  In my case the full name
always consists of 2 or 3 capitalised names so this issue won't arise.

Thanks for your contribution.