Regular expression to match person's name

"Johnny Williams" <REMOVEjohnwilliams_***@NOThotmail.com> wrote:

I'm struggling to create a regular expression for use with VB .Net which matches a person's name in a string of words.

For example in "physicist Albert Einstein was born in Germany and"
I want to match "Albert Einstein"

In "scientist Sir Isaac Newton wrote the Philosophiae Naturalis Principia Mathematica"
I want to match "Sir Isaac Newton"

In all cases the names are capitalised, the first word in the string starts with a lower case character, and the first word after the name starts with a lower case character.

A regex which matches from the first uppercase character to the first lowercase character preceded by a space would work, but all my attempts have so far failed!

Thanks for any help.

Reply from "Fabio" <znt.fa***@virgilio.it>:

"Johnny Williams" <REMOVEjohnwilliams_***@NOThotmail.com> wrote in message
> A regex which matches from the first uppercase character to the first
> lowercase character preceded by a space would work, but all my attempts
> have so far failed!

(?<name>[A-Z][a-z]+)

The real problem is: how do you distinguish Albert from Germany as a valid name?

Reply from "Johnny Williams" <REMOVEjohnwilliams_***@NOThotmail.com>:

"Fabio" <znt.fa***@virgilio.it> wrote in message news:eU5eI$jvGHA.356@TK2MSFTNGP04.phx.gbl...
> (?<name>[A-Z][a-z]+)
>
> The real problem is: how do you distinguish Albert from Germany as a valid
> name?

Thanks for your reply Fabio. Your regex is the standard one for matching capitalised words, and as such will match Albert, Einstein and Germany as separate words, and as you say there is no way of determining which of these is a valid name. I would like to avoid this problem by extracting the whole name in a single match.

The regex in the following VB code nearly does the job:

===== Start Module1.vb ======
'Visual Basic Console Application
Imports System.Text.RegularExpressions

Module Module1
    Sub Main()
        Dim re As Regex = New Regex("([A-Z].*?)([ ][a-z])", RegexOptions.Compiled)
        Dim reMatch As Match
        Dim input, name As String

        input = "physicist Albert Einstein was born in Germany and"
        reMatch = re.Match(input)
        If reMatch.Success Then
            'Group 1 holds the name; group 2 is the space and lowercase letter that follow it
            name = reMatch.Groups(1).Value
            Debug.WriteLine("|" + name + "|")
        End If
    End Sub
End Module
======= End Module1.vb ========

The above regex correctly matches "Albert Einstein" in the input string as a single match, and also "Sir Isaac Newton" in my other test string. However, it's not quite correct because I also want it to match when there is nothing following the name, e.g. in "physicist Albert Einstein".

Any ideas? Thanks.

Reply from "Fabio" <znt.fa***@virgilio.it>:

"Johnny Williams" <REMOVEjohnwilliams_***@NOThotmail.com> wrote in message news:eblb5m$d5g$1@news.freedom2surf.net...
> The regex in the following VB code nearly does the job:
> Dim re As Regex = New Regex("([A-Z].*?)([ ][a-z])",

I don't think this would work. It also catches "A&&2373%% xyz", which is surely not a valid name.

> However, it's not quite correct because I also want it to match when there
> is nothing following the name, e.g. in "physicist Albert Einstein".
>
> Any ideas? Thanks.

Mine does this, and yours does too, but yours fails if there is something after the name. I don't understand why the one I suggested doesn't work for you.

Reply from "Johnny Williams" <REMOVEjohnwilliams_***@NOThotmail.com>:

"Fabio" <znt.fa***@virgilio.it> wrote in message news:uokj7alvGHA.4444@TK2MSFTNGP05.phx.gbl...
> I don't think this would work.
> It also catches "A&&2373%% xyz", which is surely not a valid name.

There won't be funny characters like that in the names, so that isn't an issue.

> Mine does this, and yours does too, but yours fails if there is something after the
> name.
> I don't understand why the one I suggested doesn't work for you.

This is your regex:

(?<name>[A-Z][a-z]+)

As I said, that's the standard regex to match all capitalised words in a string, and it matches them as separate strings. I'd like a regex which matches the names as a single string.

To restate the problem with examples:

1. "xxx xxxxx Firstname Lastname xxx" must match "Firstname Lastname" as a single string.
2. "xx xxx Firstname Middlename Lastname xx xxx" must match "Firstname Middlename Lastname" as a single string.
3. "xxxx Firstname Lastname" (nothing after Lastname) must match "Firstname Lastname" as a single string.

That is the scope of the problem; nothing more, nothing less.

Thanks for your help!

Reply from "Spam Catcher" <spamhoneypot@rogers.com>:

"Johnny Williams" <REMOVEjohnwilliams_***@NOThotmail.com> wrote in news:ebl2ec$76l$1@news.freedom2surf.net:
> A regex which matches from the first uppercase character to the first
> lowercase character preceded by a space would work, but all my
> attempts have so far failed!

What about New York, New Hampshire, New Orleans, Abu Dhabi, Big Island, etc? Is that a person's name? :-)

Name matching is quite hard to do ... might be easier to preload a list of known names to match ... or some sort of fulltext search engine.
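
Going back to Johnny's near-miss regex and his third restated case (nothing after the name), one possible adjustment is to replace the trailing group with a lookahead that accepts either a space plus lowercase letter or the end of the input. This is only a sketch: the lookahead `(?= [a-z]|$)` and the module name are assumptions added here, not something posted in the thread, and Fabio's objection still applies because `.*?` accepts any characters.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module EndOfStringSketch
    Sub Main()
        ' Lazily match from the first uppercase letter up to, but not including,
        ' either " <lowercase letter>" or the end of the input.
        ' The lookahead is an assumption added for illustration.
        Dim re As New Regex("[A-Z].*?(?= [a-z]|$)")

        Dim inputs() As String = {"physicist Albert Einstein was born in Germany and", _
                                  "scientist Sir Isaac Newton wrote the Philosophiae Naturalis Principia Mathematica", _
                                  "physicist Albert Einstein"}

        For Each input As String In inputs
            ' Match returns the leftmost match only
            Dim m As Match = re.Match(input)
            If m.Success Then
                Console.WriteLine("|" & m.Value & "|")
            End If
        Next
    End Sub
End Module
```

Because the lookahead does not consume anything, the whole match is the name itself and no capture group is needed.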

Reply from "Johnny Williams" <REMOVEjohnwilliams_***@NOThotmail.com>:

"Spam Catcher" <spamhoneypot@rogers.com> wrote in message news:Xns981D9E1791AA4usenethoneypotrogers@127.0.0.1...
> What about New York, New Hampshire, New Orleans, Abu Dhabi, Big Island,
> etc? Is that a person's name? :-)
>
> Name matching is quite hard to do ... might be easier to preload a list
> of known names to match ... or some sort of fulltext search engine.

Yes, in my case they are 'names'. I'm not trying to determine whether they are real names or known names.

For my purposes, a 'name' within a string is a sequence of one or more capitalised words. Put simply, all the characters from the first uppercase letter in the string to the first lowercase letter preceded by a space is a name.

Reply from "Jay B. Harlow [MVP - Outlook]" <Jay_Harlow_***@tsbradley.net>:

Johnny,
You should be able to take the expression to find one Word, and modify it to find a Word followed by one or more Words separated (preceded really) by whitespace...

Something like:

    Dim pattern As String = "(\b\p{Lu}\p{Ll}+)(\s+\p{Lu}\p{Ll}+)*\b"
    Static parser As New Regex(pattern, RegexOptions.Compiled)

    Dim inputs() As String = {"physicist Albert Einstein was born in Germany and", _
        "scientist Sir Isaac Newton wrote the Philosophiae Naturalis Principia Mathematica"}

    For Each input As String In inputs
        For Each match As Match In parser.Matches(input)
            Debug.WriteLine(match.Value)
        Next
    Next

Produces:

    Albert Einstein
    Germany
    Sir Isaac Newton
    Philosophiae Naturalis Principia Mathematica

FWIW: \p{Lu} matches any upper case letter, not just A-Z, while \p{Ll} matches any lower case letter, not just a-z. For example, accented & umlauted letters or letters in other alphabets...

http://msdn2.microsoft.com/en-us/library/system.globalization.unicodecategory.aspx

The \b ensures the phrases start on & end on a "word boundary" (Albert will match, but Bert in alBert will not).

--
Hope this helps
Jay B. Harlow [MVP - Outlook]
.NET Application Architect, Enthusiast, & Evangelist
T.S. Bradley - http://www.tsbradley.net
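
Jay's loop reports every run of capitalised words, including "Germany". If only the name is wanted, and the name is always the first capitalised run as Johnny describes, a small variation is to call `Regex.Match` (leftmost match only) instead of `Regex.Matches`. The sketch below assumes exactly that; the module name and the third test string are illustrative, and the capture groups are made non-capturing since only the whole match is used.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module FirstRunSketch
    Sub Main()
        ' Jay's pattern: one capitalised word followed by zero or more capitalised words
        Dim parser As New Regex("\b\p{Lu}\p{Ll}+(?:\s+\p{Lu}\p{Ll}+)*\b")

        Dim inputs() As String = {"physicist Albert Einstein was born in Germany and", _
                                  "scientist Sir Isaac Newton wrote the Philosophiae Naturalis Principia Mathematica", _
                                  "physicist Albert Einstein"}

        For Each input As String In inputs
            ' Match stops at the first (leftmost) run, so later runs are ignored
            Dim first As Match = parser.Match(input)
            If first.Success Then
                Console.WriteLine(first.Value)
            End If
        Next
    End Sub
End Module
```

With these inputs the loop should print "Albert Einstein", "Sir Isaac Newton" and "Albert Einstein". The last input also covers the case where nothing follows the name.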

Reply from "Johnny Williams" <REMOVEjohnwilliams_***@NOThotmail.com>:

"Jay B. Harlow [MVP - Outlook]" <Jay_Harlow_***@tsbradley.net> wrote in message news:%23BpvyrovGHA.1808@TK2MSFTNGP06.phx.gbl...
> You should be able to take the expression to find one Word, and modify it
> to find a Word followed by one or more Words separated (preceded really) by
> whitespace...
>
> Dim pattern As String = "(\b\p{Lu}\p{Ll}+)(\s+\p{Lu}\p{Ll}+)*\b"

Brilliant! That works nicely. Thanks Jay.

Reply from Jim Wooley:

A couple of other exceptions to watch for:

"The Albert Einstein Center at the University..." (match only Albert Einstein?)
"Shawn O'Malley stepped up to the plate..." (match Shawn O'Malley?)

Jim Wooley
http://devauthority.com/blogs/jwooley/default.aspx

Reply from another poster:

Hi,
try this:

    Dim pattern As String = "^(\S+\s)(([A-Z]+\S*\s)+)((\S*\s*)*)$"
    Label1.Text = Regex.Replace(TextBox1.Text, pattern, "$2")

Reply from Rinze van Huizen:

Johnny Williams wrote:
> In all cases the names are capitalised and the first word in the string
> starts with a lower case character and the first word after the name starts
> with a lower case character.

Have you considered names like Ferdinand von Zeppelin, or mine, Rinze van Huizen? The complete name *includes* the von or van part, yet saying a name always has capitalised sequential words is wrong in this case.

--
Rinze van Huizen
C-Services Holland b.v
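
Jim's apostrophe case and Rinze's lowercase particles could be handled by loosening Jay's pattern a little. The following is a rough sketch under stated assumptions, not a tested solution from the thread: the particle list (von, van) is deliberately tiny, the sample sentences are made up for illustration, and `[\p{L}']+` is looser than Jay's `\p{Ll}+` because it also accepts mixed-case and all-caps words. It does nothing for Jim's "The Albert Einstein Center" case, which capitalisation alone cannot resolve.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ParticleSketch
    Sub Main()
        ' A capitalised word may contain letters and apostrophes (O'Malley).
        ' Between capitalised words, lowercase particles "von" or "van" are allowed.
        ' Both choices are assumptions made for this example.
        Dim pattern As String = "\b\p{Lu}[\p{L}']+(?:\s+(?:(?:von|van)\s+)*\p{Lu}[\p{L}']+)*"
        Dim parser As New Regex(pattern)

        ' Hypothetical sample sentences
        Dim inputs() As String = {"aviator Ferdinand von Zeppelin built airships", _
                                  "batter Shawn O'Malley stepped up to the plate", _
                                  "developer Rinze van Huizen replied"}

        For Each input As String In inputs
            ' Only the first run is taken as the name
            Dim first As Match = parser.Match(input)
            If first.Success Then
                Console.WriteLine(first.Value)
            End If
        Next
    End Sub
End Module
```

A particle is only consumed when a capitalised word follows it, so a stray "van" before a lowercase word ends the match rather than being included.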

Reply from "Johnny Williams" <REMOVEjohnwilliams_***@NOThotmail.com>:

"C-Services Holland b.v." <c**@REMOVEcsh4u.nl> wrote in message news:ieydnXlFwq6G63zZRVnytQ@zeelandnet.nl...
> Have you considered names like Ferdinand von Zeppelin, or mine, Rinze van
> Huizen? The complete name *includes* the von or van part, yet saying a
> name always has capitalised sequential words is wrong in this case.

Hi Rinze, no I hadn't considered names like yours. In my case the full name always consists of 2 or 3 capitalised names so this issue won't arise. Thanks for your contribution.