Home All Groups Group Topic Archive Search About
Author
20 Nov 2006 3:48 PM
Paul
I have a process that I want to speed up.  It first was written in
Microsoft Access then converted to VB.NET.  I would like to hear some
suggestions on how to speed it up.  The process is to match names using
a Soundex function.  It reads through a table of about 20,000 records
and matches it with a table of about 9,000 records.  That is all done
in code.  The tables are stored in SQL Server 2005.

Soundex is a numerical formula to generate a number based off of a
string.  The first thing done is to generate the Soundex value for the
20,000 records.  The matching table of 9,000 records already has it's
Soundex value calculated for it.  Like I said, this part is done in
code.  It runs for about 6 minutes.

Based off of the Soundex values, it generates about 300,000 matches.
Each of these 300,000 matches then has to compare the names to get a
percent match.  We have tried to write a stored proc to handle this,
but it is really slow.  The 300,000 records are first being written to
a table and then each of the 300,000 get their percent match
calculated.  This takes anywhere from 10 to 15 minutes.

Any suggestions on how to speed this up?

Author
20 Nov 2006 4:48 PM
Spam Catcher
"Paul" <pwh***@hotmail.com> wrote in news:1164037715.384036.201420
@e3g2000cwe.googlegroups.com:

> I have a process that I want to speed up.  It first was written in
> Microsoft Access then converted to VB.NET.  I would like to hear some
> suggestions on how to speed it up.  The process is to match names using
> a Soundex function.  It reads through a table of about 20,000 records
> and matches it with a table of about 9,000 records.  That is all done
> in code.  The tables are stored in SQL Server 2005.

Why aren't you using SQL Server's Soundex or Full-Text Search capabilities?

> Based off of the Soundex values, it generates about 300,000 matches.
> Each of these 300,000 matches then has to compare the names to get a
> percent match.  We have tried to write a stored proc to handle this,
> but it is really slow.  The 300,000 records are first being written to
> a table and then each of the 300,000 get their percent match
> calculated.  This takes anywhere from 10 to 15 minutes.

Why is your function returning 300,000 matches? Perhaps your filters aren't
specific enough?

> Any suggestions on how to speed this up?

Anything wrong with SQL Server's Soundex function?

http://msdn2.microsoft.com/en-us/library/ms187384.aspx
Author
20 Nov 2006 5:59 PM
Paul
Yes, I thought about using that, but that is not really where the time
problem is.  Plus, I do a lot other massaging of the data prior to when
the Soundex values are calculated.  The real time problems are when the
300,000 are first INSERTed and then the same 300,000 are processed to
calculate the percent match.

We have two ways of doing this.  The first was through code and the
second was through a stored proc.  Both take about the same amount of
time.
Author
20 Nov 2006 7:27 PM
Spam Catcher
"Paul" <pwh***@hotmail.com> wrote in news:1164045599.255769.59430
@h54g2000cwb.googlegroups.com:

> Yes, I thought about using that, but that is not really where the time
> problem is.  Plus, I do a lot other massaging of the data prior to when
> the Soundex values are calculated.  The real time problems are when the
> 300,000 are first INSERTed and then the same 300,000 are processed to
> calculate the percent match.

Inserting 300,000 records is ALOT of data. How are you doing this right
now?  Perhaps take a look at SQL DTS or SQL Integration Services.

Or even better - massage the data during the load process... so that you
have clean data to work with : )
Author
20 Nov 2006 7:13 PM
RobinS
Assuming it doesn't change unless the data stored therein
changes, store the Soundex numbers for the 20,000 records
instead of calculating it on the fly.

How are you comparing the names? Are they identical?
Could you do an inner join to pick up the ones that
match and compare the Soundex values at the same time?

If you can't do the compares at the same time, which
gives you a smaller sample? Can you start with that
and then apply the other compare? Do they have to be
done in a specific order?

If you have 20,000 records and 9,000 records and matching
them gives you 300,000 records, you have a serious cartesian
join going on. Is there any way you can reduce that?

Not knowing your data or your data structures, those are
my first ideas.

Robin S.
---------------------
Show quoteHide quote
"Paul" <pwh***@hotmail.com> wrote in message
news:1164037715.384036.201420@e3g2000cwe.googlegroups.com...
>I have a process that I want to speed up.  It first was written in
> Microsoft Access then converted to VB.NET.  I would like to hear some
> suggestions on how to speed it up.  The process is to match names using
> a Soundex function.  It reads through a table of about 20,000 records
> and matches it with a table of about 9,000 records.  That is all done
> in code.  The tables are stored in SQL Server 2005.
>
> Soundex is a numerical formula to generate a number based off of a
> string.  The first thing done is to generate the Soundex value for the
> 20,000 records.  The matching table of 9,000 records already has it's
> Soundex value calculated for it.  Like I said, this part is done in
> code.  It runs for about 6 minutes.
>
> Based off of the Soundex values, it generates about 300,000 matches.
> Each of these 300,000 matches then has to compare the names to get a
> percent match.  We have tried to write a stored proc to handle this,
> but it is really slow.  The 300,000 records are first being written to
> a table and then each of the 300,000 get their percent match
> calculated.  This takes anywhere from 10 to 15 minutes.
>
> Any suggestions on how to speed this up?
>
Author
20 Nov 2006 7:26 PM
Paul
Thanks for the reply Robin.  Below are my answers...

RobinS wrote:
> Assuming it doesn't change unless the data stored therein
> changes, store the Soundex numbers for the 20,000 records
> instead of calculating it on the fly.

I will consider that.  It would save time, but require more of a
reengineering since we do not store the Soundex value for future use.
We use it and then discard it.

>
> How are you comparing the names? Are they identical?
> Could you do an inner join to pick up the ones that
> match and compare the Soundex values at the same time?

99.99% of the names (or 100%) will not match exactly.  So that will not
help.

>
> If you can't do the compares at the same time, which
> gives you a smaller sample? Can you start with that
> and then apply the other compare? Do they have to be
> done in a specific order?

I'm not sure what you are getting at here.  But the matches are not
done in any order.

>
> If you have 20,000 records and 9,000 records and matching
> them gives you 300,000 records, you have a serious cartesian
> join going on. Is there any way you can reduce that?

What you said is correct and that is my problem.  The 300,000 records
are what I get when the join is done between the 20,000 Soundex values
and the 9,000 Soundex values.  There really is no way to get around the
300,000 records.  The 300,000 records are valid matches according to
the Soundex values.  Then I calculate the percent match between the
names.  We need to record the highest percent match for those that
match below 80% and store all matches that are 80% and above.

Show quoteHide quote
>
> Not knowing your data or your data structures, those are
> my first ideas.
>
> Robin S.
Author
26 Nov 2006 7:53 AM
Michel Posseth [MCP]
Seems like the Bottle neck is the IO of the 300.000 records


wich is actually not so much for SQL server , i wrote ones a proggy that
inserted millions of records in a sql database  this was  12 GB mysql dump
file i neede to do this import as fast as possible as it was production data
coming from a third party .


I tried everyhting , ODBC to a seperate mysql DB took 14 hours to complete ,
DTS wasn`t possible as the file contained SQL statements ( DDL , and inserts
in MYSQL native format )  ....and ....... not possible ........Long story
........:-)

In the end  i converted the data in memory , and created sperate SQL  insert
startements on the fly , i saved these statements in batches of 1000 inserts
seperated by a ; ( dot comma  ; )  in a string builder   and ececuted the
SQL on a command object , the hole proggy took now minutes to complete.

so my advise batch your IO  in this way and you will see a hughe performance
benefit


   regards

Michel Posseth


Show quoteHide quote
"Paul" <pwh***@hotmail.com> schreef in bericht
news:1164050777.701095.255670@f16g2000cwb.googlegroups.com...
> Thanks for the reply Robin.  Below are my answers...
>
> RobinS wrote:
>> Assuming it doesn't change unless the data stored therein
>> changes, store the Soundex numbers for the 20,000 records
>> instead of calculating it on the fly.
>
> I will consider that.  It would save time, but require more of a
> reengineering since we do not store the Soundex value for future use.
> We use it and then discard it.
>
>>
>> How are you comparing the names? Are they identical?
>> Could you do an inner join to pick up the ones that
>> match and compare the Soundex values at the same time?
>
> 99.99% of the names (or 100%) will not match exactly.  So that will not
> help.
>
>>
>> If you can't do the compares at the same time, which
>> gives you a smaller sample? Can you start with that
>> and then apply the other compare? Do they have to be
>> done in a specific order?
>
> I'm not sure what you are getting at here.  But the matches are not
> done in any order.
>
>>
>> If you have 20,000 records and 9,000 records and matching
>> them gives you 300,000 records, you have a serious cartesian
>> join going on. Is there any way you can reduce that?
>
> What you said is correct and that is my problem.  The 300,000 records
> are what I get when the join is done between the 20,000 Soundex values
> and the 9,000 Soundex values.  There really is no way to get around the
> 300,000 records.  The 300,000 records are valid matches according to
> the Soundex values.  Then I calculate the percent match between the
> names.  We need to record the highest percent match for those that
> match below 80% and store all matches that are 80% and above.
>
>>
>> Not knowing your data or your data structures, those are
>> my first ideas.
>>
>> Robin S.
>
Author
26 Nov 2006 9:28 AM
Cor Ligthert [MVP]
Michel,

It took me a while to find that what you wrote was what I wanted to advice
as well in a way.

Get the data using a datareader.
Process it in memory of a seperated client (or even on the server)
Set it back using the command.ExecuteNonQuery using an Insert command.

Be aware that SQL code even as it is a stored procedure is non builded code.
(although it seems that it can reuse some things on the fly).

Is it the same?

Cor

Show quoteHide quote
"Michel Posseth [MCP]" <M***@posseth.com> schreef in bericht
news:urUN%23%23SEHHA.4060@TK2MSFTNGP03.phx.gbl...
>
> Seems like the Bottle neck is the IO of the 300.000 records
>
>
> wich is actually not so much for SQL server , i wrote ones a proggy that
> inserted millions of records in a sql database  this was  12 GB mysql dump
> file i neede to do this import as fast as possible as it was production
> data coming from a third party .
>
>
> I tried everyhting , ODBC to a seperate mysql DB took 14 hours to complete
> , DTS wasn`t possible as the file contained SQL statements ( DDL , and
> inserts in MYSQL native format )  ....and ....... not possible
> ........Long story .......:-)
>
> In the end  i converted the data in memory , and created sperate SQL
> insert startements on the fly , i saved these statements in batches of
> 1000 inserts seperated by a ; ( dot comma  ; )  in a string builder   and
> ececuted the SQL on a command object , the hole proggy took now minutes to
> complete.
>
> so my advise batch your IO  in this way and you will see a hughe
> performance benefit
>
>
>   regards
>
> Michel Posseth
>
>
> "Paul" <pwh***@hotmail.com> schreef in bericht
> news:1164050777.701095.255670@f16g2000cwb.googlegroups.com...
>> Thanks for the reply Robin.  Below are my answers...
>>
>> RobinS wrote:
>>> Assuming it doesn't change unless the data stored therein
>>> changes, store the Soundex numbers for the 20,000 records
>>> instead of calculating it on the fly.
>>
>> I will consider that.  It would save time, but require more of a
>> reengineering since we do not store the Soundex value for future use.
>> We use it and then discard it.
>>
>>>
>>> How are you comparing the names? Are they identical?
>>> Could you do an inner join to pick up the ones that
>>> match and compare the Soundex values at the same time?
>>
>> 99.99% of the names (or 100%) will not match exactly.  So that will not
>> help.
>>
>>>
>>> If you can't do the compares at the same time, which
>>> gives you a smaller sample? Can you start with that
>>> and then apply the other compare? Do they have to be
>>> done in a specific order?
>>
>> I'm not sure what you are getting at here.  But the matches are not
>> done in any order.
>>
>>>
>>> If you have 20,000 records and 9,000 records and matching
>>> them gives you 300,000 records, you have a serious cartesian
>>> join going on. Is there any way you can reduce that?
>>
>> What you said is correct and that is my problem.  The 300,000 records
>> are what I get when the join is done between the 20,000 Soundex values
>> and the 9,000 Soundex values.  There really is no way to get around the
>> 300,000 records.  The 300,000 records are valid matches according to
>> the Soundex values.  Then I calculate the percent match between the
>> names.  We need to record the highest percent match for those that
>> match below 80% and store all matches that are 80% and above.
>>
>>>
>>> Not knowing your data or your data structures, those are
>>> my first ideas.
>>>
>>> Robin S.
>>
>
>