|
web
newsgroups
|
|||||||||||||||||||||||
|
|||||||||||||||||||||||
Large text file import: MVP questionHi
I need to read in a large set of text files (9GB+ each) into a database table based on fixed width lengths. There are several ways to complete this, but I am wondering if anyone has insight into the FASTEST way to read in the data and chop it up ahead of posting it into the DB. So far, things work, but they are far slower than expected. Do I need to consider a VC++ app to get a large improvement here? thx Avi MVP Question ,,, what is MVP about it ????
As you do not provide enough information it is impossiible to give you the answer you probably expect . nice things to know are : how is the data delivered ? ( are conversions necessary) what is the target database what do you currently use etc etc etc To give you an idea what is possible : I had one year ago the task to import MYSQL dump files to MS SQL server 2000 these files were + - 7 GB big and contained the MYSQL dialect so using DTS was not an option as i had a few million data rows and then a block of DDL SQL with the MYSQL dialect etc etc etc so i wrote a simple tool in VB.Net 2005 containing a file reader a stringbuilder and a SQL command object , i read the file line by line and first searched all DDL SQL , i converted this SQL Dialect to MS T-SQL and execuited it on the command object ( so now i had all the required tables ) , now i read the file again and read in all data blocks ( Insert statements, that were translated on the fly ) i executed these lines in batches of 1000 rows in a time the hole import took 3 minutes ,,, my collegue setup a MYSQL server and connected with ODBC the same operation took 4 Hours :-) so i hope to have given you some ideas regards Michel Posseth [MCP] Show quoteHide quote "Avi" <A**@discussions.microsoft.com> schreef in bericht news:203EE560-2A98-4885-9E45-F84849265598@microsoft.com... > Hi > > I need to read in a large set of text files (9GB+ each) into a database > table based on fixed width lengths. > > There are several ways to complete this, but I am wondering if anyone has > insight into the FASTEST way to read in the data and chop it up ahead of > posting it into the DB. > > So far, things work, but they are far slower than expected. > > Do I need to consider a VC++ app to get a large improvement here? > > thx > Avi Read my response to Cor. This is very MVP.
Show quoteHide quote "Michel Posseth [MCP]" wrote: > > MVP Question ,,, what is MVP about it ???? > > As you do not provide enough information it is impossiible to give you the > answer you probably expect . > > nice things to know are : > > how is the data delivered ? ( are conversions necessary) > what is the target database > what do you currently use > etc etc etc > > To give you an idea what is possible : > > I had one year ago the task to import MYSQL dump files to MS SQL server 2000 > these files were + - 7 GB big and contained the MYSQL dialect > so using DTS was not an option as i had a few million data rows and then a > block of DDL SQL with the MYSQL dialect etc etc etc > > so i wrote a simple tool in VB.Net 2005 containing a file reader a > stringbuilder and a SQL command object , i read the file line by line and > first searched all DDL SQL , i converted this SQL Dialect to MS T-SQL and > execuited it on the command object ( so now i had all the required tables ) > , now i read the file again > and read in all data blocks ( Insert statements, that were translated on > the fly ) i executed these lines in batches of 1000 rows in a time > the hole import took 3 minutes ,,, my collegue setup a MYSQL server and > connected with ODBC the same operation took 4 Hours :-) > > > so i hope to have given you some ideas > > regards > > Michel Posseth [MCP] > > > "Avi" <A**@discussions.microsoft.com> schreef in bericht > news:203EE560-2A98-4885-9E45-F84849265598@microsoft.com... > > Hi > > > > I need to read in a large set of text files (9GB+ each) into a database > > table based on fixed width lengths. > > > > There are several ways to complete this, but I am wondering if anyone has > > insight into the FASTEST way to read in the data and chop it up ahead of > > posting it into the DB. > > > > So far, things work, but they are far slower than expected. > > > > Do I need to consider a VC++ app to get a large improvement here? > > > > thx > > Avi > > > > Read my response to Cor. This is very MVP. No it isn`t , I have never seen a MVP signature at a post of Francesco Balena for instance :-) Having read the answer of Cor, i can only fully concur with him By the way ,,, i believe i have earned my stripes as a professional programmer ( made programs for thousands of users throughout europe ) in my previous Job i worked at a Automotive data provider ( the market leader in there business ) where i was not only a lead developer , but also the sql server ( 2000 , 2005 ) database administrator . One of my responsability`s was optimizing data conversions ( imports ) from external delivered data ( Mysql , flat text files etc etc ) I worked only with Poweredges , with at least 8 gigs of mem , in raid 10 ( and no i did not have one of them we had i believe 25 of those beasts in our server room , besides the other servers the companny had ) So finished beating my chest , i can tell you that nobody MVP, MCP , MCAD , MCSD or Steve and / or even Bill himself can`t answer your question correctly unless you provide some detailed info and you did not do that ... so ..... regards Michel Posseth just a Software developer Show quoteHide quote "Avi" wrote: > Read my response to Cor. This is very MVP. > > > "Michel Posseth [MCP]" wrote: > > > > > MVP Question ,,, what is MVP about it ???? > > > > As you do not provide enough information it is impossiible to give you the > > answer you probably expect . > > > > nice things to know are : > > > > how is the data delivered ? ( are conversions necessary) > > what is the target database > > what do you currently use > > etc etc etc > > > > To give you an idea what is possible : > > > > I had one year ago the task to import MYSQL dump files to MS SQL server 2000 > > these files were + - 7 GB big and contained the MYSQL dialect > > so using DTS was not an option as i had a few million data rows and then a > > block of DDL SQL with the MYSQL dialect etc etc etc > > > > so i wrote a simple tool in VB.Net 2005 containing a file reader a > > stringbuilder and a SQL command object , i read the file line by line and > > first searched all DDL SQL , i converted this SQL Dialect to MS T-SQL and > > execuited it on the command object ( so now i had all the required tables ) > > , now i read the file again > > and read in all data blocks ( Insert statements, that were translated on > > the fly ) i executed these lines in batches of 1000 rows in a time > > the hole import took 3 minutes ,,, my collegue setup a MYSQL server and > > connected with ODBC the same operation took 4 Hours :-) > > > > > > so i hope to have given you some ideas > > > > regards > > > > Michel Posseth [MCP] > > > > > > "Avi" <A**@discussions.microsoft.com> schreef in bericht > > news:203EE560-2A98-4885-9E45-F84849265598@microsoft.com... > > > Hi > > > > > > I need to read in a large set of text files (9GB+ each) into a database > > > table based on fixed width lengths. > > > > > > There are several ways to complete this, but I am wondering if anyone has > > > insight into the FASTEST way to read in the data and chop it up ahead of > > > posting it into the DB. > > > > > > So far, things work, but they are far slower than expected. > > > > > > Do I need to consider a VC++ app to get a large improvement here? > > > > > > thx > > > Avi > > > > > > =?Utf-8?B?QXZp?= <A**@discussions.microsoft.com> wrote in
news:203EE560-2A98-4885-9E45-F84849265598@microsoft.com: You can do this via SQL BCP, Bulk Inserts or DTS.> There are several ways to complete this, but I am wondering if anyone > has insight into the FASTEST way to read in the data and chop it up > ahead of posting it into the DB. > We have a project in which we build a dynamic DTS package via VB.NET to do imports. The advantage of this is that the DTS graph can be dynamically built to handle a variety of file layouts. DTS can also handle special characters during import - something Bulk Inserts doesn't do. With BCP you'll need to create a format file which is a bit of a hassle if your file import specifications change a lot. Thanks for the response - but DTS is not an option.
Show quoteHide quote "Spam Catcher" wrote: > =?Utf-8?B?QXZp?= <A**@discussions.microsoft.com> wrote in > news:203EE560-2A98-4885-9E45-F84849265598@microsoft.com: > > > There are several ways to complete this, but I am wondering if anyone > > has insight into the FASTEST way to read in the data and chop it up > > ahead of posting it into the DB. > > > > You can do this via SQL BCP, Bulk Inserts or DTS. > > We have a project in which we build a dynamic DTS package via VB.NET to do > imports. The advantage of this is that the DTS graph can be dynamically > built to handle a variety of file layouts. DTS can also handle special > characters during import - something Bulk Inserts doesn't do. With BCP > you'll need to create a format file which is a bit of a hassle if your file > import specifications change a lot. > =?Utf-8?B?QXZp?= <A**@discussions.microsoft.com> wrote in news:126A0FF0-
D357-4EFA-BC8B-1E5DDFA04***@microsoft.com: > Thanks for the response - but DTS is not an option. Any particular reason?If not - guess you're stuck with BCP or Bulk Inserts. Avi,
For this kind of question you can better ask a plumber to give you the answer, water cannot go faster throug a pipe than it goes through the smallest part of the pipe. The same is with data. So in my opinion you should always concentrate on that smallest part. Any other optimisation is waste of time. That the workstation software is to slow, is in my idea seldom or you should have old workstations P3 < 600Mhz. That the newwork trafic is slow is no bad chance and is completely dependend from the typologic of the network and its throughput speed mostly you are with big chunks faster in trouble. That the server is the place which is in trouble. That is the place were I would look first, I would try to bring as much work as possible to the client. Where the last part as Michael wrote is exactly as I think about it. This course assuming that it is not Citrix or other Terminal Server. Just my thought, Cor Show quoteHide quote "Avi" <A**@discussions.microsoft.com> schreef in bericht news:203EE560-2A98-4885-9E45-F84849265598@microsoft.com... > Hi > > I need to read in a large set of text files (9GB+ each) into a database > table based on fixed width lengths. > > There are several ways to complete this, but I am wondering if anyone has > insight into the FASTEST way to read in the data and chop it up ahead of > posting it into the DB. > > So far, things work, but they are far slower than expected. > > Do I need to consider a VC++ app to get a large improvement here? > > thx > Avi With all due respect to each of the respondents who have assumed that I am
asking a simple question, let me be more explicit: I am not interested in a simple asnwer - been there, done that : SQL BCP/ DTS, filest/stream readers/writers splits, sub strings , mids - the whole nine yards. I am testing this on a quad xeon with 4gb ram, dual fibre channel san array - not some small, home PC. Network issues have been accounted for (and are not part of the baseline improvement that I am trying to find). The issue is simply this: what is the fastest way to read in a fixed width text file and split it apart by its constituent fields. don't worry about the return to the DB, thats irrevelant for now. So far the benchmark has been ok - but its not brilliant. the disk I/O is well short of the capabilities. cpu and memory are barely in use. That means either our code is crap or the .net capabilitiy is crap. We have simplified our code to a simple basic test. So, now I'm focusing on the underlying .net solutions. I'm looking for real performance answers. If you don't have them - fine, I'll look elsewhere. I just thought a high end question would require a high end programmer with high end experience. Hope this clears it up. If not, perhaps you can recommend a plumber who might know something....... :) Avi Show quoteHide quote "Cor Ligthert [MVP]" wrote: > Avi, > > For this kind of question you can better ask a plumber to give you the > answer, water cannot go faster throug a pipe than it goes through the > smallest part of the pipe. > > The same is with data. So in my opinion you should always concentrate on > that smallest part. Any other optimisation is waste of time. > > That the workstation software is to slow, is in my idea seldom or you should > have old workstations P3 < 600Mhz. > > That the newwork trafic is slow is no bad chance and is completely dependend > from the typologic of the network and its throughput speed mostly you are > with big chunks faster in trouble. > > That the server is the place which is in trouble. That is the place were I > would look first, I would try to bring as much work as possible to the > client. > > Where the last part as Michael wrote is exactly as I think about it. > > This course assuming that it is not Citrix or other Terminal Server. > > Just my thought, > > Cor > > "Avi" <A**@discussions.microsoft.com> schreef in bericht > news:203EE560-2A98-4885-9E45-F84849265598@microsoft.com... > > Hi > > > > I need to read in a large set of text files (9GB+ each) into a database > > table based on fixed width lengths. > > > > There are several ways to complete this, but I am wondering if anyone has > > insight into the FASTEST way to read in the data and chop it up ahead of > > posting it into the DB. > > > > So far, things work, but they are far slower than expected. > > > > Do I need to consider a VC++ app to get a large improvement here? > > > > thx > > Avi > > > Avi,
And still the answer should be simple. Simple solutions give fast processing applications. One thing you have to keep in mind. Although there can be one ILS instruction than still can this mean thousand times of looping. The 8086 processor famillie and its descendants are not that advanced that they can do things in one cycle so behind your code is a lot of looping or whatever. All methods from the Visual Basic Namespace are therefore using more througput time than from the basic Net namespace. Normally no problem, humans cannot see milliseconds. The second thing you have to keep in mind because that you are saying splitting that the normal string is not mutable. This means that every change, with mid, split, tolower, or whatever will result in a new string. However overdoing this can of course cost you a lot of time. But giving you any reasonable answer withouth seeing even a piece of your current code is in my opinion impossible. If it was, then there would not have been so many possibilities. By the way, you have selected the chance on an answer very much by writting that this is an MVP question. Probably somebody who is dedicated to your problem however not a MVP has said......................................................... Two words mostly said with the middle finger in the air. I hope this helps something, Cor Show quoteHide quote "Avi" <A**@discussions.microsoft.com> schreef in bericht news:EED05933-B9B2-454C-96D4-4162C63B97BB@microsoft.com... > With all due respect to each of the respondents who have assumed that I am > asking a simple question, let me be more explicit: > > I am not interested in a simple asnwer - been there, done that : SQL BCP/ > DTS, filest/stream readers/writers splits, sub strings , mids - the whole > nine yards. > > I am testing this on a quad xeon with 4gb ram, dual fibre channel san > array > - not some small, home PC. Network issues have been accounted for (and are > not part of the baseline improvement that I am trying to find). > > The issue is simply this: what is the fastest way to read in a fixed width > text file and split it apart by its constituent fields. don't worry about > the > return to the DB, thats irrevelant for now. So far the benchmark has been > ok > - but its not brilliant. the disk I/O is well short of the capabilities. > cpu > and memory are barely in use. That means either our code is crap or the > .net > capabilitiy is crap. > > We have simplified our code to a simple basic test. So, now I'm focusing > on > the underlying .net solutions. > > I'm looking for real performance answers. If you don't have them - fine, > I'll look elsewhere. I just thought a high end question would require a > high > end programmer with high end experience. > > Hope this clears it up. If not, perhaps you can recommend a plumber who > might know something....... :) > > Avi > > > "Cor Ligthert [MVP]" wrote: > >> Avi, >> >> For this kind of question you can better ask a plumber to give you the >> answer, water cannot go faster throug a pipe than it goes through the >> smallest part of the pipe. >> >> The same is with data. So in my opinion you should always concentrate on >> that smallest part. Any other optimisation is waste of time. >> >> That the workstation software is to slow, is in my idea seldom or you >> should >> have old workstations P3 < 600Mhz. >> >> That the newwork trafic is slow is no bad chance and is completely >> dependend >> from the typologic of the network and its throughput speed mostly you are >> with big chunks faster in trouble. >> >> That the server is the place which is in trouble. That is the place were >> I >> would look first, I would try to bring as much work as possible to the >> client. >> >> Where the last part as Michael wrote is exactly as I think about it. >> >> This course assuming that it is not Citrix or other Terminal Server. >> >> Just my thought, >> >> Cor >> >> "Avi" <A**@discussions.microsoft.com> schreef in bericht >> news:203EE560-2A98-4885-9E45-F84849265598@microsoft.com... >> > Hi >> > >> > I need to read in a large set of text files (9GB+ each) into a database >> > table based on fixed width lengths. >> > >> > There are several ways to complete this, but I am wondering if anyone >> > has >> > insight into the FASTEST way to read in the data and chop it up ahead >> > of >> > posting it into the DB. >> > >> > So far, things work, but they are far slower than expected. >> > >> > Do I need to consider a VC++ app to get a large improvement here? >> > >> > thx >> > Avi >> >> >> Thanks for the incoherent response..... I guess if I look beyond the
defensive language, I'll find something useful.... Oh. I can't. In many other posts to other newsgroups, if I have not posted enough info, I simply get asked for it. I have never before been so acosted for asking a question the wrong way. I think thats the idea of newsgroups in the first place. Time to get off your high perch. I'm not perfect. And I'll bet you aren't either - MVP or not. Show quoteHide quote "Cor Ligthert [MVP]" wrote: > Avi, > > And still the answer should be simple. Simple solutions give fast processing > applications. > > One thing you have to keep in mind. Although there can be one ILS > instruction than still can this mean thousand times of looping. The 8086 > processor famillie and its descendants are not that advanced that they can > do things in one cycle so behind your code is a lot of looping or whatever. > All methods from the Visual Basic Namespace are therefore using more > througput time than from the basic Net namespace. Normally no problem, > humans cannot see milliseconds. > > The second thing you have to keep in mind because that you are saying > splitting that the normal string is not mutable. This means that every > change, with mid, split, tolower, or whatever will result in a new string. > However overdoing this can of course cost you a lot of time. > > But giving you any reasonable answer withouth seeing even a piece of your > current code is in my opinion impossible. If it was, then there would not > have been so many possibilities. > > By the way, you have selected the chance on an answer very much by writting > that this is an MVP question. Probably somebody who is dedicated to your > problem however not a MVP has > said......................................................... Two words > mostly said with the middle finger in the air. > > I hope this helps something, > > Cor > > > > "Avi" <A**@discussions.microsoft.com> schreef in bericht > news:EED05933-B9B2-454C-96D4-4162C63B97BB@microsoft.com... > > With all due respect to each of the respondents who have assumed that I am > > asking a simple question, let me be more explicit: > > > > I am not interested in a simple asnwer - been there, done that : SQL BCP/ > > DTS, filest/stream readers/writers splits, sub strings , mids - the whole > > nine yards. > > > > I am testing this on a quad xeon with 4gb ram, dual fibre channel san > > array > > - not some small, home PC. Network issues have been accounted for (and are > > not part of the baseline improvement that I am trying to find). > > > > The issue is simply this: what is the fastest way to read in a fixed width > > text file and split it apart by its constituent fields. don't worry about > > the > > return to the DB, thats irrevelant for now. So far the benchmark has been > > ok > > - but its not brilliant. the disk I/O is well short of the capabilities. > > cpu > > and memory are barely in use. That means either our code is crap or the > > .net > > capabilitiy is crap. > > > > We have simplified our code to a simple basic test. So, now I'm focusing > > on > > the underlying .net solutions. > > > > I'm looking for real performance answers. If you don't have them - fine, > > I'll look elsewhere. I just thought a high end question would require a > > high > > end programmer with high end experience. > > > > Hope this clears it up. If not, perhaps you can recommend a plumber who > > might know something....... :) > > > > Avi > > > > > > "Cor Ligthert [MVP]" wrote: > > > >> Avi, > >> > >> For this kind of question you can better ask a plumber to give you the > >> answer, water cannot go faster throug a pipe than it goes through the > >> smallest part of the pipe. > >> > >> The same is with data. So in my opinion you should always concentrate on > >> that smallest part. Any other optimisation is waste of time. > >> > >> That the workstation software is to slow, is in my idea seldom or you > >> should > >> have old workstations P3 < 600Mhz. > >> > >> That the newwork trafic is slow is no bad chance and is completely > >> dependend > >> from the typologic of the network and its throughput speed mostly you are > >> with big chunks faster in trouble. > >> > >> That the server is the place which is in trouble. That is the place were > >> I > >> would look first, I would try to bring as much work as possible to the > >> client. > >> > >> Where the last part as Michael wrote is exactly as I think about it. > >> > >> This course assuming that it is not Citrix or other Terminal Server. > >> > >> Just my thought, > >> > >> Cor > >> > >> "Avi" <A**@discussions.microsoft.com> schreef in bericht > >> news:203EE560-2A98-4885-9E45-F84849265598@microsoft.com... > >> > Hi > >> > > >> > I need to read in a large set of text files (9GB+ each) into a database > >> > table based on fixed width lengths. > >> > > >> > There are several ways to complete this, but I am wondering if anyone > >> > has > >> > insight into the FASTEST way to read in the data and chop it up ahead > >> > of > >> > posting it into the DB. > >> > > >> > So far, things work, but they are far slower than expected. > >> > > >> > Do I need to consider a VC++ app to get a large improvement here? > >> > > >> > thx > >> > Avi > >> > >> > >> > > > Email me off list (ad***@kjmsolutions.com ). I may have a solution for you
however will require a dll. Kelly -- Show quoteHide quoteGet a powerful web, database, application, and email hosting with KJM Solutions http://www.kjmsolutions.com "Avi" <A**@discussions.microsoft.com> wrote in message news:203EE560-2A98-4885-9E45-F84849265598@microsoft.com... > Hi > > I need to read in a large set of text files (9GB+ each) into a database > table based on fixed width lengths. > > There are several ways to complete this, but I am wondering if anyone has > insight into the FASTEST way to read in the data and chop it up ahead of > posting it into the DB. > > So far, things work, but they are far slower than expected. > > Do I need to consider a VC++ app to get a large improvement here? > > thx > Avi On Sun, 7 May 2006 07:46:01 -0700, Avi <A**@discussions.microsoft.com> wrote:
¤ Hi ¤ ¤ I need to read in a large set of text files (9GB+ each) into a database ¤ table based on fixed width lengths. ¤ ¤ There are several ways to complete this, but I am wondering if anyone has ¤ insight into the FASTEST way to read in the data and chop it up ahead of ¤ posting it into the DB. ¤ ¤ So far, things work, but they are far slower than expected. ¤ ¤ Do I need to consider a VC++ app to get a large improvement here? I don't know what type of database you are working with but the fastest methods are typically the native database bulk operations as Spam Catcher mentioned. If you process a 9 GB file line by line it's going to take a while regardless of what programmatic method you use. Paul ~~~~ Microsoft MVP (Visual Basic) |
|||||||||||||||||||||||