DATAMOUNTAIN.SQLMOUNTAIN.COM

The Fact Free Table

Colonel Ross still wore an expression which showed the poor opinion which he had formed of my companion's ability, but I saw by the inspector's face that his attention had been keenly aroused.

"You consider that to be important?" he asked.

"Exceedingly so."

"Is there any point to which you would wish to draw my attention?"

"To the curious incident of the dog in the night-time."

... << MORE >>

SQL PASS Submissions

My submissions for SQL PASS.

Data Transformations with SQL and SSIS


ETL is easy, right? Just move data from source A to target B… no problem, right? Actually, the ETL process calls forth questions of Form and Substance that have puzzled the great thinkers of western culture, including Aristotle, Marcus Aurelius, Ignatius of Antioch, Thomas Aquinas and others through the ages. Mike will explore the methods for transforming data using both SQL Server Integration Services and T-SQL, with occasional side tracks into the philosophical implications of ... << MORE >>

Project Management Musings

Any gathering of IT professionals (and by this I mean people who actually do the work on code or hardware) will include a gripe session about management.   This is as traditional as brides tossing wedding bouquets and conspiracy theories on the internet.   A good deal of this is simple kvetching.   For those of you who did not spend early parts of your career in Brooklyn, where kvetching is a fine art, kvetching is the habit of complaining about everything.   Nothing is ever good enough.   It is best summed up in the old ...

<< MORE >>

Coding Best Practices: A Rant

Every now and again you come across a person who makes assertions that are totally at odds with everything you thought you knew. In this case, a manager at a large software firm asserted, with the certainty that the sun will rise in the east, that code should NEVER be commented and that documentation is not necessary for good software development. I have heard these assertions before, but never with the conviction this guy brought to the debate. When someone makes such an assertion with ...

<< MORE >>

What the Partitioning Documentation Does Not Make Clear

I have been experimenting with table partitioning for certain high volume tables at my job.  

The table has a natural key of a data triplet, but the primary key is an unusual twist of encrypting the data, converting the encryption to hexadecimal, then concatenating the hex values as strings.  Two tables use this data and they get significant amounts of I/O: over 100,000 rows per day inserted into each table.  As you might imagine, the table has grown quite large and select statements are suffering, particularly with the join overhead. 

The hex values were the appropriate partition candidate.  Having read on MSDN that you can partition a heap, I thought I would replace the hex primary key with a surrogate key to minimize some of the index fragmentation we were seeing.  I would leave the hex values un-indexed, or in a nonclustered index, and institute the partition.

I read too much into the Microsoft documentation.  Once you establish a primary key, the partitioning column must be part of that key, even if the key is nonclustered. 

So to recap, creating a table as a simple heap would work, such as this:

 

CREATE TABLE dbo.HashPartitionTest
(
      HashID int identity(1,1),
      CornBeef_hash nvarchar(64) NOT NULL UNIQUE
)
ON PS_HashPartitions(CornBeef_hash);
go

 

Creating the table with the hash value included in the key would also work, as follows:

CREATE TABLE dbo.HashPartitionTest
(
      HashID int identity(1,1),
      CornBeef_hash nvarchar(64) NOT NULL UNIQUE,
      CONSTRAINT PK_HashAndEggs PRIMARY KEY
            (HashID, CornBeef_hash ASC)
)
ON PS_HashPartitions(CornBeef_hash);
go

 

However, this will not work:

CREATE TABLE dbo.HashPartitionTest
(
      HashID int identity(1,1),
      CornBeef_hash nvarchar(64) NOT NULL UNIQUE,
      CONSTRAINT PK_HashAndEggs PRIMARY KEY NONCLUSTERED
            (HashID)
)
ON PS_HashPartitions(CornBeef_hash);
go

This will generate the error “Partition columns for a unique index must be a subset of the index key.”  It generates the same error when the index is clustered.

Lesson: the partitioning column must be included in the primary key to partition this table.

Substring Situations

I got an interesting puzzle from a colleague yesterday.  Marketing wanted to know what versions of particular mobile devices people were using when they browsed our web site.  I suspect folks will be seeing requests like this more frequently.  Sales of portable and handheld devices are now surpassing sales of desktop and laptop computers.  Marketing wanted to know who was using which versions of a particular product (e.g. Android 2.1).

Whoever had “loaded” the database we were querying should be the second person up against the wall when the revolution comes.  The first will be whoever designed Mercury Quality Center – but I digress. 

Rather than shred the long string taken in by the web server, this person simply put the string into a single field in the database.  With the number of open source platforms running mobile devices, you can imagine the mess this created.  Even the “standard” of Android is not really standard, as every manufacturer has its own flavor, and often multiple flavors depending on the device and when it was released.  Combine this with iPads, iPhones and Windows Mobile, all of various versions, and you can imagine the amount of confusion that can be sown by a single 128-character string.  The biggest issue we had was the lack of standardization around what constitutes a “version” of Android.  We saw “Android 2.1”, “Android 2.1 (Éclair)”, “Android Éclair” and so on.  Apparently someone at Google has a sweet tooth and it shows: all of the builds of Android are named for desserts.

The versions of the OS were surrounded by other text of variable length, so the puzzle was how to extract what marketing needed and none of the extraneous noise that surrounded it.

A fairly typical string we saw was “Mozilla/5.0 (Linux; U; Android Eclair; en- Build/ECLAIR) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17”.  We did find that the OS versions always closed with a semicolon, so we need the substring from the start of the OS name (“Android” in this case) to the subsequent semicolon.

The T-SQL SUBSTRING function allows you to specify both a start and a length for the string you want to extract, as in:

select SUBSTRING('Call me Ishmael', @startHere, @goThisFar)

The CHARINDEX function returns an integer representing the starting point of the string we seek.  At this point it becomes an exercise in algebra. 

Start with the string below:

declare @phoneString nvarchar(max)
set @phoneString = 'Mozilla/5.0 (Linux; U; Android Eclair; en- Build/ECLAIR) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17'

 

The starting point for the substring we want can be found with

CHARINDEX('Android', @phoneString)

We want to extract everything up to the semicolon, so we find that position with

CHARINDEX(';', @phoneString,CHARINDEX('Android', @phoneString))

Remember, the SUBSTRING function takes a starting point and the length of string you want, not a starting and end point, so we need to use the following to determine the length:

CHARINDEX(';', @phoneString,CHARINDEX('Android', @phoneString)) - CHARINDEX('Android', @phoneString)
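To make the arithmetic concrete, here is a quick check against the sample string above (positions computed by hand from that string; CHARINDEX positions are 1-based):

```sql
-- For the sample @phoneString declared above:
--   CHARINDEX('Android', @phoneString)   = 24  (start of 'Android')
--   CHARINDEX(';', @phoneString, 24)     = 38  (the closing semicolon)
--   length = 38 - 24                     = 14
select SUBSTRING(@phoneString, 24, 14);  -- returns 'Android Eclair'
```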

So, to wrap everything up in a neat package for marketing we would have

select SUBSTRING(@phoneString,
      CHARINDEX('Android', @phoneString),
      CHARINDEX(';', @phoneString, CHARINDEX('Android', @phoneString)) - CHARINDEX('Android', @phoneString))

 

It turns out that many of these versions of Android are the same.  In other words, Android 2.1 = Android 2.1 (Éclair) = Android Éclair.  The better practice would be to create a dimension table that properly identifies the various labels applied to the OS and gives the users a consistent, standard view of the data.
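A minimal sketch of such a dimension table follows.  All object and column names here are hypothetical, not taken from the system described above; the sample rows are the Éclair labels mentioned earlier:

```sql
-- Sketch of a mapping (dimension) table for raw OS labels.
CREATE TABLE dbo.DimMobileOS
(
      MobileOSID      int identity(1,1) PRIMARY KEY,
      RawLabel        nvarchar(128) NOT NULL UNIQUE,  -- label as seen in the user-agent string
      StandardName    nvarchar(50)  NOT NULL,         -- normalized OS name
      StandardVersion nvarchar(20)  NOT NULL          -- normalized version
);

-- The three equivalent labels collapse to one standard name and version.
INSERT INTO dbo.DimMobileOS (RawLabel, StandardName, StandardVersion)
VALUES (N'Android 2.1',           N'Android', N'2.1'),
       (N'Android 2.1 (Éclair)',  N'Android', N'2.1'),
       (N'Android Éclair',        N'Android', N'2.1');
```

Joining the extracted substring to RawLabel then gives marketing one consistent answer no matter which flavor of the label the device reported.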

And a big shout-out to Tristen Pieh for giving me the subject matter for this post!

How to find objects that reference a single field

I have often been assigned the task of changing a particular field on a table.  Naturally, one needs to get a list of objects that could break with the change.  For example, changing a data type could break a stored procedure. 

When first assigned these tasks I would use the normal tools, such as sp_depends on the parent table, then wade through the list of objects to see if any used the particular field I was about to change.

I was getting tired of the brute force way of doing things, so I knuckled down to some research on some of the more obscure system views and discovered sys.sql_modules, a lovely object that holds the definitions of the stored procedures, views, and other SQL modules.  It is easily joined to the more familiar views for getting dependencies, resulting in the script below. 

One major snag here is the notorious inefficiency of the dependency objects.  You should expect that you will often find objects that no longer exist and other objects that have been created but have not found their way into the dependency tracking objects for one reason or another.  The only way around this little “feature” of SQL Server is to loop through each database and check definitions there.  That script is to follow.

SELECT distinct referencing_schema_name,
      referencing_entity_name,
      sre.referencing_id,
      sed.referenced_database_name
FROM sys.dm_sql_referencing_entities ('InYourDreams', 'OBJECT') sre
      inner join sys.sql_expression_dependencies sed on sre.referencing_id = sed.referencing_id
      inner join sys.sql_modules mod on mod.object_id = sed.referencing_id
WHERE mod.definition like '%Need2Guess%'
order by sed.referenced_database_name desc
GO
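In the meantime, one rough way to check module definitions in every database is the undocumented sp_MSforeachdb procedure.  This is only a sketch under that assumption, reusing the example search string from the script above:

```sql
-- Sketch: search every database's module definitions for a string.
-- sp_MSforeachdb is undocumented but ships with SQL Server; '?' is
-- replaced with each database name in turn.
EXEC sp_MSforeachdb '
      USE [?];
      SELECT ''?'' AS database_name,
             OBJECT_SCHEMA_NAME(object_id) AS schema_name,
             OBJECT_NAME(object_id)        AS object_name
      FROM sys.sql_modules
      WHERE definition LIKE ''%Need2Guess%'';';
```

Because this reads the definitions directly, it sidesteps the stale dependency metadata entirely, at the cost of a full scan of sys.sql_modules in each database.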

 

Visual Studio 2010 SP 1

If you are a fan of automating database builds through VS 2010, or use any other Visual Studio 2010 features, Service Pack 1 is now available at http://www.microsoft.com/downloads/en/details.aspx?FamilyID=75568aa6-8107-475d-948a-ef22627e57a5&displaylang=en

 

Learn StreamInsight In Denver

For those of you in Denver there will be an interesting SQL Server Users Group meeting at the Microsoft office in DTC on 17 March covering the basics of StreamInsight.  Details can be found at http://denver.sqlpass.org/default.aspx

 

Cursor vs. While Loop Smackdown!

First, my apologies for the lack of blog entries here.  I had surgery on my shoulder a few weeks ago and have been more focused on recovery than blog entries. 

A few weeks before my surgery someone remarked to me that they hated ETL.  “Just hate it!” he said.  I had a similar experience at the Rocky Mountain Tech Trifecta last weekend.  This struck me as similar to a carpenter proclaiming “I hate screwdrivers!  Just hate them!  They all suck!”  The attitude is… well, just silly.

“No cursors!  Ever!” proclaimed one of the speakers.  This struck me as slightly less extreme, but still silly.  Before the few readers I have write me off completely as a fool, I have often found that cursors are used as a crutch by junior developers or folks who do not understand SQL or set operations.  Typically someone comes along with the management magic wand, bangs it on the head of some unlucky .NET or Java developer and proclaims

“THOU ART NOW A DATABASE DEVELOPER!  USE YOUR POWER ONLY FOR GOOD!” 

You can actually hear the capitalization, larger font and red letters.  Of course, no training is forthcoming and this poor person is expected to learn on their own.  The result is that they turn to the tools they are familiar with for handling data, which all too often rely heavily on looping structures.  Those of us with minor advantages, like training, mentors and experience, end up reviewing their code and either laughing out loud or cringing in agony.  This is particularly the case when we observe the over-reliance on cursors.

So I fully recognize that cursors are a crutch.  That said, as someone recovering from surgery, I also recognize that there are times when a crutch (or a sling in my case) is required. 

In my case this came when I needed to update data on a heavily used transactional system.  Any change we made had to be transparent to the user and could not have any adverse impact on the customer facing web site.  We also had no maintenance window.  If we used a single set operation we would have faced a significant lock escalation issue, with a potentially huge adverse impact on the customer’s interaction with the web site (read: an adverse impact on the business’s ability to collect money from the customer).  As you might imagine, that was simply not acceptable.  Therefore, we had to break the data down into small, easily digested bites (pun intended).

This produced an internal debate at the office.  Should the looping structure use a cursor or a while loop? 

It turns out that is not such a simple question to answer.  What data will you look at to decide?  In this case we needed to avoid interfering with the users, so we wanted to minimize system I/O.

Let’s examine this in miniature.  I created a demo table and populated it with the simple script below:

use demo
go

create table dbo.CursorWhileSmackdown
(
      smackID int identity(1,1) primary key,
      someRandomNumber int,
      someRandomGUID uniqueidentifier
)
go

declare @i int
set @i = 0
while @i <= 10000
BEGIN --loop
      set @i = @i + 1
      insert into dbo.CursorWhileSmackdown (someRandomNumber, someRandomGUID)
      values (RAND() * 100, NEWID())
END --loop

CREATE NONCLUSTERED INDEX [IX_Random] ON [dbo].[CursorWhileSmackdown]
(
      [someRandomNumber] ASC
) WITH (STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO

 

As you can see, this populates the table with roughly 10,000 random records. 

Consider the simple example first: update every row with a new random number and a new unique identifier value.  For a cursor the code would look like this:

declare @start datetime, @end datetime
set @start = GETDATE()
declare @smackID int

declare curseLoop cursor FAST_FORWARD for select smackID from dbo.CursorWhileSmackdown
open curseLoop
FETCH NEXT FROM curseLoop into @smackID
While @@FETCH_STATUS = 0
BEGIN
      update dbo.CursorWhileSmackdown set someRandomGUID = NEWID(), someRandomNumber = RAND() * 100
      where smackID = @smackID
      FETCH NEXT FROM curseLoop into @smackID
END
close curseLoop
deallocate curseLoop

set @end = GETDATE()
select DATEDIFF(MS, @start, @end) as 'elapsed time in MS', @start as StartTime, @end as EndTime
go

Since we have an identity value in the primary key, the while loop can be a simple counter from 0 to the maximum value, as below:

declare @start datetime, @end datetime
set @start = GETDATE()
declare @smackID int, @rows int
set @smackID = 0
select @rows = MAX(smackID) from dbo.CursorWhileSmackdown

while @smackID <= @rows
BEGIN
      update dbo.CursorWhileSmackdown
      set someRandomGUID = NEWID(), someRandomNumber = RAND() * 100
      where smackID = @smackID
      set @smackID = @smackID + 1
END

set @end = GETDATE()
select DATEDIFF(MS, @start, @end) as 'elapsed time in MS', @start as StartTime, @end as EndTime
go

In both cases I used Profiler to measure the CPU utilization, reads and writes, and found:

 

                            CURSOR     WHILE
CPU                           1232       874
Writes                          39         2
Reads                        30061     20036
Duration in milliseconds      2356      1666

 

In this simple case the while loop is superior by every measure.  Unfortunately, we are rarely given such a clean method for looping through data.  We will typically need to break the data down logically, without the luxury of simply looping through an identity value from 1 to N.  To simulate this, consider driving the update by the random number value rather than the primary key.  For the cursor, the code might be:

declare @start datetime, @end datetime
set @start = GETDATE()
declare @smackID int

declare curseLoop cursor FAST_FORWARD for select someRandomNumber from dbo.CursorWhileSmackdown
open curseLoop
FETCH NEXT FROM curseLoop into @smackID
While @@FETCH_STATUS = 0
BEGIN
      update dbo.CursorWhileSmackdown set someRandomGUID = NEWID()
      where someRandomNumber = @smackID
      FETCH NEXT FROM curseLoop into @smackID
END
close curseLoop
deallocate curseLoop

set @end = GETDATE()
select DATEDIFF(MS, @start, @end) as 'elapsed time in MS', @start as StartTime, @end as EndTime
go

The while loop is more complex and requires the use of a temp table:

declare @start datetime, @end datetime
set @start = GETDATE()
declare @smackID int, @rows int, @i int

create table #While (smackID int, someRandomNumber int)
insert into #While (smackID, someRandomNumber)
select smackID, someRandomNumber from dbo.CursorWhileSmackdown
set @rows = @@ROWCOUNT
create clustered index IX_SmackID on #While (someRandomNumber)

set @i = 0
while @i <= @rows
BEGIN
      select @smackID = MIN(someRandomNumber) from #While where smackID = @i

      update dbo.CursorWhileSmackdown
      set someRandomGUID = NEWID()
      where someRandomNumber = @smackID
      set @i = @i + 1
END

set @end = GETDATE()
select DATEDIFF(MS, @start, @end) as 'elapsed time in MS', @start as StartTime, @end as EndTime
go

 

As you can see, the code is clumsier, and this shows in the results:

 

                            CURSOR       WHILE
CPU                         490608      503602
Writes                       11363       11368
Reads                    213221958   213543402
Duration in milliseconds    599566      616716

 

So here the cursor is superior.  The bottom line is that you need to test, and measure performance against your own goals, with data that is a realistic match for (or subset of) your production system.
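For readers who want to reproduce this kind of comparison without running a Profiler trace, session-level statistics are a lighter-weight option.  A sketch (the output appears on the Messages tab, not as a result set):

```sql
-- Capture logical reads, scan counts, and CPU/elapsed time for whatever
-- batch runs between the ON and OFF statements.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- ... run the cursor or while-loop batch under test here ...

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

This only measures the current session, so it will not show the server-wide picture a trace gives you, but it is often enough to compare two candidate batches side by side.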