Quantcast
Channel: Helyar.net
Viewing all articles
Browse latest Browse all 15

SQL Server random unique identifiers

$
0
0

A common method of producing random unique identifiers in SQL server is by using a GUID field, calling newid() to generate the data. For the most part, this works because it’s 128 bits worth of random data, which means there is a very low probability of duplicate records for most databases.

However, it is also common to combine this with the checksum() function to reduce it to a 32 bit integer. This makes collisions much more likely, even in relatively small databases. For example, the GUIDs 28258F69-6536-4198-BE37-94960ABF054F and 49B60D4B-DC4A-4E18-825E-B4C99713D011 both checksum to 0xC3AD13D3. With a table of about 100,000 rows collisions will start to occur more frequently by the birthday paradox.

def P(x, n):
  return 1 - ((x-1)/x)**(n * (n-1)/2)
 
>>> P(2**32,77500)
0.503022489601693
 
>>> P(2**32,200000)
0.9905011979711653
 
>>> P(2**53,10000000)
0.005535735601501335
 
>>> P(2**53,100000000)
0.42599174817345076

Using this maths, we can see that using a 32 bit random number, the probability of getting at least one collision is 50% at around 77,500 rows and 99% at 200,000 rows. We can also see that if we increase this to a 53 bit number, 10 million rows gives a 0.55% chance of getting at least collision and 100 million rows gives a 42.5% chance of getting at least one collision, so 64 bit should be plenty.

For higher precision numbers, we can use mpmath

from mpmath import *
mp.dps=50
def P(x, n):
  return 1 - ((x-1)/x)**(n * (n-1)/2)
 
>>> print( P(mpf(2)**mpf(64), mpf(1000000000)) )
0.026741008873954667138505944308767166153477279783999
>>> print( P(mpf(2)**mpf(64), mpf(10000000000)) )
0.93349681454859684590636893705314709163709804436875

From here you can see that a 64 bit number has a 2.6% chance of getting a single collision in a 1 billion row table, and a 93% chance in a 10 billion row table.

A compromise of both is to simply truncate the GUID at 64 bits and optionally convert to a bigint.

SELECT CONVERT(BIGINT, CONVERT(BINARY(8), newid()))

If you leave it as binary and don’t need to convert to an integer type, this does not have to be 8 bytes. For example, you could have a 5 or a 10 byte code.

None of these are perfect but the probability of a collision decreases with more bits. If 128 bit is too long for you (e.g. to display to users) but 32 bit generates too many collisions, try a compromise such as 64 bit.

If you are consistent enough, you may even be able to store the original GUID and just display the truncated form, which could allow you to change the length displayed later without changing the probability of collisions. This is more flexible but may lead to confusion among users and consistency is required (differing lengths could lead to bugs).


Viewing all articles
Browse latest Browse all 15

Trending Articles