Don't uppercase/lowercase Base64 encoded MD5 Strings

I have been working on a different part of the codebase and essentially a different team this last sprint. The work consisted of porting over an endpoint to a new version of an REST API. I had made some code updates to some mapping logic which upper cased a value of a specific property. This property is normally a UUID/GUID represented as a string. Which is fine, since the character set is hexadecimal. The string value upper or lower case will initialize to an object with the same values.

from uuid import UUID
output_lower = UUID('b70ef3ab-d77e-4a1f-ae33-fca12c9b2696')
output_upper = UUID('B70EF3AB-D77E-4A1F-AE33-FCA12C9B2696')
assert output_lower == output_upper

using System;
using System.Diagnostics;
public class GuidCheck
{
    public static void Main(string[] args)
    {
        var a = new Guid("b70ef3ab-d77e-4a1f-ae33-fca12c9b2696");
        var b = new Guid("B70EF3AB-D77E-4A1F-AE33-FCA12C9B2696");
        Debug.Assert(a == b);
    }
}

In a particular case, the value looks to be a base64 encoded string (not actual value): gdyb21LQTcIANtvYMT7QVQ==

The trailing = are base64 padding characters. I've written about base64 encoding before. If so, upper-casing this would not be appropriate since multiple inputs could provide the same output. For this string it would be exactly 524288 possible input strings. Since there are 19 letters which could be upper or lower cased and 1 digit.

$$\displaylines{ \\ \text{Total Combinations} = 2 \times 2 \times 2 \times 2 \times 2 \times 2 \times 2 \times 2 \\ \times 2 \times 2 \times 2 \times 2 \times 2 \times 2 \times 2 \\\times 2 \times 2 \times 2 \times 2 \\ \times 1 }$$

$$\displaylines{ \text{Total Combinations} = 2^{19} \times 1^{2} \\ \text{Total Combinations} = 2^{19} \\ \text{Total Combinations} = 524288 }$$

import itertools
def generate_combinations(s) -> set:
    chars, pos_chars, count = [*s], [], 0

    for char in chars:
        if char.isalpha():
            count += 1
            pos_chars.append([char.lower(), char.upper()])
        else:
            pos_chars.append([char])

    combinations = itertools.product(*pos_chars)
    result = [''.join(combination) for combination in combinations]

    assert len(result) == 2**(count) # Should be equal.
    return result

string = "gdyb21LQTcIANtvYMT7QVQ=="
all_combinations = generate_combinations(string)

print(len(all_combinations))
#print(all_combinations)

This also got me curious as to what the actual values are. I have suspicions that it could be a hash. Decoding this with online base64 decoders did not yield anything in the ASCII range. That's when I decided that I have to take a deeper dive.

The gdyb21LQTcIANtvYMT7QVQ== represents 132-bits of data since there are 24 characters and Base64 characters represent 6-bits instead of 8-bits (ASCII).

Every 4 base64 characters represent 3 bytes (or 24 bits) of the original data. Since we have 22 characters, we can divide by 4 to find out how many groups of 4 characters we have:

$$\frac{22}{4} = 5.5$$

With 5 groups of 4 characters = 5 * 3 bytes = 15 bytes. The remaining 2 base64 characters represent 1 additional byte (since 2 characters provide 12-bits, and base64 encoding works in 6-bit chunks, this adds up to 1.5 bytes, but since it can't be half a byte, it indicates 1 byte in the decoding process).

The original string would be 16 bytes (15 bytes + 1 byte). Since each byte is 8-bits we get 128-bits for the original decoded string value. That is how Jon Skeet got to that the declaration in this StackOverflow question.

So the original string is 128-bits so it could be the following candidates:

Decode

Some of them contain the typical +, -, =, / characters, but others would contain _ which isn't part of the character set which would yield:
binascii.Error: Invalid base64-encoded string: number of data characters (21) cannot be 1 more than a multiple of 4

The input string is is a multiple of 4, but the _ is breaking the decoding. Looks like the string is base64 URL safe encoded.

https://gist.github.com/Wind010/76902a5ea3f48287e34829e9bca67ff9

UUID

A Universally Unique Identifier (UUID) or Globally Unique Identifier (GUID) is a 128-bit string, nominal object. Useful for when you want to have a unique ID without a centralized tracking system. The probability of generating a duplicate UUID/GUID or one colliding with another is close enough to be negligible and can be shown with the equation for the birthday problem.

$$\displaylines{ P(\text{no collision}) \approx e^{-\frac{n^2}{2N}} \\ P(\text{collision}) = 1 - P(\text{no collision}) \approx 1 - e^{-\frac{n^2}{2N}} }$$

If we generate 1-trillion UUIDs (e12), n = 1*10^12 , and if we use UUID version 4 which has 122-bits of randomness (4 bits are used to indicate the version, and another 2 for the variant), then N = 2^122:

$$\displaylines{ P(\text{collision}) \approx 1 - e^{-\frac{(10^{12})^2}{2 \cdot 2^{122}}} \\ P(\text{collision}) \approx 1 - e^{-\frac{10^{24}}{2 \cdot 2^{122}}} \\ P(\text{collision}) \approx 1 - e^{-\frac{10^{24}}{2^{123}}} \\ P(\text{collision}) \approx 1 - e^{-\frac{10^{24}}{2 \cdot 1.06 \times 10^{37}}} \\ P(\text{collision}) \approx 1 - e^{-9.71 \times 10^{-14}} }$$

Where 2^123 is approximately 1.06 x 10^37. Given that e^(-9.71x10^-14) is small, we can approximate it to 1 - 9.71 x 10^-14 since the exponent of -14 is very small using Taylor Expansion of e^x (e^x =~ 1-x).

$$P(\text{collision}) \approx 9.71 \times 10^{-14}$$

To put that in perspective, the lifetime of the universe is expected to be 1.38x10^10 years. If we generate 1 trillion UUIDs every year, the cumulative probability of the collision over the lifetime of the universe is 0.134%. UUIDs are generated within the scope of a specific system and you can think of each of those systems as their own universe.

Here is a repl to get a better idea of what the collision probability looks like for different bit-lengths. Play around with it and adjust the log-scaling for n values.

https://replit.com/@Wind010/CollisionProbabiilty

So what happens if we upper-case the base64 encoded UUID? We've reduced the character set from a-z, A-Z, 0-9, +, / to 38 characters. The original base64 encoding has a bit length of 6 bits per character. The new effective bit length is:

$$\log_{2}38 \approx 5.247$$

The new bit length is 22 characters * 5.247 =~ 115.434 bits. Using the same formula above, N = 2^115.434 which is the possible unique UUIDs after upper-casing for 1-trillion UUIDs:

$$P(\text{collision}) \approx 1.53 \times 10^{-11}$$

If you were to use the final upper-cased base64 22 character string as a unique identifier, it would probably still be unique without a collision occurring. So if this value was being used in big data for machine learning, it would probably not have any effect. It's still not ideal, but it still unique enough based depending on how many of the UUIDs are generated.

MD5

Message-digest algorithm is a hashing function that produces a 128-bit hash value. Hashing is a one-way function to map data of arbitrary size to another value. MD5 itself used to be used for hashing passwords. Even with a salt (known secret bits: MD5(salt+password) or MD5(password+salt). It is not recommended to use MD5 to store passwords. It is not recommended for any cryptographic operations since collisions can be coerced and was declared cryptographically broken in 2008. SHA1 (160-bits) was broken since 2004. MD5 is still used for a checksum of sorts for files, but can also be exploited. Today, use Argon2, PBKDF2, and Bcrypt which all require a salt.

To validate if the value is indeed an MD5 hash, a dictionary lookup site like https://crackstation.net/ can be used:

Summary

Don't upper/lower case your base64 encoded strings people, since the point is to encode some data for some decoding at a future time after transferring the data in that restricted ASCII character set. Don't use MD5 or SHA1 hashing algorithm for cryptographic purposes since there they are currently insecure (collisions, rainbow-tables, and brute-force with current consumer hardware). I wish Hashnode had inline Mathjax/Latex support.

Uppercase/lowercase Base64 encoded MD5 Strings

Don't.

Decode

UUID

MD5

Summary

References