After finishing this problem, I thought it was a bit too easy and was compelled to redo it manually in Python. Might as well write the encoder too.
Cause it's fun, and the rabbit-holes...
Encode
With Base64 we convert binary data, including any non-ASCII characters, into a small subset of the ASCII character set.
After refreshing up on the Base64 encoding, the first step was to create the corresponding character table:
BASE64_CHARS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'
Or, using the string module:
import string
BASE64_CHARS = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
Take the message and convert it into binary by getting the integer code of each Unicode character with ord. For example, ord('a') outputs 97, which we convert to binary with format(97, 'b') = 1100001:
message_in_binary = "".join(format(ord(x), 'b').zfill(in_byte_size) for x in message)
We could use bin(), but that prepends 0b to the string representing the binary number. For example, bin(ord('a')) outputs '0b1100001'. Notice that even though Python 3 strings are Unicode, the output is only 7 bits long without the 0b prefix. That's the ASCII representation, which fits in 7 bits. Hence the zfill(8) to get the full 8-bit number representing each character in the message.
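A quick look at the difference:

bin(ord('a'))                   # '0b1100001'
format(ord('a'), 'b')           # '1100001'
format(ord('a'), 'b').zfill(8)  # '01100001'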
By doing this, we have increased the overall data size. A single 7-bit character is 1*7=7 bits, but converted to Base64 it takes 2*6=12 bits, since 7 bits won't fit into a single 6-bit chunk. That is 12/7 ~= 1.71 times the size of the original binary representation. As a Base64 string it becomes 4 characters long, since the output is padded to a multiple of 4 characters, so a single input character yields four times as many characters. The efficiency depends on the length of the string: the shorter the string, the worse the efficiency, up to a point. From Wikipedia:
This encoding causes an overhead of 33–37% (33% by the encoding itself; up to 4% more by the inserted line breaks).
The general formula for the encoded size of n input bytes (without line breaks) is (n + 2 - ((n + 2) % 3)) / 3 * 4 bytes.
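A quick sanity check of that formula against our running example:

n = len('test')                          # 4 input bytes
print((n + 2 - ((n + 2) % 3)) // 3 * 4)  # 8, matching len('dGVzdA==')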
Next, we slice the full binary string into 6-bit chunks. Typically we think of a byte as 8 bits, with each byte representing one character, but that depends on the character encoding.
for i in range(0, len(message_in_binary), out_byte_size):
chunk = message_in_binary[i:i + out_byte_size]
print(chunk)
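For the input test this prints the following chunks; note that the last one is only 2 bits long:

011101
000110
010101
110011
011101
00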
| character | t | e | s | t | (pad) |
| --- | --- | --- | --- | --- | --- |
| ASCII/Unicode code | 116 | 101 | 115 | 116 | |
| 8-bit binary representation | 01110100 | 01100101 | 01110011 | 01110100 | 00000000 |
Notice that the last chunk holds only two leftover zero bits, which we pad with zeros to fill it out to a full 6-bit chunk. Each pair of padded bits is represented by one padding character =, hence the diff // 2 below.
if len(chunk) < out_byte_size:
    diff = out_byte_size - len(chunk)
    chunk += '0' * diff          # or chunk.ljust(out_byte_size, '0'); zfill would pad the wrong side
    padding = '=' * (diff // 2)  # one '=' per 2 bits of fill
For each 6-bit chunk we obtain its integer value using int(binary_number_string, 2) and then look up the corresponding character in the Base64 table (char_set) by index.
enc_msg += char_set[int(chunk, BaseEncoder.BINARY_BASE)]
After the loop, we append the pre-determined padding characters, if any were needed. We should get back dGVzdA== for the input string test.
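Putting the steps together, here's a minimal sketch of the encoder as a standalone function; the name b64_encode and the default arguments are mine, not the original BaseEncoder class:

def b64_encode(message, char_set=BASE64_CHARS, in_byte_size=8, out_byte_size=6):
    # message -> one long bit string, 8 bits per character
    message_in_binary = "".join(format(ord(x), 'b').zfill(in_byte_size) for x in message)
    enc_msg, padding = '', ''
    # slice into 6-bit chunks; zero-fill the final partial chunk
    for i in range(0, len(message_in_binary), out_byte_size):
        chunk = message_in_binary[i:i + out_byte_size]
        if len(chunk) < out_byte_size:
            diff = out_byte_size - len(chunk)
            chunk += '0' * diff          # fill out the last 6-bit chunk
            padding = '=' * (diff // 2)  # one '=' per 2 bits of fill
        enc_msg += char_set[int(chunk, 2)]  # 6-bit value -> Base64 character
    return enc_msg + padding

print(b64_encode('test'))  # dGVzdA==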
Decode
To decode, we run the operation in reverse. First, remove the padding characters. Why did we need them at all? They preserve segment boundaries when encoded strings are sent in sequence over a network and concatenated: a padded Base64 string always has a length divisible by 4.
message = message.replace('=', '')
Take the characters and look up the integer code for them with the base64 character table.
d_char_to_index = {c: i for i, c in enumerate(char_set)}
Get the binary representation of each Base64 character's code and make sure it is padded to a full 6 bits, since format() drops leading zeros and we need the bit string to be exact for slicing out 8-bit bytes:
message_in_binary = ''.join([format(d_char_to_index[c], 'b').zfill(in_byte_size) for c in message])
Next, slice the bit string into 8-bit chunks, zero-filling the final partial chunk:
chunk = message_in_binary[i:i + out_byte_size].zfill(out_byte_size)
After skipping the all-zero leftover chunk, we take each chunk's integer value and call chr() to get the Unicode character:
if chunk != '0' * out_byte_size:
dec_msg += chr(int(chunk, BaseEncoder.BINARY_BASE))
| character | d | G | V | z | d | A | = | = |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Base64 code/int | 29 | 6 | 21 | 51 | 29 | 0 | | |
| 6-bit binary representation | 011101 | 000110 | 010101 | 110011 | 011101 | 000000 | | |
| 8-bit binary representation | 01110100 | 01100101 | 01110011 | 01110100 | | | | |
| ASCII/Unicode character | t | e | s | t | | | | |
Take the 8-bit binary row with a grain of salt: the 6-bit and 8-bit chunks don't map one-to-one, it's only there to show the approximate overlap.
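And the decoder as a matching sketch under the same assumptions; note that in_byte_size and out_byte_size swap roles here:

def b64_decode(message, char_set=BASE64_CHARS, in_byte_size=6, out_byte_size=8):
    message = message.replace('=', '')  # drop the padding characters
    d_char_to_index = {c: i for i, c in enumerate(char_set)}
    # Base64 characters -> one long bit string, 6 bits per character
    message_in_binary = ''.join(format(d_char_to_index[c], 'b').zfill(in_byte_size) for c in message)
    dec_msg = ''
    # slice into 8-bit chunks; a trailing all-zero chunk is leftover fill
    for i in range(0, len(message_in_binary), out_byte_size):
        chunk = message_in_binary[i:i + out_byte_size].zfill(out_byte_size)
        if chunk != '0' * out_byte_size:
            dec_msg += chr(int(chunk, 2))
    return dec_msg

print(b64_decode('dGVzdA=='))  # test

Like the walkthrough code, this skips any all-zero chunk, which means a literal NUL character in the decoded output would also be dropped; the standard library handles that edge case properly.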
Full code
Repl
Applications
Mainly used to convert arbitrary bytes into the Base64 character set (plain text, more or less) for transmission over networks. Anything that can be represented as bits can be encoded this way. For example, you could store an image file Base64-encoded directly in an HTML page (see the sketch below).
Base32 encoding has been used in my implementation of Time-based One-time Password (TOTP) in .NET.
Multipurpose Internet Mail Extensions (MIME)
Let me know of other applications.
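For the inline-image case mentioned above, a small sketch using the standard library; logo.png is a placeholder path:

import base64

# read the raw image bytes and Base64-encode them
with open('logo.png', 'rb') as f:
    b64 = base64.b64encode(f.read()).decode('ascii')

# embed the image directly in the page as a data: URI
html = f'<img src="data:image/png;base64,{b64}">'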
Security
Base64 encoding is not encryption; no encoding is. It only obfuscates the input text. Don't use an encoding to secure sensitive information. I've seen this as part of a CTF challenge.
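A one-liner makes the point:

import base64

token = base64.b64encode(b'password123')  # b'cGFzc3dvcmQxMjM='
print(base64.b64decode(token))            # b'password123', anyone can reverse it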
Some websites could be using Base64-encoded queries as part of their URL. This could lead to any number of injection types (SQL, command, XML, JSON) if the decoded query string is not sanitized/validated.
Invalid Base64 characters found inside an expected Base64-encoded string could also be used to bypass a WAF.
It's also a method to deliver blobs/payloads such as classes and scripts, as in the case of the Log4j vulnerability (the Log4Shell exploit) disclosed earlier this year.
Let me know other related Base64 security issues.
Learnings
- An ASCII character is 7 bits, i.e. a 7-bit byte.
- I wanted to abstract the chunking logic and found that the yield keyword inside a comprehension no longer works as of Python 3.8, since the behavior it relied on in 3.7 and earlier was a bug:
  chunk = lambda l, n: [(yield l[i:i + n]) for i in range(0, len(l), n)]
- Abstractions to convert characters to binary and binary back to characters:
  char_to_binary = lambda s: [bin(ord(x))[2:].zfill(8) for x in s]
  binary_to_char = lambda b: ''.join([chr(int(x, 2)) for x in b])
- For performance, write this in a lower-level language like C/C++.
- Don't use this code in production; it was for learning. Use the base64 Python library (see the snippet below).
- This rabbit-hole keeps on going...
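For reference, the standard library agrees with the hand-rolled results above:

import base64

print(base64.b64encode(b'test'))      # b'dGVzdA=='
print(base64.b64decode(b'dGVzdA==' )) # b'test'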
Edit 01/13/2023:
Looks like I worked on some Base62 encoding/decoding a couple of years ago in PowerShell.
References
https://en.wikipedia.org/wiki/Bit_numbering#Bit_significance_and_indexing
https://github.com/python/cpython/blob/main/Modules/binascii.c
https://www.imperva.com/blog/the-catch-22-of-base64-attacker-dilemma-from-a-defender-point-of-view/
https://owasp.org/www-project-top-ten/2017/A3_2017-Sensitive_Data_Exposure
https://blogs.juniper.net/en-us/security/in-the-wild-log4j-attack-payloads