After finishing this problem, I thought it was a bit too easy and was compelled to redo it manually in Python. Might as well write the encoder too.
Cause it's fun, and the rabbit-holes...
Encode
With Base64 we convert binary data, including any non-ASCII characters, into a small subset of the ASCII character set.
After refreshing up on the Base64 encoding, the first step was to create the corresponding character table:
BASE64_CHARS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'
Or, using the string module:
import string
BASE64_CHARS = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
Take the message and convert it into binary by getting the integer code of each Unicode character with ord. For example, ord('a') outputs 97, which we convert to binary with format(97, 'b') = 1100001:
message_in_binary = "".join(format(ord(x), 'b').zfill(in_byte_size) for x in message)
We could use bin(), but that prepends 0b to the string representing the binary number. For example, bin(ord('a')) outputs '0b1100001'. Notice that even though Python 3 strings are Unicode, the output is only 7 bits long without the 0b prefix. That's the ASCII representation, which fits in 7 bits. Hence the zfill(8) to get the full 8-bit number representing each character in the message.
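A quick look at the difference:

bin(ord('a'))                   # '0b1100001'
format(ord('a'), 'b')           # '1100001'
format(ord('a'), 'b').zfill(8)  # '01100001'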
By doing this, we have increased the overall data size. A single 7-bit character is 1*7=7 bits, but converted to Base64 it takes 2*6=12 bits, since 7 bits won't fit into a single 6-bit chunk. That is 12/7 ~= 1.71 times the size of the original binary representation. As a Base64 string it becomes 4 characters long, since the output is padded to a multiple of 4 characters, so a single input character yields four times as many characters. The efficiency depends on the length of the string: the shorter the string, the worse the efficiency, up to a point. From Wikipedia:
This encoding causes an overhead of 33–37% (33% by the encoding itself; up to 4% more by the inserted line breaks).
The general formula for the encoded size of n input bytes (without line breaks) is (n + 2 - ((n + 2) % 3)) / 3 * 4 bytes.
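A quick sanity check of that formula against our running example:

n = len('test')                          # 4 input bytes
print((n + 2 - ((n + 2) % 3)) // 3 * 4)  # 8, matching len('dGVzdA==')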
Next, we slice the full binary string into 6-bit chunks. Typically we think of a byte as 8 bits, with each byte representing one character, but that depends on the character encoding.
for i in range(0, len(message_in_binary), out_byte_size):
chunk = message_in_binary[i:i + out_byte_size]
print(chunk)
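For the input test this prints the following chunks; note that the last one is only 2 bits long:

011101
000110
010101
110011
011101
00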
| character | t | e | s | t | (pad) |
| --- | --- | --- | --- | --- | --- |
| ASCII/Unicode code | 116 | 101 | 115 | 116 | |
| 8-bit binary representation | 01110100 | 01100101 | 01110011 | 01110100 | 00000000 |
Notice that the last chunk holds only two leftover zero bits, which we pad with zeros to fill it out to a full 6-bit chunk. Each pair of padded bits is represented by one padding character =, hence the diff // 2 below.
if len(chunk) < out_byte_size:
    diff = out_byte_size - len(chunk)
    chunk += '0' * diff          # or chunk.ljust(out_byte_size, '0'); zfill would pad the wrong side
    padding = '=' * (diff // 2)  # one '=' per 2 bits of fill
For each 6-bit chunk we obtain its integer value using int(binary_number_string, 2) and then look up the corresponding character in the Base64 table (char_set) by index.
enc_msg += char_set[int(chunk, BaseEncoder.BINARY_BASE)]
After the loop, we append the pre-determined padding characters, if any were needed. We should get back dGVzdA== for the input string test.
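Putting the steps together, here's a minimal sketch of the encoder as a standalone function; the name b64_encode and the default arguments are mine, not the original BaseEncoder class:

def b64_encode(message, char_set=BASE64_CHARS, in_byte_size=8, out_byte_size=6):
    # message -> one long bit string, 8 bits per character
    message_in_binary = "".join(format(ord(x), 'b').zfill(in_byte_size) for x in message)
    enc_msg, padding = '', ''
    # slice into 6-bit chunks; zero-fill the final partial chunk
    for i in range(0, len(message_in_binary), out_byte_size):
        chunk = message_in_binary[i:i + out_byte_size]
        if len(chunk) < out_byte_size:
            diff = out_byte_size - len(chunk)
            chunk += '0' * diff          # fill out the last 6-bit chunk
            padding = '=' * (diff // 2)  # one '=' per 2 bits of fill
        enc_msg += char_set[int(chunk, 2)]  # 6-bit value -> Base64 character
    return enc_msg + padding

print(b64_encode('test'))  # dGVzdA==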
Decode
To decode, we run the operation in reverse. First, remove the padding characters. Why did we need them at all? They preserve segment boundaries when encoded strings are sent in sequence over a network and concatenated: a padded Base64 string always has a length divisible by 4.
message = message.replace('=', '')
Take the characters and look up the integer code for them with the base64 character table.
d_char_to_index = {c: i for i, c in enumerate(char_set)}
Get the binary representation of each Base64 character's code and make sure it is padded to a full 6 bits, since format() drops leading zeros and we need the bit string to be exact for slicing out 8-bit bytes:
message_in_binary = ''.join([format(d_char_to_index[c], 'b').zfill(in_byte_size) for c in message])
Next, slice the bit string into 8-bit chunks, zero-filling the final partial chunk:
chunk = message_in_binary[i:i + out_byte_size].zfill(out_byte_size)
After skipping the all-zero leftover chunk, we take each chunk's integer value and call chr() to get the Unicode character:
if chunk != '0' * out_byte_size:
dec_msg += chr(int(chunk, BaseEncoder.BINARY_BASE))
| character | d | G | V | z | d | A | = | = |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Base64 code/int | 29 | 6 | 21 | 51 | 29 | 0 | | |
| 6-bit binary representation | 011101 | 000110 | 010101 | 110011 | 011101 | 000000 | | |
| 8-bit binary representation | 01110100 | 01100101 | 01110011 | 01110100 | | | | |
| ASCII/Unicode character | t | e | s | t | | | | |
Take the 8-bit binary row with a grain of salt: the 6-bit and 8-bit chunks don't map one-to-one, it's only there to show the approximate overlap.
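And the decoder as a matching sketch under the same assumptions; note that in_byte_size and out_byte_size swap roles here:

def b64_decode(message, char_set=BASE64_CHARS, in_byte_size=6, out_byte_size=8):
    message = message.replace('=', '')  # drop the padding characters
    d_char_to_index = {c: i for i, c in enumerate(char_set)}
    # Base64 characters -> one long bit string, 6 bits per character
    message_in_binary = ''.join(format(d_char_to_index[c], 'b').zfill(in_byte_size) for c in message)
    dec_msg = ''
    # slice into 8-bit chunks; a trailing all-zero chunk is leftover fill
    for i in range(0, len(message_in_binary), out_byte_size):
        chunk = message_in_binary[i:i + out_byte_size].zfill(out_byte_size)
        if chunk != '0' * out_byte_size:
            dec_msg += chr(int(chunk, 2))
    return dec_msg

print(b64_decode('dGVzdA=='))  # test

Like the walkthrough code, this skips any all-zero chunk, which means a literal NUL character in the decoded output would also be dropped; the standard library handles that edge case properly.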
Full code
Repl
Applications
Mainly used to convert arbitrary bytes into the Base64 character set (plain text, more or less) for transmission over networks. Anything that can be represented as bits can be encoded this way. For example, you could store an image file Base64-encoded directly in an HTML page (see the sketch below).
Base32 encoding has been used in my implementation of Time-based One-time Password (TOTP) in .NET.
Multipurpose Internet Mail Extensions (MIME)
Let me know of other applications.
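For the inline-image case mentioned above, a small sketch using the standard library; logo.png is a placeholder path:

import base64

# read the raw image bytes and Base64-encode them
with open('logo.png', 'rb') as f:
    b64 = base64.b64encode(f.read()).decode('ascii')

# embed the image directly in the page as a data: URI
html = f'<img src="data:image/png;base64,{b64}">'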
Security
Base64 encoding is not encryption; no encoding is. It only obfuscates the input text. Don't use an encoding to secure sensitive information. I've seen this as part of a CTF challenge.
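A one-liner makes the point:

import base64

token = base64.b64encode(b'password123')  # b'cGFzc3dvcmQxMjM='
print(base64.b64decode(token))            # b'password123', anyone can reverse it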
Some websites could be using Base64-encoded queries as part of their URL. This could lead to any number of injection types (SQL, command, XML, JSON) if the decoded query string is not sanitized/validated.
Invalid Base64 characters found inside an expected Base64-encoded string could also be used to bypass a WAF.
It's also a method to deliver blobs/payloads such as classes and scripts, as in the case of the Log4j vulnerability (the Log4Shell exploit) disclosed earlier this year.
Let me know other related Base64 security issues.
Learnings
- An ASCII character is 7 bits, i.e. a 7-bit byte.
- I wanted to abstract the chunking logic and found that the yield keyword inside a comprehension no longer works as of Python 3.8, since the behavior it relied on in 3.7 and earlier was a bug:
  chunk = lambda l, n: [(yield l[i:i + n]) for i in range(0, len(l), n)]
- Abstractions to convert characters to binary and binary back to characters:
  char_to_binary = lambda s: [bin(ord(x))[2:].zfill(8) for x in s]
  binary_to_char = lambda b: ''.join([chr(int(x, 2)) for x in b])
- For performance, write this in a lower-level language like C/C++.
- Don't use this code in production; it was for learning. Use the base64 Python library (see the snippet below).
- This rabbit-hole keeps on going...
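For reference, the standard library agrees with the hand-rolled results above:

import base64

print(base64.b64encode(b'test'))      # b'dGVzdA=='
print(base64.b64decode(b'dGVzdA==' )) # b'test'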
Edit 01/13/2023:
Looks like I worked on some Base62 encoding/decoding a couple of years ago in PowerShell.
References
https://en.wikipedia.org/wiki/Bit_numbering#Bit_significance_and_indexing
https://github.com/python/cpython/blob/main/Modules/binascii.c
https://www.imperva.com/blog/the-catch-22-of-base64-attacker-dilemma-from-a-defender-point-of-view/
https://owasp.org/www-project-top-ten/2017/A3_2017-Sensitive_Data_Exposure
https://blogs.juniper.net/en-us/security/in-the-wild-log4j-attack-payloads