Have you ever wondered why SSL certificates have a strange code in their body that seemingly consists only of letters, numbers, and the "+" and "/" characters? If you've ever checked the actual content of a saved email, the embedded pictures are represented with a similar code in the body.
This code is called Base64 and, simply put, it is meant to convert binary files to text format. But what does that mean? What is a binary file, why does it need to be converted to plain text, and how does it work?
A little background
Computers work with numbers. This is how they are designed: the binary 0 and 1 can represent any number, which the computer can then store, calculate with, transmit, etc. But what about letters? In the early 1960s the American Standard Code for Information Interchange (ASCII) was created to map a number to every letter, giving a standard that all computer makers can follow.
Characters are stored one per byte. A byte consists of 8 bits. One bit can represent two values: either 0 or 1. Two bits combined can represent twice as many: 00, 01, 10, 11, that is, 4 different values. Following that logic, 8 bits (a byte) can take 256 values. Standard ASCII defines only the first 128 of them (7 bits); extended ASCII tables use all 256.
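The doubling rule can be checked with a few lines of Python. This is only an illustrative sketch; the name `values_per_bits` is made up for the example.

```python
# Each extra bit doubles the number of distinct values: n bits -> 2**n values.
values_per_bits = {n: 2 ** n for n in (1, 2, 6, 8)}

for n, count in values_per_bits.items():
    print(f"{n} bit(s) -> {count} values")
```

The 6-bit row (64 values) is the one that matters for Base64 later on.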
In this ASCII table each character, whether a letter, a digit or a punctuation mark, is represented by a number, and this is how the computer actually works with it. For instance, the decimal number 65 represents the letter "A" for the computer.
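Python's built-in `ord` and `chr` functions expose this mapping directly, so the "A" example can be tried as a quick sketch:

```python
letter = "A"
code = ord(letter)          # character -> number
bits = format(code, "08b")  # the same number written as 8 binary digits

print(letter, code, bits)   # A 65 01000001
print(chr(65))              # number -> character, prints A
```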
Source: https://theasciicode.com.ar/
Binary vs text files
Looking at the ASCII table above, we notice that a few numbers represent special characters, like 0, the null character, 10, the line feed, or 13, the carriage return. To the computer these control characters, among others, have a special meaning: they control devices and data streams rather than stand for visible symbols, so they are not meant to be displayed. In fact, if we try to print them on the monitor, we usually get gibberish or unexpected behavior back, as the terminal treats them as instructions rather than as text.
This is the point where we need to talk about the difference between binary and text files. Binary files can contain all these special characters; they can use all 256 values a byte can take. Text files contain only printable characters, plus a few layout characters such as line breaks.
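A rough way to tell the two apart is to check whether every byte is printable. The sketch below assumes "text" means printable ASCII (32 to 126) plus tab, line feed and carriage return; that definition and the helper name `looks_like_text` are assumptions made for this illustration, not a formal rule.

```python
def looks_like_text(data: bytes) -> bool:
    """Return True if every byte is printable ASCII or a common layout character."""
    allowed_controls = {9, 10, 13}  # tab, line feed, carriage return
    return all(32 <= b <= 126 or b in allowed_controls for b in data)

print(looks_like_text(b"hello!\n"))        # True
print(looks_like_text(b"\x00\x01\xffhi"))  # False
```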
Base64 characters
It was recognized early that in many cases it might be necessary to transmit binary data in simple text format. SSL certificates are perfect examples: they contain a lot of information about the domain, SAN, algorithms used, the public key, serial number, issuer, etc., crammed into one X.509 certificate file. However, those files are binary files. To distribute them more easily, we need them in a plain text format. This is where Base64 comes into the picture. The Base64 algorithm dissects the original data and encodes it using only printable characters. As the name suggests, 64 characters are allocated for this purpose: the 26 uppercase (A-Z) English letters, the 26 lowercase (a-z) letters, the 10 digits (0-9) and the "+" and "/" characters.
Any other two characters could have been chosen; these are simply the ones that were picked. Also, at the end of Base64 encoded content you might see one or two "=" characters; they are for padding only. More on that later.
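The 64-character table can be assembled from Python's `string` constants. The name `BASE64_ALPHABET` is just a label chosen for this sketch.

```python
import string

# 26 uppercase + 26 lowercase + 10 digits + "+" and "/" = 64 characters.
BASE64_ALPHABET = (
    string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
)

print(len(BASE64_ALPHABET))                                            # 64
print(BASE64_ALPHABET[0], BASE64_ALPHABET[26], BASE64_ALPHABET[63])    # A a /
```

The position of a character in this string is its Base64 value, so "A" is 0, "a" is 26 and "/" is 63.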
Base64 mapping
8bit to 6bit
Now let's see how the encoding process works. As discussed earlier, the original text consists of a chain of regular characters that are represented by their 8-bit ASCII numbers. We use the string "hello!" as an example.
The letters that you see are stored as the following ASCII numbers in memory: 104, 101, 108, 108, 111, 33.
However, in our Base64 table we have only 64 numbers, which occupy only 6 bits (2^6 = 64) instead of 8 bits (2^8 = 256).
To convert 8-bit characters to 6-bit ones, we first need to find the least common multiple of 6 and 8, which is 24.
That simply means the smallest ASCII block we can convert to Base64 will be 24bits long, that translates to 3 ASCII characters. As a result, we'll get a 4 character long Base64 output.
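The block sizes follow from simple arithmetic, sketched here with `math.gcd`. The constant and variable names are illustrative only.

```python
from math import gcd

BITS_PER_BYTE = 8
BITS_PER_BASE64_CHAR = 6

# Least common multiple of 8 and 6: the smallest bit count both sizes divide evenly.
block_bits = BITS_PER_BYTE * BITS_PER_BASE64_CHAR // gcd(BITS_PER_BYTE, BITS_PER_BASE64_CHAR)
input_chars_per_block = block_bits // BITS_PER_BYTE
output_chars_per_block = block_bits // BITS_PER_BASE64_CHAR

print(block_bits, input_chars_per_block, output_chars_per_block)  # 24 3 4
```

Because every 3 input characters become 4 output characters, Base64 text is about one third longer than the data it encodes.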
Encode
Our example is just perfect as it contains 6 characters, which is a multiple of 3.
First, the encoder divides the original string into two pieces, "hel" and "lo!", to create 24-bit blocks.
The decimal ASCII values of "hel" are 104, 101 and 108. In binary it is 01101000, 01100101, 01101100.
Then the encoder divides this 24-bit block into 4 pieces to get 6-bit chunks: 011010, 000110, 010101, 101100. In decimal, they correspond to the numbers 26, 6, 21 and 44.
The last step is mapping those numbers using the Base64 table where 0 maps to A, 1 to B, etc.
We get the end result of: aGVs
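The same walk-through can be reproduced step by step in Python. This is a teaching sketch that works on strings of "0" and "1" for readability, not an efficient encoder; all variable names are made up for the example.

```python
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

block = "hel"
codes = [ord(c) for c in block]                       # ASCII values of the 3 characters
bits = "".join(format(c, "08b") for c in codes)       # one 24-bit string
chunks = [bits[i:i + 6] for i in range(0, 24, 6)]     # four 6-bit chunks
indexes = [int(chunk, 2) for chunk in chunks]         # chunk values, 0..63
encoded = "".join(ALPHABET[i] for i in indexes)       # look each value up in the table

print(codes)    # [104, 101, 108]
print(chunks)   # ['011010', '000110', '010101', '101100']
print(indexes)  # [26, 6, 21, 44]
print(encoded)  # aGVs
```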
Padding
Our encoder is almost complete. The last thing we need to mention is what happens if the length of the string that we want to encode is not a multiple of 3. In that case the last block won't be 3 characters long. With only one or two characters we cannot run the encoding algorithm, as it works with exactly 24 bits at a time. This is where padding comes into the picture.
If the last block is one character long, the encoder adds two padding characters to the end of the original string. If the last block is two characters long, only one padding character is needed. I like to use the null char, \x00 (all zero bits, which is also what standard Base64 implementations assume), and the padding is discarded at the end of the decoding process anyway. Then our program encodes the block and replaces the trailing output characters that came purely from the padding with "=" marks, one for each padding character added. This way the decoder will know how many characters to discard from the end of the decoded string to get the original string back.
As a quick example, see the illustration below. When we encode the string "hello!y", the encoder needs to add two padding characters to support the encoding process. Notice that the "hello!you" and the shorter "hello!y" strings have the same encoded length. This is because the encoder encodes the "hello!yAA" string (where "AA" stands for the two padding characters), and the decoder knows from the "==" signs to discard the two padding characters from the end of the string.
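The effect of padding is easy to observe with Python's standard `base64` module, used here as a reference implementation rather than the hand-written encoder from this article. The `results` dictionary is only a helper for the sketch.

```python
import base64

results = {}
for text in ("hello!", "hello!y", "hello!yo", "hello!you"):
    encoded = base64.b64encode(text.encode("ascii")).decode("ascii")
    results[text] = encoded
    print(f"{text!r:13} -> {encoded} ({len(encoded)} chars)")
```

"hello!y" comes out as "aGVsbG8heQ==" and "hello!you" as "aGVsbG8heW91": both are 12 characters long, and only the number of "=" marks tells them apart in length.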
Base64 decoding
Decoding is simply the reverse of the encoding process.
The decoder checks the Base64 table to get the numeric value of each Base64 character back: "a" becomes 26 and "G" becomes 6 after mapping. Four of these values form 24 bits, which are divided into three 8-bit pieces. Those are the actual ASCII values of the original characters. After decoding every block, any padding is discarded from the end and we're done!
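Reversing the "hel" example makes the steps concrete. As before, this is a readable sketch that works on bit strings, with names chosen only for the illustration.

```python
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

block = "aGVs"
indexes = [ALPHABET.index(c) for c in block]               # Base64 chars -> values 0..63
bits = "".join(format(i, "06b") for i in indexes)          # four 6-bit values -> 24 bits
codes = [int(bits[i:i + 8], 2) for i in range(0, 24, 8)]   # three 8-bit ASCII values
decoded = "".join(chr(c) for c in codes)

print(indexes)  # [26, 6, 21, 44]
print(codes)    # [104, 101, 108]
print(decoded)  # hel
```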
Encoders / Decoders
Python Base64 encoder and decoder - detailed guide
    import sys

    def base64encode(s):
        i = 0
        base64 = ''
        base64chars = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'

        # Number of padding characters needed to make the length a multiple of 3
        pad = (3 - len(s) % 3) % 3
        # Pad with null characters (all zero bits)
        s += '\x00' * pad

        # Iterate through the whole input string
        while i < len(s):
            b = 0
            # Take 3 characters at a time, convert them to 4 Base64 chars
            for j in range(3):
                # Get the ASCII code of the next character in line
                n = ord(s[i])
                i += 1
                # Concatenate the three characters into one 24-bit number
                b += n << 8 * (2 - j)

            # Convert the 24 bits to four Base64 chars, 6 bits each
            base64 += base64chars[(b >> 18) & 63]
            base64 += base64chars[(b >> 12) & 63]
            base64 += base64chars[(b >> 6) & 63]
            base64 += base64chars[b & 63]

        # Replace the characters that came purely from padding with "=" marks
        if pad:
            base64 = base64[:-pad] + '=' * pad

        # Print the Base64 encoded result
        print(base64)

    base64encode(sys.argv[1])
Decoder
    import sys

    def base64decode(s):
        i = 0
        decoded = ''
        base64chars = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'

        # Count the "=" marks to know how many characters to drop at the end
        if s[-2:] == '==':
            padd = 2
        elif s[-1:] == '=':
            padd = 1
        else:
            padd = 0

        # Swap the "=" marks for "A" (value 0) so the length stays a multiple of 4
        if padd:
            s = s[:-padd] + 'A' * padd

        # Take 4 characters at a time
        while i < len(s):
            d = 0
            for j in range(4):
                d += base64chars.index(s[i]) << (18 - j * 6)
                i += 1

            # Convert the 24 bits back to three ASCII characters
            decoded += chr((d >> 16) & 255)
            decoded += chr((d >> 8) & 255)
            decoded += chr(d & 255)

        # Remove the padding characters from the end
        decoded = decoded[0:len(decoded) - padd]

        # Print the decoded result
        print(decoded)

    base64decode(sys.argv[1])
PowerShell Base64 encoder and decoder - detailed guide
    function Base64Encode($s) {
        $i = 0
        $base64 = ''
        $base64chars = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'

        # Number of padding characters needed to make the length a multiple of 3
        $pad = (3 - $s.Length % 3) % 3
        # Pad with null characters (all zero bits)
        $s += ([string][char]0) * $pad

        # Iterate through the whole input string
        while ($i -lt $s.Length) {
            # Take 3 characters at a time, convert them to 4 Base64 chars
            $b = 0
            for ($j = 0; $j -lt 3; $j++) {
                # Get the ASCII code of the next character in line
                $ascii = [int][char]$s[$i]
                $i++
                # Concatenate the three characters into one 24-bit number
                $b += $ascii -shl (8 * (2 - $j))
            }

            # Convert the 24 bits to four Base64 chars, 6 bits each
            $base64 += $base64chars[ ($b -shr 18) -band 63 ]
            $base64 += $base64chars[ ($b -shr 12) -band 63 ]
            $base64 += $base64chars[ ($b -shr 6) -band 63 ]
            $base64 += $base64chars[ $b -band 63 ]
        }

        # Replace the characters that came purely from padding with "=" marks
        if ($pad -gt 0) {
            $base64 = $base64.Substring(0, $base64.Length - $pad) + ('=' * $pad)
        }

        # Print the Base64 encoded result
        Write-Host $base64
    }
Decoder
    function Base64Decode($s) {
        $i = 0
        $decoded = ''
        $base64chars = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'
        $padd = 0

        # Count the "=" marks to know how many characters to drop at the end
        if ($s.EndsWith('==')) {
            $padd = 2
        } elseif ($s.EndsWith('=')) {
            $padd = 1
        }

        # Swap the "=" marks for "A" (value 0) so the length stays a multiple of 4
        if ($padd -gt 0) {
            $s = $s.Substring(0, $s.Length - $padd) + ('A' * $padd)
        }

        # Take 4 characters at a time
        while ($i -lt $s.Length) {
            $d = 0
            for ($j = 0; $j -lt 4; $j++) {
                $d += $base64chars.IndexOf($s[$i]) -shl (18 - $j * 6)
                $i++
            }

            # Convert the 24 bits back to three ASCII characters
            $decoded += [char](($d -shr 16) -band 255)
            $decoded += [char](($d -shr 8) -band 255)
            $decoded += [char]($d -band 255)
        }

        # Remove the padding characters from the end
        $decoded = $decoded.Substring(0, $decoded.Length - $padd)

        # Print the decoded result
        Write-Host $decoded
    }