Last Updated: 2019-06-03 22:05:17 UTC
by Didier Stevens (Version: 1)
We've often shown BASE64 encoded PowerShell scripts in our diary entries. And you might have noticed they contain lots of A characters (uppercase letter a).
Take, for example, the PowerShell script in one of our recent diary entries. I've highlighted the As for you here:
It's a characteristic of BASE64 encoded PowerShell that helps with its identification.
But why is the prevalence of letter A high?
A PowerShell script passed as a command-line argument (option -EncodedCommand) has to be UNICODE text, encoded in BASE64, per PowerShell's help:
Property Unicode of System.Text.Encoding is little-endian UTF16. ASCII text (e.g. most PowerShell commands) requires only 7 bits per character, but each character is encoded with 16 bits (2 bytes) in UTF16. The 9 extra bits are given value 0. Hence, for each character, you have at least one byte (8 bits) that is composed of only 0 bits: a byte with value 0.
Little-endian means that the least significant byte is stored first. Take letters ISC. In hexadecimal (ASCII), that's 49 53 43. In little-endian UTF16, we take 2 bytes instead of 1 byte to encode each character, hence it becomes: 49 00 53 00 43 00 (big-endian is 00 49 00 53 00 43).
So, what I've shown with this example is that ASCII text encoded in UTF16 contains a lot of bytes with value 0.
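If you want to check this byte layout yourself, here is a minimal sketch. Python is my choice for the illustration only, and the helper name hex_bytes is made up for this example.

```python
def hex_bytes(data: bytes) -> str:
    """Return the bytes as space-separated, uppercase hexadecimal pairs."""
    return " ".join(f"{b:02X}" for b in data)

text = "ISC"
ascii_bytes = text.encode("ascii")
utf16le_bytes = text.encode("utf-16-le")  # little-endian, no byte order mark
utf16be_bytes = text.encode("utf-16-be")  # big-endian, for comparison

print(hex_bytes(ascii_bytes))    # 49 53 43
print(hex_bytes(utf16le_bytes))  # 49 00 53 00 43 00
print(hex_bytes(utf16be_bytes))  # 00 49 00 53 00 43

# Fraction of bytes with value 0 in the little-endian UTF16 encoding
zero_fraction = utf16le_bytes.count(0) / len(utf16le_bytes)
print(zero_fraction)  # 0.5
```

Half of the bytes are 0 here because every character in ISC is plain ASCII.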
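In BASE64, the sequence of bytes to be encoded is split into groups of 6 bits. This means that a byte with value 0 (8 zero bits) will, 2 times out of 3, produce a 6-bit group of zeroes: it depends on where the byte falls within each group of 3 bytes (24 bits, or four 6-bit groups).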
Let's illustrate this with a FF 00 FF 00 sequence:
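11111111 00000000 11111111 00000000 11111111 00000000 11111111 00000000 11111111 00000000 11111111 00000000
111111 110000 000011 111111 000000 001111 111100 000000 111111 110000 000011 111111 000000 001111 111100 000000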
The first line shows the bits grouped per 8 (i.e. a byte), and the second line shows the same bits grouped per 6 (i.e. a BASE64 unit). Of the 16 BASE64 units, there are 4 with value 000000 (that's 25%).
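The regrouping of the FF 00 sequence can be reproduced with a few lines of Python. This is just a sketch: the function name six_bit_units is hypothetical, and it assumes the input length is a multiple of 3 bytes, so it does no padding.

```python
def six_bit_units(data: bytes):
    """Split the bits of data into 6-bit groups (input length must be a multiple of 3)."""
    bits = "".join(f"{b:08b}" for b in data)
    return [bits[i:i + 6] for i in range(0, len(bits), 6)]

data = bytes.fromhex("FF00" * 6)  # FF 00 repeated: 12 bytes, 96 bits
units = six_bit_units(data)
zero_units = units.count("000000")

print(" ".join(f"{b:08b}" for b in data))  # bits grouped per 8
print(" ".join(units))                     # same bits grouped per 6
print(zero_units, "of", len(units))        # 4 of 16
```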
With true ASCII characters (most-significant bit is 0), there will be even more 000000 values (i.e. more than 25%).
Each possible BASE64 unit (there are 64 possibilities) is represented by a character: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/.
Unit 000000 is represented by character A, 000001 by character B, ...
Let's put all this together:
- ASCII text encoded as UTF16 contains many bytes with value 0 (50%)
- This sequence prepared for BASE64 contains many 000000 units (minimum 25%)
- And represented in BASE64, this sequence contains many A characters (minimum 25%)
- BASE64 encoded, command-line PowerShell scripts contain many A characters (minimum 25%)
In fact, the prevalence of character A in the example above is 41.417%.
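You can measure this prevalence for any script with a sketch like the following. The script text is a made-up example (not the one from the diary entry), the function names encode_command and prevalence_of_a are hypothetical, and the percentage counts any trailing = padding as part of the length.

```python
import base64

def encode_command(script: str) -> str:
    """Encode a script as -EncodedCommand expects: UTF16 little-endian, then BASE64."""
    return base64.b64encode(script.encode("utf-16-le")).decode("ascii")

def prevalence_of_a(encoded: str) -> float:
    """Percentage of characters in the BASE64 string that are an uppercase A."""
    return 100.0 * encoded.count("A") / len(encoded)

script = "Write-Host 'Hello from the ISC'"  # hypothetical example script
encoded = encode_command(script)

print(encoded)
print(f"{prevalence_of_a(encoded):.3f}%")
```

For pure ASCII scripts like this one, the result should come out above the 25% minimum derived above.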