Last Updated: 2021-01-18 06:48:17 UTC
by Didier Stevens (Version: 1)
A reader pointed us to a malicious Word document.
There aren't any long strings in this file (the longest is 33 characters). So there isn't a payload here that we can extract directly, like we did in diary entry "Maldoc Strings Analysis".
Let's check if there are URLs in this file, by grepping for http:
Let's take a look at the longest strings (-n 20: strings at least 20 characters long):
If you are a bit familiar with the internals of Word documents, you might recognize these as the names of XML files found inside OOXML files (.docx, .docm, .xlsx, ...).
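For readers without a strings utility at hand, the length-based triage used so far can be approximated in a few lines of Python. This is only a sketch: the function names ascii_strings and longest_strings are made up for illustration, and it only considers runs of printable ASCII, which is roughly (not exactly) what the strings command reports.

```python
import re


def ascii_strings(data: bytes, min_len: int = 4) -> list[bytes]:
    """Return runs of printable ASCII that are at least min_len bytes long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return pattern.findall(data)


def longest_strings(data: bytes, min_len: int = 20) -> list[bytes]:
    """Return the strings of at least min_len bytes, longest first."""
    return sorted(ascii_strings(data, min_len), key=len, reverse=True)


if __name__ == "__main__":
    # Hypothetical sample bytes, not taken from the analyzed document.
    sample = b"\x00\x01word/_rels/document.xml.rels\x00ab\x00[Content_Types].xml\xff"
    for s in longest_strings(sample, min_len=10):
        print(len(s), s.decode("ascii"))
```

The default of 20 for longest_strings mirrors the -n 20 used above.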
Let's try oledump.py:
This means that there are no OLE files inside this OOXML file, hence no VBA macros.
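The same conclusion can be cross-checked without oledump.py. The sketch below assumes that an OOXML file is a ZIP container and that an embedded OLE file (such as a VBA project) starts with the well-known OLE signature D0 CF 11 E0 A1 B1 1A E1. The function name ole_members is hypothetical, and this is not how oledump.py itself is implemented.

```python
import io
import zipfile

# Signature found at the start of OLE (Compound File Binary) files.
OLE_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def ole_members(zip_bytes: bytes) -> list[str]:
    """Return the names of ZIP members whose content starts with the OLE signature."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return [name for name in zf.namelist() if zf.read(name)[:8] == OLE_MAGIC]
```

An empty list corresponds to the situation described here: no OLE files inside the container, hence no VBA macros.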
It looks like this OOXML file only contains XML files (extensions .xml and .rels). Let's verify this by using option -e to get statistics on the content of the contained files:
Here is a closer look at the statistics:
All contained files start with <?xm, the beginning of an XML declaration (<?xml), and have only printable ASCII characters (except one file with 90 bytes >= 127).
So we have no binary files in here, just text files. One possible scenario is that this .docx file contains a reference (URL) to a malicious payload.
The next step is to extract all files and search for URLs in them. Now, in Office OOXML files, you will find a lot of legitimate URLs. To get an idea of what type of URLs we have in this document, we use my re-search.py tool to extract URLs, and display a unique list of hostnames found in these URLs, like this:
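A comparable per-file check can be sketched with Python's zipfile module. The function name member_stats and the exact set of counters are assumptions for illustration, and they do not reproduce the tool's output format. The threshold for "high" bytes (>= 127) is the one mentioned above.

```python
import io
import zipfile


def member_stats(zip_bytes: bytes) -> dict[str, dict]:
    """Collect simple content statistics for every member of a ZIP container."""
    stats = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            data = zf.read(info)
            stats[info.filename] = {
                "size": len(data),
                # First four bytes, e.g. b"<?xm" for an XML declaration.
                "head": data[:4],
                # Bytes outside 7-bit printable ASCII (>= 127).
                "high_bytes": sum(1 for b in data if b >= 127),
                # Control characters other than tab, LF and CR.
                "control_bytes": sum(1 for b in data if b < 32 and b not in (9, 10, 13)),
            }
    return stats
```

A member with a non-XML head or a large number of control bytes would be a candidate for closer inspection as a binary file.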
The following hostnames are legitimate, and found in Office OOXML files:
But the IP address is not. So let's extract the full URLs now, and grep for 104:
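The hostname triage described above can be sketched in plain Python as well. This is not re-search.py: the regular expression is a deliberately simple assumption, the function names are hypothetical, and the IP address in the example is a documentation placeholder (192.0.2.10), not the address found in the sample.

```python
import re
from urllib.parse import urlsplit

# Simplistic URL pattern: scheme followed by anything up to whitespace, quote or angle bracket.
URL_RE = re.compile(rb"https?://[^\s\"'<>]+")


def extract_urls(data: bytes) -> list[str]:
    """Return all http/https URLs found in raw bytes."""
    return [m.decode("ascii", "replace") for m in URL_RE.findall(data)]


def unique_hostnames(urls: list[str]) -> list[str]:
    """Return the sorted, de-duplicated hostnames of the given URLs."""
    hosts = {urlsplit(u).hostname for u in urls}
    return sorted(h for h in hosts if h)


if __name__ == "__main__":
    # Hypothetical content of a .rels file; the IP address is a placeholder.
    xml = (b'<Relationship Type="http://schemas.openxmlformats.org/x" '
           b'Target="http://192.0.2.10/payload.doc"/>')
    urls = extract_urls(xml)
    print(unique_hostnames(urls))
    print([u for u in urls if "192.0.2." in u])
```

Listing hostnames first keeps the output short; the full URLs are only needed for the one host that does not belong.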
I downloaded this document. Let's start again with strings:
4555 characters long: this might be a payload. Let's take a look:
This looks like a lot of hexadecimal data. That's interesting. And notice the 3 curly braces at the end. Hexadecimal data and curly braces: this might be a malicious RTF document. Let's check with the file command (I use my tool file-magic.py on Windows):
This is indeed an RTF file. RTF files cannot contain VBA code. If they are malicious, they often contain an exploit, stored as (obfuscated) hexadecimal characters inside the RTF file. Hence the strings command will not be of much use.
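To illustrate why hexadecimal data matters here, the sketch below checks for the standard RTF signature and decodes the longest run of hexadecimal digit pairs back into bytes. It is a naive illustration under stated assumptions: it uses the standard {\rtf signature, it ignores whitespace and control words inside the hex data, and it performs no deobfuscation. The function names are hypothetical.

```python
import re


def looks_like_rtf(data: bytes) -> bool:
    """Check for the standard RTF signature at the start of the data."""
    return data.startswith(b"{\\rtf")


def decode_longest_hex_run(data: bytes) -> bytes:
    """Find the longest uninterrupted run of hex digit pairs and decode it to bytes."""
    runs = re.findall(rb"(?:[0-9a-fA-F]{2})+", data)
    if not runs:
        return b""
    return bytes.fromhex(max(runs, key=len).decode("ascii"))


if __name__ == "__main__":
    # Hypothetical miniature RTF fragment, not the analyzed sample.
    sample = b"{\\rtf1{\\object\\objdata 4d5a9000}}}"
    print(looks_like_rtf(sample), decode_longest_hex_run(sample))
```

Real samples usually break up the hex data on purpose, which is why a dedicated parser is the better choice for actual analysis.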
I recently updated my tool rtfdump.py to make analysis of embedded objects (like malicious payloads) easier. We use the new option -O to get an overview of all objects found inside this RTF file:
There's one object with name equation... . It's very likely that this is an exploit for the equation editor, and that we have to extract and analyze shellcode.
Let's extract this payload and write it to a file:
Let's see if there are some interesting strings:
The equation editor that is targeted here only exists as a 32-bit executable. Hence the shellcode must also be 32-bit, and we can use the shellcode emulator scdbg to help us.
We use option -findsc to let scdbg search for entry points, option -r to produce a report, and -f shellcode to pass the shellcode file for analysis:
The shellcode emulator found 4 entry points (numbered 0 to 3). I select entry point 0. This results in the emulation of shellcode, that calls the Win32 API (GetProcAddress, ...). This is clearly 32-bit Windows shellcode. And it decodes itself into memory. We can use option -d to dump the decoded shellcode:
This creates a file: shellcode.unpack. Let's use strings again on this file:
This looks more promising. What are the longest strings?
And finally, we have our URL.