Extracting scripts and data from suspect PDF files

Published: 2008-07-15
Last Updated: 2008-07-15 11:53:58 UTC
by Maarten Van Horenbeeck (Version: 1)

Over the last few weeks we’ve received a small number of inquiries on how to assess potentially malicious PDF files. As with any file format, there are two ways to get started: either use a sandbox running a presumably vulnerable version of the file parser (in this case Acrobat Reader), or take a closer look at the file format itself.

The former is the easiest way to go and is probably suitable for most situations. The vast majority of exploit PDFs we have seen execute reliably on an unpatched Acrobat Reader 7, so it’s trivial to get this going. However, in some cases you may want to understand the execution path inside the PDF, not just how it affects one particular target system, or you may simply not have a sandbox environment handy.

The core document describing the PDF format is the PDF Reference 1.7, which can be downloaded from the Adobe PDF developer center. The most interesting information for analysis purposes, an overview of the format, starts on page 90.

Broadly put, a PDF file consists of a header indicating the version, followed by a body made up of a number of objects. At the end of the file is the so-called xref (cross-reference) table, which records the byte offsets of the objects within the file to allow speedy access. An update to the file changes not only the objects but also the xref table.
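The layout described above can be checked quickly from a script. The following is a minimal Python sketch (not part of the original diary; the function name pdf_outline and the sample bytes are made up for illustration) that reports the header version, the object numbers it can find, and the last startxref offset. It uses plain regular expressions rather than a real parser, so it may also match byte sequences that merely look like objects inside binary streams.

```python
import re

def pdf_outline(data: bytes) -> dict:
    """Report header version, object identifiers and the last startxref offset."""
    # Readers tolerate some junk before the header, so search the first 1024 bytes.
    header = re.search(rb"%PDF-(\d+\.\d+)", data[:1024])
    objects = re.findall(rb"(\d+)\s+(\d+)\s+obj\b", data)
    startxref = re.findall(rb"startxref\s+(\d+)", data)
    return {
        "version": header.group(1).decode() if header else None,
        "objects": [(int(num), int(gen)) for num, gen in objects],
        # Updates append a new xref section, so the last startxref is the current one.
        "xref_offset": int(startxref[-1]) if startxref else None,
    }

# Hand-made fragment for illustration only; the offset in it is not accurate.
sample = (b"%PDF-1.4\n"
          b"1 0 obj << /Type /Catalog >> endobj\n"
          b"xref\n0 2\n"
          b"trailer << /Size 2 /Root 1 0 R >>\n"
          b"startxref\n45\n%%EOF\n")
print(pdf_outline(sample))
```

On a real sample, a version, object list or xref offset that looks inconsistent is often the first hint that the file was not produced by a standard generator.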

Simple objects can look like:

5 0 obj [statements] endobj

Such objects generally describe aspects of how the PDF file should be presented. Another type of object is the “stream”, which can contain various types of data, such as images or scripts, encoded in a number of different ways.

Just last week, we received a copy of a malicious file “basketball roster.pdf”. Scanning the file as-is with VirusTotal showed that detection was lacking:

basketball_roster.pdf MD5 44cf41479559b0dc72a2330a9e8ec6c1

Antivirus          Version        Last update  Result
AhnLab-V3          2008.7.11.0    2008.07.10   -
AntiVir            7.8.0.64       2008.07.11   HTML/Shellcode.Gen
Authentium         5.1.0.4        2008.07.10   -
Avast              4.8.1195.0     2008.07.11   -
AVG                7.5.0.516      2008.07.11   -
BitDefender        7.2            2008.07.11   -
CAT-QuickHeal      9.50           2008.07.10   -
ClamAV             0.93.1         2008.07.11   -
DrWeb              4.44.0.09170   2008.07.11   -
eSafe              7.0.17.0       2008.07.10   -
eTrust-Vet         31.6.5946      2008.07.11   -
Ewido              4.0            2008.07.11   Not-A-Virus.Exploit.Win32.Pidief.ax
F-Prot             4.4.4.56       2008.07.10   -
F-Secure           7.60.13501.0   2008.07.10   -
Fortinet           3.14.0.0       2008.07.11   -
GData              2.0.7306.1023  2008.07.11   -
Ikarus             T3.1.1.26.0    2008.07.11   HTML.Shellcode
Kaspersky          7.0.0.125      2008.07.11   -
McAfee             5336           2008.07.10   -
Microsoft          1.3704         2008.07.11   -
NOD32v2            3262           2008.07.11   -
Norman             5.80.02        2008.07.10   -
Panda              9.0.0.4        2008.07.10   -
Prevx1             V2             2008.07.11   -
Rising             20.52.41.00    2008.07.11   -
Sophos             4.31.0         2008.07.11   -
Sunbelt            3.1.1509.1     2008.07.04   -
Symantec           10             2008.07.11   -
TheHacker          6.2.96.376     2008.07.10   -
TrendMicro         8.700.0.1004   2008.07.11   -
VBA32              3.12.6.9       2008.07.11   -
VirusBuster        4.5.11.0       2008.07.10   -
Webwasher-Gateway  6.6.2          2008.07.11   Script.Shellcode.Gen

The first thing I generally do with this type of file is look for any embedded Javascript. Most bugs affecting Acrobat Reader have involved its Javascript method handling engine, so this is a sensible place to start. A quick search with a hex editor revealed two interesting objects: one containing Javascript, the other a binary.
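A hex editor is not the only way to run that first search. As a rough Python sketch (not from the original diary; the keyword list and the function name keyword_hits are assumptions chosen for illustration), the raw bytes can be scanned for PDF names that usually accompany scripts or automatic actions:

```python
import re

# Assumed, non-exhaustive list of names that tend to mark script-bearing
# or auto-run objects.
KEYWORDS = [b"/JavaScript", b"/JS", b"/OpenAction", b"/AA"]

def keyword_hits(data: bytes) -> dict:
    """Count how often each keyword occurs as a complete PDF name."""
    hits = {}
    for keyword in KEYWORDS:
        # Require a delimiter (or end of data) after the name so that
        # /JS does not also match a longer name such as /JSomething.
        pattern = re.escape(keyword) + rb"(?=[\s/<>\[\]()]|$)"
        hits[keyword.decode()] = len(re.findall(pattern, data))
    return hits

# Hand-made fragment for illustration only.
sample = (b"1 0 obj << /OpenAction 2 0 R >> endobj\n"
          b"2 0 obj << /S /JavaScript /JS (app.alert(1)) >> endobj\n")
print(keyword_hits(sample))
```

A search like this only sees uncompressed parts of the file, and it will miss names written with #xx hex escapes, which the PDF format allows inside names. A zero count is therefore not proof that no script is present.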

The stream description indicates that the FlateDecode filter has been applied to the data. The PDF standard supports 10 different stream filters, of which FlateDecode is the most common. Reader applications use zlib to inflate data that was compressed this way. Compressing streams both allows a wider set of characters to be used and makes the overall file smaller than the sum of its uncompressed objects.
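To see which filters a file declares without reading it in a hex editor, something like the following Python sketch can help. It is an illustration only, not the author's method: the function name stream_filters and the sample bytes are hypothetical, and the regular expressions assume reasonably well-formed object dictionaries.

```python
import re

def stream_filters(data: bytes) -> list:
    """Return (object number, [filter names]) for every object that holds a stream."""
    results = []
    # Match from "N G obj" up to the "stream" keyword, without crossing an
    # "endobj", so a stream is never attributed to an earlier object.
    pattern = rb"(\d+)\s+\d+\s+obj\b((?:(?!endobj).)*?)stream\r?\n"
    for match in re.finditer(pattern, data, re.S):
        dictionary = match.group(2)
        # /Filter is followed by either a single name or an array of names.
        found = re.search(rb"/Filter\s*(\[[^\]]*\]|/\w+)", dictionary)
        names = re.findall(rb"/(\w+)", found.group(1)) if found else []
        results.append((int(match.group(1)), [n.decode() for n in names]))
    return results

# Hand-made fragment for illustration only; the stream bodies are placeholders.
sample = (b"1 0 obj << /Type /Catalog >> endobj\n"
          b"2 0 obj << /Length 4 /Filter /FlateDecode >>\nstream\r\nabcd\r\nendstream endobj\n"
          b"3 0 obj << /Length 2 /Filter [/ASCIIHexDecode /FlateDecode] >> stream\n00\nendstream endobj\n")
print(stream_filters(sample))
```

When several filters are listed in an array, they have to be undone in the order given, so a stream declared as [/ASCIIHexDecode /FlateDecode] needs hex decoding before it can be inflated.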

In this case, as both suspicious objects have been rendered unreadable through compression, we want to uncompress them for further review. The easiest way to inflate deflated PDF content from the command line is the pdfinflt.ps script included with Ghostscript:

[maarten@mojave ghostscript-8.54]$ gs -- toolbin/pdfinflt.ps /tmp/roster.pdf /tmp/roster.out

ESP Ghostscript 815.02 (2006-04-19)
Copyright (C) 2004 artofcode LLC, Benicia, CA. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
   **** Warning: File has a corrupted %%EOF marker, or garbage after %%EOF.
   **** Warning: An error occurred while reading an XREF table.
   **** The file has been damaged. This may have been caused
   **** by a problem while converting or transfering the file.
   **** Ghostscript will attempt to recover the data.
   **** Warning: There are objects with matching object and generation
   **** numbers. The accuracy of the resulting image is unknown.
ERROR: /undefined in /BXlevel
Operand stack:
   --nostringval-- 51 0 2 --dict:6/6(ro)(G)-- obj
Execution stack:
   %interp_exit .runexec2 --nostringval-- --nostringval-- --nostringval-- 2 %stopped_push --nostringval-- --nostringval-- --nostringval-- false 1 %stopped_push 1 3 %oparray_pop 1 3 %oparray_pop 1 3 %oparray_pop 1 3 %oparray_pop .runexec2 --nostringval-- --nostringval-- --nostringval-- 2 %stopped_push --nostringval-- --nostringval-- --nostringval-- --nostringval-- --nostringval-- --nostringval-- false 1 %stopped_push --nostringval-- %loop_continue --nostringval--
Dictionary stack:
   --dict:1087/1686(ro)(G)-- --dict:0/20(G)-- --dict:143/200(L)-- --dict:241/347(ro)(G)-- --dict:18/24(L)--
Current allocation mode is local
Current file position is 4774
ESP Ghostscript 815.02: Unrecoverable error, exit code 1

[maarten@mojave ghostscript-8.54]$

Alas, PDF exploits are often not created with the most standards-compliant generators, and when they exploit a parser issue it makes sense that they don’t parse cleanly. Inflating all objects in the file using a stock tool seems to be a no-go.

Luckily, there’s a great Perl interface to the zlib library, Compress::Zlib, and it’s trivial to write an inflater script:

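use strict; use warnings;
use Compress::Zlib;

# Inflate a zlib stream read from STDIN and write the result to STDOUT.
my $processor = inflateInit() or die "Cannot create inflation stream\n";
binmode STDIN; binmode STDOUT;
my $status = Z_OK;

# Feed the input to zlib in 8 KB chunks until the stream ends or an error occurs.
while ($status == Z_OK && read(STDIN, my $flatfish, 8192)) {
    (my $blowfish, $status) = $processor->inflate($flatfish);
    print $blowfish if defined $blowfish;
}
die "Parsing error or truncated stream\n" unless $status == Z_STREAM_END;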

The only thing remaining now would be to copy the stream content from the file into a new binary file and feed it into the script. However, things are a little more complicated: while the deflated content is zlib data, this stream starts with a different zlib header than the one the library accepted.

When opening the PDF in a hex editor, the stream data actually starts after the 0D 0A marker following the “stream” keyword. The next two bytes, 48 89, are the stream's zlib header (a valid header that declares a 4 KB window, where the more common 78 9C declares a 32 KB window). To get the stream through the library, change these two bytes into a header it accepts, such as 78 9C. Then run the resulting file through the Perl script, with much better results:
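The same header workaround can be sketched in Python, which is used here only for illustration and is not the author's Perl script. The function names inflate_stream and patch_header are hypothetical. Besides overwriting the two header bytes as described above, the sketch shows an alternative that skips the header altogether and inflates the remaining bytes as raw deflate data, which also tolerates junk after the end of the compressed data.

```python
import zlib

def patch_header(raw: bytes) -> bytes:
    """The approach from the text: overwrite the two header bytes with 78 9C."""
    return b"\x78\x9c" + raw[2:]

def inflate_stream(raw: bytes) -> bytes:
    """Alternative: skip the two-byte zlib header and inflate raw deflate data."""
    # A negative wbits value selects raw deflate with a 32 KB window: no header
    # is parsed and no checksum trailer is verified, so bytes that follow the
    # compressed data (such as "endstream") are simply left unused.
    inflater = zlib.decompressobj(-15)
    return inflater.decompress(raw[2:])

# Build a small test stream with a 4 KB window (wbits=12) for illustration.
payload = b"function start() { /* placeholder script */ }"
compressor = zlib.compressobj(6, zlib.DEFLATED, 12)
raw = compressor.compress(payload) + compressor.flush()

print(zlib.decompress(patch_header(raw)) == payload)
print(inflate_stream(raw + b"\r\nendstream") == payload)
```

Skipping the header avoids editing the carved file by hand, at the cost of not verifying the Adler-32 checksum at the end of the stream, so truncated or corrupted output will not be flagged.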

[maarten@mojave ~]$ perl inflate.pl < /tmp/deflated.txt
function re(count,what) {
var v = "";
while (--count >= 0) v += what;
return v;
}
function start() {
sc = unescape("%uc933%ub966%u018c%u1beb%u565e%ufe8b%u66ac%u612d%u6600%ue...

}
if (app.viewerVersion >= 6.0) {
this.collabStore = Collab.collectEmailInfo({subj: "",msg: plin});
}
}

From there, you can apply regular Javascript deobfuscation techniques, as discussed in previous diary entries, to investigate the actual scripting employed. In this specific case, the script creates a crafted set of data which exploits a known vulnerability in the Collab.collectEmailInfo function of Acrobat Reader’s Javascript engine (CVE-2007-5659).
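One common deobfuscation step is turning the %uXXXX string passed to unescape() back into raw bytes so the shellcode can be examined in a disassembler. A minimal Python sketch follows (an illustration only; the function name unescape_u is hypothetical, and it handles only %uXXXX escapes, ignoring any other characters in the string):

```python
import re
import struct

def unescape_u(js_string: str) -> bytes:
    """Convert the %uXXXX escapes in a string into the bytes they occupy in memory."""
    units = re.findall(r"%u([0-9a-fA-F]{4})", js_string)
    # Each %uXXXX is one 16-bit code unit, stored little-endian on x86.
    return b"".join(struct.pack("<H", int(unit, 16)) for unit in units)

# The first four escapes of the sc string shown in the output above.
print(unescape_u("%uc933%ub966%u018c%u1beb").hex())  # prints 33c966b98c01eb1b
```

Note the byte swap within each pair: %uc933 becomes 33 c9, which is the x86 encoding of xor ecx, ecx. Feeding the escapes to a disassembler without swapping them produces meaningless instructions.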

-- Maarten

Keywords: PDF malware zlib


Internet Storm Center
