PDF Analysis
Last updated
Understanding the internal structure of a PDF file is important for effective analysis. A typical PDF consists of several components, as mentioned below:
Header: The beginning of a PDF file, containing the version number (e.g., %PDF-1.7).
Body: Contains objects such as text, images, and embedded files. Objects are defined by numbers and include dictionaries, streams, and arrays.
Cross-Reference Table (xref): Maps object numbers to their byte offset in the file.
Trailer: Marks the end of the file and contains a reference to the xref table.
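These components can be located directly in the raw bytes of a file. The following is a minimal Python sketch, assuming a classic single-revision PDF with a textual xref table. The function name locate_pdf_parts and the hand-made sample bytes are hypothetical illustrations, not part of any analysis tool.

```python
import re

def locate_pdf_parts(data: bytes) -> dict:
    """Find the header version, the trailer keyword, and the xref offset."""
    header = re.match(rb"%PDF-(\d\.\d)", data)
    # "startxref" is followed by the byte offset of the xref table.
    startxref = re.search(rb"startxref\s+(\d+)\s+%%EOF", data)
    return {
        "version": header.group(1).decode() if header else None,
        "trailer_at": data.rfind(b"trailer"),
        "xref_offset": int(startxref.group(1)) if startxref else None,
    }

# Hypothetical one-object PDF, built by hand for illustration only.
body = b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
xref_pos = len(body)  # the xref table starts right after the body
sample = (
    body
    + b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
    + b"trailer\n<< /Size 2 /Root 1 0 R >>\nstartxref\n"
    + str(xref_pos).encode()
    + b"\n%%EOF\n"
)

parts = locate_pdf_parts(sample)
print(parts)
```

Files with incremental updates contain several xref sections and trailers, so a real parser would start from the last startxref entry rather than the first match.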
Malicious documents can take many forms, each exploiting different aspects of document processing software. PDF documents are among the most common types used in phishing campaigns. These documents can embed JavaScript, which can be used to exploit vulnerabilities in PDF readers.
While going through the objects, always look for suspicious keywords. Keywords define the properties and behaviors of PDF objects: they specify document settings, actions, and metadata, and so control how a PDF works.
/OpenAction and /AA: /OpenAction specifies an action to be performed when the document is opened, and /AA (additional actions) ties actions to other trigger events. Malicious actors use these to automatically execute malicious scripts without user interaction.
/Launch: This keyword specifies an action to launch an external application or open a file. This can be used maliciously to execute embedded malware or scripts.
/JavaScript and /JS: /JavaScript specifies a JavaScript action, while /JS defines the actual script to be executed. Malicious JavaScript can perform a variety of harmful actions, such as downloading malware or stealing information.
/Names: This includes the names of files that will likely be referred to by the PDF itself. Malicious documents often contain embedded files that are intended to be dropped on the system. The names of these files can be found here. Inspect any entries under /Names carefully.
/EmbeddedFile: Used to embed files within the PDF. Malicious PDFs often use this to include executable files or other payloads.
/URI and /SubmitForm: /URI defines an action that opens a URL, and /SubmitForm defines an action to submit form data to a specified URL. These can be used to steal user information or send data to a malicious server.
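A rough way to triage a file for these keywords is to count them in the raw bytes. The sketch below is a simplified illustration, not a replacement for a dedicated tool: count_keywords and decode_name_escapes are hypothetical helper names, the sample bytes are made up, and the scan only sees uncompressed data, so keywords hidden inside compressed streams are missed.

```python
import re

SUSPICIOUS = ["/OpenAction", "/AA", "/Launch", "/JavaScript", "/JS",
              "/Names", "/EmbeddedFile", "/URI", "/SubmitForm", "/ObjStm"]

def decode_name_escapes(data: bytes) -> bytes:
    """Undo #xx hex escapes in names, e.g. /J#61vaScript -> /JavaScript."""
    return re.sub(rb"#([0-9A-Fa-f]{2})",
                  lambda m: bytes([int(m.group(1), 16)]), data)

def count_keywords(data: bytes) -> dict:
    """Count each suspicious keyword in the (escape-decoded) raw bytes."""
    data = decode_name_escapes(data)
    counts = {}
    for kw in SUSPICIOUS:
        # Reject matches that continue as a longer name (e.g. /AAxyz).
        pattern = re.escape(kw.encode()) + rb"(?![A-Za-z0-9])"
        counts[kw] = len(re.findall(pattern, data))
    return counts

# Hypothetical fragment with an obfuscated /JavaScript name.
sample = (b"<< /Type /Catalog /OpenAction 4 0 R >> "
          b"<< /S /J#61vaScript /JS (app.alert(1)) >>")
counts = count_keywords(sample)
print({k: v for k, v in counts.items() if v})
```

Decoding the #xx escapes first matters because name obfuscation is a simple way to defeat plain string searches.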
We'll analyze a malicious PDF sample that runs Agent Tesla. Agent Tesla is a .NET-based Remote Access Trojan (RAT) and data stealer that is readily available to threat actors due to leaked builders. The malware can log keystrokes, access the host's clipboard, and crawl the disk for credentials or other valuable information. It can send information back to its C&C via HTTP(S), SMTP, FTP, or a Telegram channel.
The switch -e gives additional information, such as entropy, along with object types and associated object entries.
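Entropy measures how random a sequence of bytes looks: compressed or encrypted data scores close to 8 bits per byte, while plain text scores much lower. As a minimal sketch of the idea (the function name shannon_entropy is an illustrative choice, and the tool may compute its figures differently), Shannon entropy can be calculated as follows:

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(data).values())

low = shannon_entropy(b"aaaaaaaa")           # a single repeated byte value
high = shannon_entropy(bytes(range(256)))    # every byte value exactly once
print(low, high)
```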
Reviewing the output from the top, we can observe that the PDF file contains five stream objects, along with an object stream (/ObjStm). As discussed in previous sections, object streams can encapsulate other objects, making them invisible to standard analysis tools. Therefore, it is essential to manually inspect and decode these streams to reveal any hidden objects and their associated data.
Also, the keyword /OpenAction is very suspicious. As the name implies, this PDF entry is used to dictate the behavior of the document when the user opens it. Malware often abuses this feature to gain code execution via cmd.exe or JavaScript.
We can investigate the contents of the /OpenAction keyword by using the --search or -s parameter in pdf-parser, as shown below.
As we can see, the /OpenAction entry is inside object 2. However, the contents of /OpenAction reside in object 4, because its value is the indirect reference "4 0 R" (object number 4, generation 0). We can examine object 4 by using the -o option.
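An indirect reference always has the form "object number, generation number, R". As a small illustration, the sketch below pulls such references out of a dictionary with a regular expression. The helper name find_references is hypothetical, and the catalog text is a made-up stand-in for object 2: only the /OpenAction 4 0 R entry comes from the sample, and the /Pages entry is an assumption added for contrast.

```python
import re

def find_references(obj_text: str) -> list:
    """Return (key, object number, generation) for each '/Key N G R' entry."""
    return [(key, int(num), int(gen))
            for key, num, gen in re.findall(r"/(\w+)\s+(\d+)\s+(\d+)\s+R\b",
                                            obj_text)]

# Hypothetical catalog dictionary resembling object 2.
catalog = "<< /Type /Catalog /Pages 3 0 R /OpenAction 4 0 R >>"
refs = find_references(catalog)
print(refs)
```

Each tuple names the object to inspect next, which is how the analysis moves from object 2 to object 4.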
Interestingly, there is no result for object 4.
The PDFid output that we checked earlier showed /ObjStm present in the PDF file. So let's search for it using pdf-parser, as shown below, by providing the -s or --search parameter.
As we can see, object 1 is an object stream (/ObjStm). The /N entry denotes the number of objects present in the stream; in our case, there are 39 objects present in the stream. The /Filter entry shows the algorithm used to decode the data, which in our case is FlateDecode.
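FlateDecode is zlib/deflate compression, so the raw stream bytes can be inflated with Python's standard zlib module. This is a minimal sketch assuming the stream uses no predictor (no /DecodeParms entry); the function name flate_decode and the sample data are illustrative only.

```python
import zlib

def flate_decode(raw: bytes) -> bytes:
    """Inflate the raw bytes of a stream whose /Filter is /FlateDecode."""
    return zlib.decompress(raw)

# Round-trip on made-up data to show the idea.
original = b"3 0 4 68 << /Type /Pages >>"
compressed = zlib.compress(original)
decoded = flate_decode(compressed)
print(decoded)
```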
Now let's decode object 1 to see the objects present in the stream.
Before we proceed, recall that object streams contain dictionaries. The start and end of a dictionary are identified by the symbols << and >>, respectively. There are 39 dictionaries present in the object stream. Each dictionary represents an object. Each such object has associated entries or PDF keywords.
The challenge here is to identify an object. The object labels can be retrieved by studying the initial numbers mentioned in the decoded object stream:
3 0 4 68 5 93 6 144 7 182 8 205 9 1116 10 1231 11 1248 12 1282 13 1299 14 1376 15 1440 16 1653 17 1725 18 1754 19 1769 20 1833 21 1885 22 2046 23 2060 24 2099 25 2113 26 2274 27 2283 28 2360 29 2414 30 2474 31 2635 32 2796 33 2957 34 3118 36 3279 38 3316 39 3359 40 3396 41 3412 42 3428 44 3453
Here's how it works:
Pairs of Numbers: The decoded stream begins with /N pairs of numbers, one pair per object (39 pairs in our case).
First number in the pair: This is the label (object number) of the object.
Second number in the pair: This is the offset (distance in bytes) from /First where the object's data is located.
/First: This entry in the object stream's dictionary gives the position in the decoded stream where the first object starts, i.e., just past the list of number pairs.
Order Matters: The position of the label in the sequence matches the order of the object in the stream.
The sequence starts with 3 0 4 68 5 93...
Label 3 is at offset /First + 0 (first object).
Label 4 is at offset /First + 68 (second object).
Label 5 is at offset /First + 93 (third object).
...and so on.
To understand the logic behind how it works, let's spin up a Python shell and store this whole stream in a variable called stream.
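A minimal sketch of that parsing is shown here, under stated assumptions: the stream has already been decoded, and the /N and /First values are read from the stream dictionary and passed in by hand. The function name parse_objstm and the three-object stream are hypothetical stand-ins for the real 39-object stream; only the /S /Launch /Win 8 0 R dictionary mirrors the sample.

```python
def parse_objstm(decoded: bytes, n: int, first: int) -> dict:
    """Split a decoded object stream into {object number: object bytes}.

    n and first are the /N and /First values from the stream dictionary.
    """
    # Everything before /First is the list of (object number, offset) pairs.
    numbers = [int(tok) for tok in decoded[:first].split()]
    pairs = list(zip(numbers[0::2], numbers[1::2]))[:n]
    objects = {}
    for i, (obj_num, offset) in enumerate(pairs):
        start = first + offset
        # An object ends where the next one begins, or at the end of the stream.
        end = first + pairs[i + 1][1] if i + 1 < len(pairs) else len(decoded)
        objects[obj_num] = decoded[start:end].strip()
    return objects

# Build a small hypothetical stream holding objects 3, 4 and 5.
bodies = [b"<< /Type /Pages >>", b"<< /S /Launch /Win 8 0 R >>", b"<< /Type /Page >>"]
labels = [3, 4, 5]
offsets, pos = [], 0
for body in bodies:
    offsets.append(pos)
    pos += len(body) + 1  # +1 for the newline separating objects
header = " ".join(f"{lab} {off}" for lab, off in zip(labels, offsets)).encode() + b"\n"
stream = header + b"\n".join(bodies)

objects = parse_objstm(stream, n=3, first=len(header))
for obj_num, data in objects.items():
    print(obj_num, data)
```

Keying the result by object number makes it easy to follow a reference such as 4 0 R straight to the hidden object.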
This is the logic to parse the stream of /ObjStm objects. Once all the hidden objects are extracted from the stream object, we can continue our investigation related to the /OpenAction keyword.
The /OpenAction in object 2 pointed to object 4. Now we can see the contents of object 4 here in the above table.
The key /S /Launch indicates that it's a launch action, which is used to run an external application. The /Win 8 0 R part references another object (object 8, generation 0) that contains the details of the command to be executed. Let's check object 8.
The /P key holds a long string of hexadecimal characters, which is a payload. When decoded, this is a JavaScript payload designed to perform some malicious action. The /F key indicates the file to be executed, which is C:\\Windows\\System32\\mshta. This is a legitimate Windows executable used to execute HTML Applications (HTA). In this context, it is being used to execute the JavaScript payload contained in the /P key.
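Decoding a hexadecimal string like the one in /P takes only a few lines of Python. The sketch below assumes the value is written as a PDF hex string (hex digits between < and >, possibly with whitespace). The function name decode_hex_string is illustrative, and the short sample value is a made-up stand-in, not the real payload.

```python
def decode_hex_string(hex_text: str) -> str:
    """Decode a PDF hex string such as <6A61...> into readable text."""
    # Drop the angle brackets and any whitespace between digits.
    cleaned = "".join(hex_text.strip("<>").split())
    if len(cleaned) % 2:
        cleaned += "0"  # a missing final digit is treated as 0
    return bytes.fromhex(cleaned).decode("latin-1")

# Hypothetical value for illustration only.
sample = "<6A61766173637269 70743A>"
decoded_text = decode_hex_string(sample)
print(decoded_text)
```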
The JavaScript code executes a series of actions designed to run a PowerShell script. It first instantiates a WScript.Shell object via ActiveXObject to execute a PowerShell command using the Run method. It also creates a Scripting.FileSystemObject. The PowerShell command sets the session to use the TLS 1.2 security protocol, ensuring compatibility with modern HTTPS endpoints, and includes the -ep Bypass flag to bypass the execution policy and allow unrestricted script execution. It uses Invoke-RestMethod (irm) to fetch a script from htlfeb24.blogspot.com/.../atom.xml and immediately executes it via Invoke-Expression (iex). A Start-Sleep -Seconds 5 command introduces a 5-second delay, likely to evade certain detection mechanisms. Finally, the script removes itself using Scripting.FileSystemObject, likely to eliminate forensic evidence.
The above logic was explained in detail so that we can understand how the whole process works. To make this process easier, this can be done automatically using the --objstm parameter of pdf-parser.
Let's also use another tool called PeePDF, which is an interactive tool useful for analyzing PDF documents.
Let's see the details regarding the object related to the /Launch element.
We can see that it refers to another object, 8 0. We can open the details of object 8, which reveals the decoded JavaScript that runs the PowerShell command to download a file, atom.xml (most probably a PowerShell script), and execute it.
Let's now check the /OpenAction element as well, i.e., object 2.
This also leads to the final URL where the malicious PowerShell script is hosted (not available at the time of analysis). The script is downloaded and executed using iex. Then it goes to sleep and later deletes the script file to hide artifacts.
Let's also see the other additional actions specified in /AA.
This just pops up an alert window. Let's check the /AA element 15.
All of these referencing objects refer to the same URI where the script is hosted.
We can dump the image file as a JPEG file using -d in pdf-parser.
In the above output from pdf-parser, we can see that this XObject has the /Subtype /Image. It also has a width and a height. It uses a different filter, /Filter /DCTDecode, which indicates that the stream data is a JPEG image.
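A quick sanity check on a dumped /DCTDecode stream is to look at its magic bytes: JPEG data starts with the marker FF D8 and normally ends with FF D9. The sketch below is illustrative only. The function name looks_like_jpeg is hypothetical and the byte strings are made-up test values, not data from the sample.

```python
def looks_like_jpeg(data: bytes) -> bool:
    """Check for the JPEG start-of-image (FF D8 FF) and end-of-image (FF D9) markers."""
    return data[:3] == b"\xff\xd8\xff" and data[-2:] == b"\xff\xd9"

# Made-up byte strings for illustration.
fake_jpeg = b"\xff\xd8\xff\xe0" + b"\x00" * 16 + b"\xff\xd9"
not_jpeg = b"%PDF-1.7"
print(looks_like_jpeg(fake_jpeg), looks_like_jpeg(not_jpeg))
```

Some real files carry extra bytes after the end marker, so a failed end check alone is a hint to look closer, not proof that the data is not a JPEG.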
The screenshot below shows this image loaded by the PDF viewer, along with the link it tries to visit, which we extracted earlier.
Q1) Locate the sample in the directory "C:\Tools\Maldoc\PDF\Demo\Samples\WikiLoader". Perform analysis of the objects within the sample. What is the value of /URI in object 7? Answer format is a URL.
Answer: https://infplaute.com/international-commercial