
PDF file format: Basic structure Guide by Blackhat Pakistan 2023

In today’s article, we will learn about the basic structure of the PDF file format.

We all know that there are a number of attacks where an attacker embeds some shell code into a PDF document. This shellcode exploits a type of vulnerability in the way a PDF document is parsed and presented to users to execute malicious code on the target system.

The following figure represents the number of vulnerabilities discovered in the popular Adobe Acrobat Reader DC PDF reader, which was released in 2015 and became the only supported version of Acrobat Reader after Acrobat XI ended support in October 2017. The number of vulnerabilities has been growing over the years. The most important vulnerabilities are code execution vulnerabilities, which an attacker can use to execute arbitrary code on a target system (if Acrobat Reader has not already been patched).

Figure 1: Adobe Acrobat Reader DC vulnerabilities

This is an important indicator that we should update our PDF Reader regularly, as the number of recently discovered vulnerabilities is quite daunting.

PDF file structure


Whenever we want to discover new vulnerabilities in software, we should first understand the protocol or file format in which we are trying to discover new vulnerabilities. In our case, we should first understand the PDF format in detail. In this article, we’ll take a look at the PDF file format and its ins and outs.

PDF is a portable document format that can be used to present documents containing text, images, multimedia elements, web page links, and more. It has a wide range of functions. The PDF file format specification is publicly available here and can be used by anyone interested in the PDF file format. There are nearly 800 pages of documentation just for the PDF format, so reading it isn’t something you can do on a whim.

A PDF has more features than just text: it can contain images and other multimedia elements, be password protected, run JavaScript, and so on. The basic structure of a PDF file is shown in the image below:

Figure 2: PDF structure

Every PDF document has the following elements:

Header
This is the first line of the PDF file and specifies the version number of the PDF specification used by the document. If we want to find out, we can use a hex editor or simply use the xxd command as shown below:

[simple]

xxd temp.pdf | head -n 1

0000000: 2550 4446 2d31 2e33 0a25 c4e5 f2e5 eba7 %PDF-1.3.%……
[/simple]

The temp.pdf PDF document uses the PDF 1.3 specification. The ‘%’ character starts a comment in a PDF, so the first and second lines of the example above are both comments, a pattern found in most PDF documents. The bytes 2550 4446 2d31 2e33 0a25 from the output above correspond to the ASCII text “%PDF-1.3”, a newline, and a second “%”. The bytes that follow (c4e5 f2e5 eba7) are non-printing characters (note the “.” dots in the xxd output); they usually tell software that the file contains binary data and should not be treated as 7-bit ASCII text. In the PDF 1.7 specification, version numbers are of the form 1.N, where N is in the range 0-7.
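
The header check can also be done programmatically. The following is a minimal Python sketch, not a full parser: the function name `pdf_version` is made up for illustration, and the input is simply the 16 bytes from the xxd output above.

```python
import re

def pdf_version(data: bytes) -> str:
    """Return the version from a '%PDF-x.y' header, e.g. '1.3'."""
    match = re.match(rb"%PDF-(\d\.\d)", data)
    if match is None:
        raise ValueError("no %PDF- header at start of data")
    return match.group(1).decode("ascii")

# The first 16 bytes of temp.pdf, copied from the xxd output above.
header = bytes.fromhex("255044462d312e330a25c4e5f2e5eba7")

print(pdf_version(header))                    # 1.3
# Bytes after the second '%' are above 127, which marks the file as binary.
print(all(b > 127 for b in header[10:14]))    # True
```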

Body
In the body of a PDF document are objects that usually contain text streams, images, other multimedia elements, etc. The Body section is used to store all the document data that is displayed to the user.

xref table
This is a cross-reference table that contains references to all objects in the document. The purpose of the cross-reference table is to allow random access to objects in the file so that we don’t have to read the entire PDF document to find a specific object. Each object is represented by one entry in the cross-reference table, which is always 20 bytes long. Let’s see an example:

[simple]
xref
0 1
0000000023 65535 f
3 1
0000025324 00000 n
21 4
0000025518 00002 n
0000025632 00000 n
0000000024 00001 f
0000000000 00001 f
36 1
0000026900 00000 n
[/simple]

We can view the cross-reference table of a PDF document by simply opening the PDF with a text editor and scrolling to the bottom of the document. In the example above, we can see that we have four subsections (notice the four lines that contain only two numbers). The first number on these lines is the object number of the first object in the subsection, while the second number indicates how many objects the subsection contains. Each object is represented by a single entry that is 20 bytes long (including the end-of-line marker).

The first 10 bytes are the offset of the object from the beginning of the PDF document to the beginning of this object. This is followed by a space separator with another number specifying the generation number of the object. This is followed by another space separator followed by an “f” or “n” to indicate whether the object is free or in use.

The first object has ID 0; its entry is always free and always has generation number 65535, and it is the head of the linked list of free objects (note the letter “f” means free). The last free entry in that list links back to object 0.

The second subsection starts at object number 3 and contains one entry, object 3, which starts at an offset of 25324 bytes from the beginning of the document. The third subsection has four objects, the first of which has ID 21 and starts at offset 25518 from the beginning of the file. The other objects in this subsection are numbered 22, 23 and 24.

All objects are marked with an “f” or “n” flag. The “f” flag means that the object may still be present in the file, but it is marked as free, so it should not be used. These objects contain a reference to the next free object and a generation number that will be used if the object becomes valid again. The “n” flag is used to represent valid and used objects that contain the offset from the beginning of the file and the object’s generation number.

Notice that object zero points to the next free object in the table, object 23. Since object 23 is also free, it in turn points to the next free object in the table, object 24. But object 24 is the last free object in the file, so it points back to object zero. If we were to write out the above cross-reference table with an explicit object number for each entry, it would look like this:

[simple]
xref
0 1
0000000023 65535 f
3 1
0000025324 00000 n
21 1
0000025518 00002 n
22 1
0000025632 00000 n
23 1
0000000024 00001 f
24 1
0000000000 00001 f
36 1
0000026900 00000 n
[/simple]

The object’s generation number is incremented when the object is freed, so when the object becomes valid again (its flag changes from ‘f’ to ‘n’), the stored generation number is used as is, without being incremented again. The generation number of object 23 is 1, so if it becomes valid again, its generation number will still be 1, but if it is deleted again, the generation number will increase to 2.

PDFs that have been updated over time usually have multiple subsections, otherwise only one subsection starting with zero should be present.
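
To make the subsection layout and the free list concrete, here is a small Python sketch that reads the example table from this section. The function names `parse_xref` and `free_list` are illustrative, and the code assumes a well-formed classic table given as text lines; it is not a robust parser.

```python
TABLE = """
xref
0 1
0000000023 65535 f
3 1
0000025324 00000 n
21 4
0000025518 00002 n
0000025632 00000 n
0000000024 00001 f
0000000000 00001 f
36 1
0000026900 00000 n
"""

def parse_xref(lines):
    """Map object number -> (first field, generation, flag)."""
    it = iter(lines)
    if next(it).strip() != "xref":
        raise ValueError("table must start with the xref keyword")
    entries = {}
    for line in it:
        parts = line.split()
        if len(parts) != 2:
            break  # anything that is not a subsection header ends the table
        first, count = int(parts[0]), int(parts[1])
        for obj_num in range(first, first + count):
            field, gen, flag = next(it).split()
            entries[obj_num] = (int(field), int(gen), flag)
    return entries

def free_list(entries):
    """Follow the linked list of free objects, starting at object 0."""
    chain, current = [0], entries[0][0]
    while current != 0 and current not in chain:
        chain.append(current)
        current = entries[current][0]
    return chain

entries = parse_xref(TABLE.strip().splitlines())
print(sorted(entries))        # [0, 3, 21, 22, 23, 24, 36]
print(free_list(entries))     # [0, 23, 24]
```

For free entries the first field is the number of the next free object rather than a byte offset, which is why `free_list` can follow it directly.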

Trailer
The PDF trailer specifies how the application loading the PDF document should find the cross-reference table and other special objects. All PDF readers should start reading the PDF from the end of the file. An example trailer is shown below:
trailer
<<
/Size 22
/Root 2 0 R
/Info 1 0 R
>>
startxref
24212
%%EOF
The last line of the PDF contains the end-of-file marker “%%EOF”. Before it is the startxref keyword, followed on the next line by the offset from the beginning of the file to the cross-reference table. In our case, the cross-reference table starts at offset 24212 bytes. Before that is the trailer keyword, which marks the start of the trailer section. The contents of the trailer are enclosed in << and >> characters (this is a dictionary that holds key-value pairs).

We can see that the trailer section defines several keys, each with a specific purpose. The trailer section can specify the following keys:

  • /Size [integer]: Specifies the number of entries in the cross-reference table (also counts objects in updated sections). The value must not be an indirect reference.
  • /Prev [integer]: Specifies the offset from the beginning of the file to the previous cross-reference section; it is present only if the file has more than one cross-reference section. The value must not be an indirect reference.
  • /Root [dictionary]: An indirect reference to the document catalog, a special object that contains pointers to various other special objects (more on that later).
  • /Encrypt [dictionary]: Specifies the encryption dictionary of the document.
  • /Info [dictionary]: An indirect reference to the document’s information dictionary.
  • /ID [array]: Specifies an array of two unencrypted byte strings that make up the file identifier.
  • /XRefStm [integer]: Specifies the offset from the beginning of the file to a cross-reference stream. This key is only present in hybrid-reference files, which can still be opened by applications that do not support compressed reference streams.
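
Reading the trailer from the end of the file can be sketched in a few lines of Python. This is an illustration under simplifying assumptions: the function name `read_trailer` is made up, the trailer dictionary is assumed to be flat with simple values, and the sample bytes are the example trailer from this section rather than a complete PDF.

```python
import re

def read_trailer(data: bytes):
    """Return (xref offset, trailer keys), scanning backwards from %%EOF."""
    eof = data.rfind(b"%%EOF")
    if eof == -1:
        raise ValueError("missing %%EOF marker")
    start = data.rfind(b"startxref", 0, eof)
    if start == -1:
        raise ValueError("missing startxref keyword")
    xref_offset = int(data[start + len(b"startxref"):eof].split()[0])
    trailer_pos = data.rfind(b"trailer", 0, start)
    body = data[trailer_pos:start] if trailer_pos != -1 else b""
    # Naive key/value scan: each key is a name, each value runs to end of line.
    keys = {k.decode(): v.decode().strip()
            for k, v in re.findall(rb"/(\w+)\s+([^/<>\r\n]+)", body)}
    return xref_offset, keys

sample = b"""trailer
<<
/Size 22
/Root 2 0 R
/Info 1 0 R
>>
startxref
24212
%%EOF
"""

offset, keys = read_trailer(sample)
print(offset)   # 24212
print(keys)     # {'Size': '22', 'Root': '2 0 R', 'Info': '1 0 R'}
```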

Figure 3: PDF structure

When a PDF document is updated, the changes are usually appended to the end of the file instead of being written in place. We can see that the updated PDF document still contains the original header, body, cross-reference table, and trailer. In addition, further body parts, cross-reference sections, and trailers are appended to the PDF document. The additional cross-reference sections only contain entries for objects that have been changed, replaced, or deleted. Deleted objects remain in the file, but are marked with the “f” flag. Each trailer must be terminated with a “%%EOF” marker and should contain a /Prev entry that points to the previous cross-reference section.
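
Because every update ends with its own startxref and “%%EOF” pair, the number of revisions in a file can be estimated by counting those pairs. The Python sketch below assumes this simple layout; the function name `revision_offsets` and the offsets 120 and 480 in the sample bytes are invented for illustration.

```python
import re

def revision_offsets(data: bytes):
    """Return the startxref offset of each revision, oldest first."""
    pattern = rb"startxref\s+(\d+)\s+%%EOF"
    return [int(m.group(1)) for m in re.finditer(pattern, data)]

# A schematic file: the real bodies, tables and trailers are elided as comments.
updated = (
    b"%PDF-1.4\n"
    b"% original body, xref table and trailer\n"
    b"startxref\n120\n%%EOF\n"
    b"% appended body, xref section and trailer with /Prev 120\n"
    b"startxref\n480\n%%EOF\n"
)

offsets = revision_offsets(updated)
print(len(offsets), offsets)   # 2 [120, 480]
```

A reader starts from the last offset and follows the /Prev entries backwards to reach the older sections.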

In PDF versions 1.4 and higher, we can override the default version from the PDF header by specifying a version in the document catalog dictionary.

Example
Let’s download a simple sample PDF document and analyze it. After opening this PDF document, it looks like this:

Figure 4: PDF document sample

The cross-reference and trailer sections are presented in the picture below:

Figure 5: Cross-reference and trailer sections

The cross-reference section has been reduced for clarity. The cross-references section contains one subsection, which itself contains 223 objects. The trailer section starts at byte offset 50291, contains 223 objects, where the root element points to object 221 and the info element points to object 222.

In the next section, we will look at the basic data types of the PDF structure.

PDF data types


A PDF document contains the eight basic types of objects described below. These types are: booleans, numbers, strings, names, arrays, dictionaries, streams, and the null object. Objects can be labeled so that other objects can reference them. A labeled object is called an indirect object.

Booleans
There are two keywords: true and false that represent boolean values.

Numbers
There are two types of numbers in a PDF document: integer and real. An integer consists of one or more digits that may be preceded by a plus or minus sign. Examples of integer objects can be seen below:

123 +123 -123

A real value is represented by one or more digits, with an optional sign and a leading, trailing, or embedded decimal point (period). Examples of real numbers can be seen below:

123.0 -123.0 +123.0 123. -.123
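
The two number forms can be told apart with a pair of regular expressions. This Python sketch is only an illustration of the rules above; the names `INTEGER`, `REAL` and `classify` are not part of any PDF library.

```python
import re

INTEGER = re.compile(r"[+-]?\d+")
# A real needs a decimal point with digits before it, after it, or both.
REAL = re.compile(r"[+-]?(\d+\.\d*|\.\d+)")

def classify(token: str) -> str:
    """Classify a token as a PDF integer, a PDF real, or neither."""
    if INTEGER.fullmatch(token):
        return "integer"
    if REAL.fullmatch(token):
        return "real"
    return "not a number"

for token in ["123", "+123", "-123", "123.0", "123.", "-.123", "1e5"]:
    print(token, "->", classify(token))
```

Exponential notation such as 1e5 is rejected because PDF numbers do not use it.
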
Names
Names in PDF documents are represented by a sequence of ASCII characters in the range 0x21 – 0x7E. Exceptions are the characters %, (, ), <, >, [, ], {, }, / and #, which must be written as their two-digit hexadecimal code preceded by a “#”. Any other character may also be written in this hexadecimal form. There is a limit on the length of a name, which can be at most 127 bytes long.

When writing a name, a slash must be used to indicate the name; the slash is not part of the name, but is a prefix indicating that what follows is a sequence of characters representing the name. If we want to use spaces or any other special character as part of the name, it must be encoded in two-digit hexadecimal notation.

You can see examples of names in the table below:

Figure 6: PDF names (source)
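
Decoding the “#” notation is mechanical, as the following Python sketch shows. The function name `decode_name` is made up, and the sample names are illustrative inputs rather than entries copied from the figure.

```python
import re

def decode_name(token: str) -> str:
    """Strip the leading slash and expand #xx hexadecimal escapes."""
    if not token.startswith("/"):
        raise ValueError("a name must start with a slash")
    return re.sub(r"#([0-9A-Fa-f]{2})",
                  lambda m: chr(int(m.group(1), 16)),
                  token[1:])

print(decode_name("/Lime#20Green"))             # Lime Green
print(decode_name("/paired#28#29parentheses"))  # paired()parentheses
print(decode_name("/A#42"))                     # AB
```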

Strings


Strings in a PDF document are represented as a series of bytes surrounded by parentheses or angle brackets, and can be a maximum of 65535 bytes long. Any character can be written literally in ASCII or, alternatively, in octal or hexadecimal notation. The octal notation requires the character to be written in the form \ddd, where ddd is an octal number. The hexadecimal notation requires the character to be written in the form <dd>, where dd is a hexadecimal number.

An example of a string enclosed in parentheses can be seen below:

(mystring)

An example of a string enclosed in angle brackets can be seen below (the hexadecimal representation is the same string as above and reads “mystring”):

<6d79737472696e67>

When representing a string, we can also use well-known escape sequences. They are: \n for newline, \r for carriage return, \t for horizontal tab, \b for backspace, \f for form feed, \( for left parenthesis, \) for right parenthesis, and \\ for backslash.
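
Both string forms can be decoded with a few lines of Python. This is a simplified sketch: the names `decode_literal` and `decode_hex` are illustrative, the literal decoder expects the text between the outer parentheses, and it does not handle a backslash followed by a line break.

```python
import re

ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "b": "\b", "f": "\f",
           "(": "(", ")": ")", "\\": "\\"}

def decode_literal(body: str) -> str:
    """Expand backslash escapes and \\ddd octal codes in a literal string."""
    def expand(match):
        esc = match.group(1)
        if esc[0] in "01234567":
            return chr(int(esc, 8) & 0xFF)
        return ESCAPES.get(esc, esc)  # unknown escapes drop the backslash
    return re.sub(r"\\([0-7]{1,3}|.)", expand, body)

def decode_hex(token: str) -> bytes:
    """Decode a <...> hexadecimal string; a missing final digit counts as 0."""
    digits = "".join(token.strip("<>").split())
    if len(digits) % 2:
        digits += "0"
    return bytes.fromhex(digits)

print(decode_hex("<6d79737472696e67>"))   # b'mystring'
print(decode_literal(r"my\163tring"))     # mystring
```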

Arrays

Arrays in PDF documents are represented as sequences of PDF objects enclosed in square brackets. The objects can be of different types, so an array can contain any type of object, such as numbers, strings, dictionaries, and even other arrays. An array can also have zero elements. An example array is shown below:

[123 123.0 true (mystring) /myname]
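
Because an array can hold any object type, including other arrays and the dictionaries described in the next section, reading one is naturally recursive. The Python sketch below is a toy parser written for this article’s examples only: the names `TOKEN`, `parse` and `_parse_value` are made up, strings and names are kept as raw tokens, and indirect references, hexadecimal strings and nested parentheses are not supported.

```python
import re

# Delimiters, literal strings, names, and everything else (numbers, keywords).
TOKEN = re.compile(
    r"<<|>>|\[|\]|\((?:[^()\\]|\\.)*\)|/[^\s/\[\]<>()]*|[^\s/\[\]<>()]+")

def parse(text):
    """Parse a single PDF object from text and return a Python value."""
    value, _ = _parse_value(TOKEN.findall(text), 0)
    return value

def _parse_value(tokens, pos):
    tok = tokens[pos]
    if tok == "<<":                          # dictionary: name/value pairs
        result, pos = {}, pos + 1
        while tokens[pos] != ">>":
            key = tokens[pos]
            value, pos = _parse_value(tokens, pos + 1)
            result[key] = value
        return result, pos + 1
    if tok == "[":                           # array: any sequence of objects
        items, pos = [], pos + 1
        while tokens[pos] != "]":
            item, pos = _parse_value(tokens, pos)
            items.append(item)
        return items, pos + 1
    if tok in ("true", "false"):
        return tok == "true", pos + 1
    if tok == "null":
        return None, pos + 1
    if tok[0] in "(/":                       # strings and names stay raw
        return tok, pos + 1
    return (float(tok) if "." in tok else int(tok)), pos + 1

print(parse("[123 123.0 true (mystring) /myname]"))
print(parse("[1 [2 3] << /mykey4 true /mykey5 (mystring) >>]"))
```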

Dictionaries


Dictionaries in a PDF document are represented as a table of key/value pairs. The key must be a name object, while the value can be any object, including another dictionary. The maximum number of entries in a dictionary is 4096. A dictionary is written with its entries enclosed in double angle brackets << and >>. An example dictionary is shown below:

<< /mykey1 123
   /mykey2 0.123
   /mykey3 << /mykey4 true
              /mykey5 (mystring)
           >>
>>

Streams


A stream object is represented by a sequence of bytes and can be of unlimited length, so images and other large data blocks are usually represented as streams. A stream object consists of a dictionary object followed by the keyword stream, a newline, the stream data, and the keyword endstream.

An example of a stream object can be seen below:

<<
/Type /Page
/Length 23 0 R
/Filter /LZWDecode
>>
stream
...
endstream

All stream objects must be indirect objects and the stream dictionary must be a direct object. The stream dictionary specifies the exact number of bytes of the stream. The data should be followed by a new line and the keyword endstream.

Common keys used in stream dictionaries are as follows (note that Length is required; N, First and Extends apply only to object streams):

  • Length: How many bytes of the PDF file are used for stream data. If the stream contains a Filter entry, the length indicates the number of bytes of encoded data.
  • Type: The type of PDF object that the dictionary describes.
  • Filter: The name of the filter that will be used when processing stream data. Multiple filters can be specified in the order in which they are to be applied.
  • DecodeParms: The dictionary or array of dictionaries used by the filters specified by Filter. This value specifies the parameters to pass to the filters when they are applied. It is not necessary if the filters use default values.
  • F: Specifies the file containing the stream data.
  • FFilter: The name of the filter to use when processing the data found in the external stream file.
  • FDecodeParms: A dictionary or array of dictionaries used by the filters specified by FFilter.
  • DL: Specifies the number of bytes in the decoded stream. This can be used, for example, to check whether enough disk space is available to write the stream to a file.
  • N: Number of indirect objects stored in the object stream.
  • First: Offset in the decoded stream of the first compressed object.
  • Extends: Specifies a reference to another object stream; together these streams form an inheritance tree.
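
The interaction between Length and Filter can be illustrated with Python’s standard zlib module. This is a sketch under stated assumptions: the function name `extract_stream` is made up, the Length value must be a direct number (not an indirect reference such as 23 0 R), only FlateDecode is handled, and the sample object is built in the script rather than taken from a real file.

```python
import re
import zlib

def extract_stream(obj: bytes) -> bytes:
    """Return the decoded data of a stream object with a direct /Length."""
    length = int(re.search(rb"/Length\s+(\d+)", obj).group(1))
    start = re.search(rb"stream\r?\n", obj).end()
    raw = obj[start:start + length]          # Length counts encoded bytes
    if re.search(rb"/Filter\s*/FlateDecode", obj):
        raw = zlib.decompress(raw)
    return raw

# Build a small stream object so the example is self-contained.
payload = zlib.compress(b"BT (Hello World!) Tj ET")
obj = (b"3 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(payload)
       + payload + b"\nendstream\nendobj\n")

print(extract_stream(obj))   # b'BT (Hello World!) Tj ET'
```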


The data in an object stream starts with N pairs of integers, where the first integer of each pair is the object number and the second integer is the offset of that object in the decoded stream, relative to the first object. Objects in object streams are stored one after another and do not need to be in ascending order of object number. The First entry in the dictionary gives the offset of the first object in the decoded stream.
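
The pair layout can be sketched in Python as follows. The function name `object_stream_index` and the sample data are invented for illustration; the code assumes the stream has already been decoded and that the offsets in the header are in ascending order.

```python
def object_stream_index(decoded: bytes, n: int, first: int):
    """Split a decoded object stream into {object number: object bytes}."""
    header = decoded[:first].split()
    pairs = [(int(header[2 * i]), int(header[2 * i + 1])) for i in range(n)]
    objects = {}
    for i, (obj_num, offset) in enumerate(pairs):
        # Each object ends where the next one starts (or at end of data).
        end = first + pairs[i + 1][1] if i + 1 < n else len(decoded)
        objects[obj_num] = decoded[first + offset:end].strip()
    return objects

# Two objects (numbers 14 and 9) stored at offsets 0 and 21 after the header.
header = b"14 0 9 21 "
body = b"<< /Type /Catalog >> [1 2 3]"
decoded = header + body

print(object_stream_index(decoded, n=2, first=len(header)))
# {14: b'<< /Type /Catalog >>', 9: b'[1 2 3]'}
```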

We should not store the following information in the object stream:

  • Stream objects
  • Objects with a generation number that is not equal to zero
  • The document’s encryption dictionary
  • An object representing the value of the Length entry in an object stream dictionary
  • The document catalog, linearization dictionary, and page objects


Starting with PDF 1.5, cross-reference information can be stored in a cross-reference stream instead of a cross-reference table. Each cross-reference stream contains information equivalent to a cross-reference table and a trailer.

Null object


A null object is represented by the “null” keyword.

Indirect objects


First of all, we need to know that any object in a PDF document can be labeled as an indirect object. This gives the object a unique object identifier that other objects can use to refer to the indirect object. An indirect object is a numbered object delimited by the “obj” and “endobj” keywords. The endobj keyword must be on its own line, while obj must appear at the end of the object ID line, which is the first line of the indirect object. The object ID line consists of the object number, the generation number and the keyword “obj”. An example of an indirect object is as follows:

2 1 obj
12345
endobj

In the example above, we create a new indirect object that holds the number 12345. Once we declare the object as an indirect object, it gets an entry in the cross-reference table of the PDF document and can be reused on any page, in any dictionary, and so on in the document. Because each indirect object has its own entry in the cross-reference table, indirect objects can be accessed very quickly.

The object identifier of an indirect object consists of two parts; the first part is the object number of the current indirect object. Indirect objects do not need to be numbered sequentially in the PDF document. The second part is the generation number, which is set to zero for all objects in the newly created file. This number will be incremented later when the objects are updated.

We can refer to indirect objects with an indirect reference that consists of the object number, the generation number and the R keyword. To refer to the above indirect object, we need to write something like below:

2 1 R


If we try to refer to an undefined object, we are actually referring to a null object.
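
The relationship between indirect objects and references can be sketched with two regular expressions. This Python example is deliberately naive: the names `OBJ`, `REF`, `index_objects` and `resolve` are illustrative, and the scan would be confused by stream data that happens to contain the endobj keyword.

```python
import re

OBJ = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.S)
REF = re.compile(rb"(\d+)\s+(\d+)\s+R\b")

def index_objects(data: bytes):
    """Map (object number, generation) -> raw object body."""
    return {(int(num), int(gen)): body.strip()
            for num, gen, body in OBJ.findall(data)}

def resolve(ref: bytes, objects):
    """Look up an indirect reference; undefined objects resolve to null."""
    match = REF.fullmatch(ref.strip())
    if match is None:
        raise ValueError("not an indirect reference")
    key = (int(match.group(1)), int(match.group(2)))
    return objects.get(key, b"null")

objects = index_objects(b"2 1 obj\n12345\nendobj\n")
print(resolve(b"2 1 R", objects))   # b'12345'
print(resolve(b"9 0 R", objects))   # b'null'
```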

Document structure


A PDF document consists of objects contained in the body part of a PDF file. Most objects in a PDF document are dictionaries. Each page of the document is represented by a page object; the page objects are connected together and form a page tree, which is declared with an indirect reference in the document catalog.

The whole structure of the PDF document can be represented with the picture below [1]:

Figure 7: Structure of the PDF document (source)

In the image above, we can see that the Document Catalog contains links to the Page Tree, Outline Hierarchy, Article Threads, Named Destinations, and an Interactive Form. We won’t go into detail about what each of these sections does; we’ll just introduce the most important one, the page tree.

Document catalog


From the image above, we can see that the document catalog is the root of the objects in the PDF document. We have already said that it is the /Root element in the Trailer PDF section that specifies the document catalog. The document catalog contains references to other objects that define the content of the document. It also contains information that declares how the document will be displayed on the screen. The entries in the document catalog are as follows:

  • /Type: The type of PDF object that the dictionary describes (in our case it’s Catalog, since it’s a document catalog object).
  • /Version: Version of the PDF specification under which the document was created.
  • /Extensions: Information about developer extensions in this document.
  • /Pages: An indirect reference to the object that is the root of the document’s page tree.
  • /Dests: An indirect reference to the object that is the root of the named destinations.
  • /Outlines: An indirect reference to the outline dictionary object that is the root of the document’s outline hierarchy.
  • /Threads: An indirect reference to an array of thread dictionaries that represent article threads in the document.
  • /Metadata: An indirect reference to the metadata stream that contains the metadata for the document.


There are many other items that we can see as part of the document catalog, but we will not describe them here. The reader can view the details in our resources. An example of a document catalog is shown below:
1 0 obj
<< /Type /Catalog
/Pages 2 0 R
/PageMode /UseOutlines
/Outlines 3 0 R
>>
endobj

Page tree


Document pages are accessed through the page tree, which defines all the pages in a PDF document. The tree contains nodes that represent pages of a PDF document, which can be of two types: intermediate and leaf nodes. Intermediate nodes are also called page tree nodes, while leaf nodes are called page objects.

The simplest page tree structure might consist of a single page tree node that directly references all page objects (so all page objects are leaves).

Each node in the page tree must contain the following items:

  • /Type: The type of PDF object that this object describes (in our case it’s Pages, since we’re talking about page tree nodes).
  • /Parent: Must be present in all nodes of the page tree except the root, where it must not appear. This entry specifies the node’s parent.
  • /Kids: Must be present in all page tree nodes except leaves, and specifies all child elements directly accessible from the current node.
  • /Count: Specifies the number of leaf nodes (page objects) that are descendants of this node in the page tree.


We must remember that the structure of the page tree is not related to the logical structure of the document, such as chapters or sections.

A basic example of a page tree is shown below:
2 0 obj
<< /Type /Pages
/Kids [ 4 0 R
10 0 R
24 0 R
]
/Count 3
>>
endobj
4 0 obj
<< /Type /Page
>>
endobj
10 0 obj
<< /Type /Page
>>
endobj
24 0 obj
<< /Type /Page
>>
endobj
The page tree above defines a root object with ID 2, which has three children, objects 4, 10, and 24. We can also see that the leaves of the page tree are dictionaries specifying the attributes of a single document page. There are several attributes that can be defined for each page of the document.
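
Walking a page tree is a simple recursion over the Kids entries. The Python sketch below models the example tree as plain dictionaries keyed by object number (2 for the root, 4, 10 and 24 for the pages, as in the listing); this representation and the function name `leaf_pages` are assumptions made for illustration, not how a PDF library stores objects.

```python
# The example page tree, with indirect references reduced to object numbers.
objects = {
    2: {"Type": "Pages", "Kids": [4, 10, 24], "Count": 3},
    4: {"Type": "Page"},
    10: {"Type": "Page"},
    24: {"Type": "Page"},
}

def leaf_pages(objects, node_id):
    """Return the object numbers of all page objects below node_id."""
    node = objects[node_id]
    if node["Type"] == "Page":
        return [node_id]              # a leaf: this is an actual page
    pages = []
    for kid in node["Kids"]:          # an intermediate node: recurse
        pages.extend(leaf_pages(objects, kid))
    return pages

pages = leaf_pages(objects, 2)
print(pages)                                   # [4, 10, 24]
print(len(pages) == objects[2]["Count"])       # True
```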

We have seen the basic structure of a PDF document and its data types. If we want to start looking for vulnerabilities in PDF readers, we need to change the PDF document so that the PDF reader can’t handle it and crashes. If we manage to get the PDF reader to crash, we may have discovered a security flaw that could be used to run arbitrary code on the target computer.

Example


In this article, we will look at a very simple example of a PDF document. First, we need to create a PDF document in order to analyze it. To create a PDF document, first create a very simple .tex document that contains what you can see in the image below:

Figure 8: Simple document

We can see that the .tex document doesn’t really contain much. We first define the document as an article and then place the content of the article between the begin and end document commands. We add a new section with the title “Introduction” and the static text “Hello World!”.

We can compile a .tex document into a PDF document using the pdflatex command and specifying the name of the .tex file as an argument. The resulting PDF then looks like this, as shown in the image below:

Figure 9: Result

We can see that the PDF document really doesn’t contain much, just the text we’ve actually included, and no images, JavaScript, or other elements.

Example 1


Let’s look at the PDF document structure shown in the output below:
%PDF-1.5
%ÐÔÅØ
3 0 obj <<
/Length 138
/Filter /FlateDecode
>>
stream
...
endstream
endobj
10 0 obj <<
/Length1 1526
/Length2 7193
/Length3 0
/Length 8194
/Filter /FlateDecode
>>
stream
...
endstream
endobj
12 0 obj <<
/Length1 1509
/Length2 9410
/Length3 0
/Length 10422
/Filter /FlateDecode
>>
stream
...
endstream
endobj
15 0 obj <<
/Producer (pdfTeX-1.40.12)
/Creator (TeX)
/CreationDate (D:20121012175007+02'00')
/ModDate (D:20121012175007+02'00')
/Trapped /False
/PTEX.Fullbanner (This is pdfTeX, version 3.1415926-2.3-1.40.12 (TeX Live 2011) kpathsea version 6.0.1)
>>
endobj
6 0 obj <<
/Type /ObjStm
/N 10
/First 65
/Length 761
/Filter /FlateDecode
>>
stream
...
endstream
endobj
16 0 obj <<
/Type /XRef
/Index [0 17]
/Size 17
/W [1 2 1]
/Root 14 0 R
/Info 15 0 R
/ID [<1DC2E3E09458C9B4BEC8B67F56B57B63> <1DC2E3E09458C9B4BEC8B67F56B57B63>]
/Length 60
/Filter /FlateDecode
>>
stream
...
endstream
endobj
startxref
20215
%%EOF
It takes quite a lot of elements to create such a simple PDF document, so we can imagine what a really complicated PDF would look like. We must also remember that all encoded data streams have been removed and replaced with three dots for clarity and brevity.

Let’s introduce the individual parts of the PDF. The header can be seen in the image below:

Figure 10: PDF header

The body can be seen in the picture below:

Figure 11: PDF body

The xref section can be seen in the picture below:

Figure 11: PDF xref

And last, the Trailer section is represented below:

Figure 12: PDF trailer

We have presented all the parts of the PDF document, but we still need to analyze them. The header of a PDF document is standard and we don’t really need to talk about it, and we’ll save the body part for later.

Therefore, we need to look at the xref section first. We see that the offset from the beginning of the file to the xref table is 20215 bytes, which is 0x4ef7 in hex. If we look at the hexadecimal representation of the file that we can get with the xxd tool, we can see what is shown in the image below:
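
The conversion between the decimal offset and its position in a hex dump is plain arithmetic, shown here as a short Python check. It assumes the xxd layout seen earlier in this article: 16 bytes per line and a seven-digit hexadecimal offset at the start of each line.

```python
offset = 20215

print(hex(offset))                  # 0x4ef7

# xxd prints 16 bytes per line, so split the offset into line start and column.
line, column = divmod(offset, 16)
print(f"{line * 16:07x}", column)   # 0004ef0 7
```

So the byte we are looking for is at column 7 (counting from 0) of the line that starts at 0004ef0.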

Figure 13: Hexadecimal representation of the file

The highlighted bytes lie exactly at offset 20215 from the beginning of the file. The preceding byte, 0x0a, is a newline, and the byte at the offset itself, 0x31, is the character “1”, which is exactly where the xref table begins. This is why the xref table is represented by an indirect object with an ID of 16 and a generation number of 0. (The generation number should be zero for all objects because we have just created the PDF document and none of the objects have been changed yet. Looking at the entire PDF document, we see that this is clearly true; all objects have a generation number of zero.)

The /Type of the indirect object classifies it as a cross-reference stream (XRef). The /Index array contains a pair of integers for each subsection in the stream. The first integer indicates the first object number in the subsection and the second integer indicates the number of entries in the subsection. In our example, the first object number is zero and there are 17 entries in this subsection. This matches the /Size entry, which is one greater than the largest object number in the file. The /W entry specifies an array of integers giving the size in bytes of each field in a cross-reference entry, meaning that the three fields are one byte, two bytes, and one byte long.
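
The effect of /W and /Index can be sketched in Python. Everything here is illustrative: the function name `decode_xref_stream` is made up, the three sample entries are invented rather than taken from the document above, and the code assumes the stream has no predictor parameters and no zero-width fields. In each decoded entry, the first field is the entry type (0 for free, 1 for an object stored at a byte offset, 2 for an object stored inside an object stream).

```python
import zlib

def decode_xref_stream(data: bytes, widths, index):
    """Decode fixed-width xref stream entries into {object number: fields}."""
    size = sum(widths)
    # /Index holds (first object number, count) pairs, one per subsection.
    numbers = [n for first, count in zip(index[0::2], index[1::2])
               for n in range(first, first + count)]
    entries = {}
    for i, obj_num in enumerate(numbers):
        row, pos, fields = data[i * size:(i + 1) * size], 0, []
        for width in widths:
            fields.append(int.from_bytes(row[pos:pos + width], "big"))
            pos += width
        entries[obj_num] = tuple(fields)
    return entries

# Three made-up entries using /W [1 2 1]: free, in use, inside an object stream.
raw = bytes([0, 0, 0, 255,
             1, 0, 15, 0,
             2, 0, 6, 3])
compressed = zlib.compress(raw)      # stands in for the FlateDecode data

print(decode_xref_stream(zlib.decompress(compressed), [1, 2, 1], [0, 3]))
# {0: (0, 0, 255), 1: (1, 15, 0), 2: (2, 6, 3)}
```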

Then there is the /Root entry, which specifies that the document catalog of the PDF document is object number 14. /Info points to the information dictionary of the PDF document, which is contained in object number 15. The /ID entry contains two strings that make up the file identifier. It is required only when an Encrypt entry is present, in which case the identifier is used as input to the encryption algorithm; this document has no Encrypt entry, so it is not encrypted.

/Length specifies the length of the stream data in bytes, as in every stream dictionary; in our case, the cross-reference stream data is 60 bytes long. /Filter specifies the name of the filter needed to decode the stream data. In our case, this is FlateDecode, which means the data is compressed using the zlib/deflate method.

We can see that the data of the cross-reference stream is compressed, so we can’t read it directly. Of course, we could run a zlib decompression routine on the compressed data, but there is a more convenient option. Why would we write a program for it when a tool already exists? Using pdftk, we can rewrite the PDF so that it contains a plain-text xref table with the following command:

pdftk in.pdf output out.pdf

Then the out.pdf file contains the following xref and trailer sections:

Figure 14: xref and trailer

Obviously the /Root and /Info object numbers and other things changed, but we got the trailer and xref keywords that define the xref table. We can see that there are 14 objects in the xref table.

We could go on and try to decode other sections as well, but that is beyond the scope of this article. Next, we look at a second sample document.

Example 2


Let’s take a look at the sample PDF document available here. Some stream objects are compressed, but they are not that important now. Since we already know how to handle PDF documents, we won’t waste too many words on simple things.

Let’s open that PDF in a text editor like gvim and look at the trailer section. Remember that all PDF documents should be read starting from the end of the file. The trailer is shown in the image below:

Figure 15: PDF trailer

Let’s also present the Xref with just a few objects (the rest of them were discarded for clarity):

Figure 16: PDF xref

We can see that the /Root of the PDF document is contained in the object with ID 221 and there is additional information in object 222. Object 221 is the most important object in the entire document, so let’s visualize it:

Figure 17: Object 221

We can see that the object is indeed a Document Catalog. The Page Tree object is 212, the Outlines object is 213, the Names object is 220, and the OpenAction object is 58. We haven’t talked about types other than the Page Tree object, so we’ll continue to talk about the Page Tree only.

The Page Tree object with ID 212 is shown in the figure below:

Figure 18: Page Tree object

Thus, object 212 contains the actual pages of the PDF document. It contains 10 pages, which is exactly right (we can verify this if we open the PDF file in any PDF reader and check the page count).

We know that the Kids attribute specifies all child elements directly accessible from the current node. In our case, there are two direct child nodes with object ID 66 and 135. Object 66 is shown below:

Figure 19: Object 66

Object 66 contains other child elements with ID 57, 69, 75, 97, 108 and 120.

Figure 20: Object 135

Object 135 further defines objects 129, 138, 133, and 158.

If we count all the elements, we see that there are exactly 10 (six under object 66 and four under object 135), which matches the 10 pages of the document. This means that all of these objects are actual pages of the PDF document and do not contain any further child nodes.

All the presented objects are declared similarly, so we will not look at each of them in turn. Instead, we look at one object, namely object 57. Object 57 is declared as follows:

Figure 21: Object 57

We can see that the object’s type is /Page, which directly implies that this is a leaf node that represents one of the pages of the PDF document. The contents of that PDF page can be found in object 62:

Figure 22: Object 62

We can see that the actual content of the PDF page is encoded using FlateDecode, which is just zlib/deflate compression.

Conclusion


We have seen two examples of how PDF documents are structured. With the knowledge we have gained, we can start generating malformed PDF documents and feeding them to different PDF readers. If a particular PDF reader fails to read a particular PDF document, the document contains something that the reader could not handle. This may indicate a vulnerability that should be investigated further.

Finally, if the vulnerability is proven to be present, we can even write a PDF document that contains malicious code that will be executed when the victim opens the PDF document using the vulnerable PDF reader on their target computer. In such cases, the entire computer can be compromised as any malicious code can be executed just by opening the malicious PDF document.

Sources

Vulnerability Statistics, CVE Details

Adobe Support Policies: Supported Product Versions, Adobe

Document management — Portable document format — Part 1: PDF 1.7, Adobe (Archive.org)

References:

[1]: The PDF File Format, accessible on: http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf.
