Inside Adobe PDF format

PDF (Portable Document Format) is a file format created by Adobe Systems in 1993 for exchanging documents between platforms. It is independent of application software, hardware, and operating system. Each file may contain the text, fonts, images, and vector graphics that compose the document. Formerly a proprietary format, PDF was released as an open standard (ISO 32000-1) in 2008.

A PDF file consists primarily of objects, of which there are eight types: 

  1. Boolean values, representing true or false
  2. Numbers
  3. Strings
  4. Names
  5. Arrays, ordered collections of objects
  6. Dictionaries, collections of objects indexed by Names
  7. Streams, usually containing large amounts of data
  8. The Null object
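
As a rough illustration of how these eight types look when written out, the following Python sketch maps Python values to simplified PDF syntax. The Name class and the serialize function are hypothetical helpers made up for this example. The sketch ignores details such as name escaping, hexadecimal strings, and very small floats that Python would print in exponent notation, and it treats a bytes value as a stream.

```python
class Name(str):
    """A PDF name such as /Type, stored here without the leading slash."""


def serialize(obj):
    """Render a Python value in PDF object syntax (simplified sketch)."""
    if obj is None:
        return "null"
    if isinstance(obj, bool):  # test bool before int: bool subclasses int
        return "true" if obj else "false"
    if isinstance(obj, (int, float)):
        return str(obj)
    if isinstance(obj, Name):  # test Name before str: Name subclasses str
        return "/" + obj
    if isinstance(obj, str):
        escaped = obj.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        return "(" + escaped + ")"
    if isinstance(obj, list):
        return "[" + " ".join(serialize(item) for item in obj) + "]"
    if isinstance(obj, dict):
        body = " ".join("/" + key + " " + serialize(value) for key, value in obj.items())
        return "<< " + body + " >>"
    if isinstance(obj, bytes):  # stream: a dictionary followed by raw data
        header = serialize({"Length": len(obj)})
        return header + "\nstream\n" + obj.decode("latin-1") + "\nendstream"
    raise TypeError(f"no PDF representation for {type(obj).__name__}")
```

For example, serialize({"Type": Name("Page"), "MediaBox": [0, 0, 612, 792]}) returns << /Type /Page /MediaBox [0 0 612 792] >>.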

 
Objects may be either direct (embedded in another object) or indirect. Indirect objects are numbered with an object number and a generation number. An index table called the xref table gives the byte offset of each indirect object from the start of the file.  
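
In the file, an indirect object is written as its object number and generation number, the keyword obj, the object itself, and the keyword endobj. The sketch below locates such objects with a regular expression and records the byte offset of each one, which is the value the xref table would store. The function name find_indirect_objects is made up for this example, and a plain regex scan like this is only an approximation: it can be confused by binary stream data that happens to contain the same keywords.

```python
import re

# "<object number> <generation number> obj ... endobj"
INDIRECT_OBJ = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def find_indirect_objects(data: bytes) -> dict:
    """Map (object number, generation) to (byte offset, raw object body)."""
    found = {}
    for match in INDIRECT_OBJ.finditer(data):
        key = (int(match.group(1)), int(match.group(2)))
        found[key] = (match.start(), match.group(3).strip())
    return found


sample = (
    b"%PDF-1.6\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"2 0 obj\n(Hello)\nendobj\n"
)
objects = find_indirect_objects(sample)
```

Here objects[(1, 0)] holds the offset 9 (the object starts right after the nine-byte header line) together with the dictionary text.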

Text:

Text in PDF is represented by text elements in page content streams. A text element specifies that characters should be drawn at certain positions. The characters are specified using the encoding of a selected font resource.  
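
To make this concrete, here is a small Python sketch that builds one text element using the standard content stream operators: BT and ET delimit the element, Tf selects a font and size, Td moves to a position, and Tj shows a string. The function name text_element and the font resource name F1 are assumptions for illustration; a real page must define that font in its resources.

```python
def text_element(text: str, font: str = "F1", size: int = 12,
                 x: int = 72, y: int = 720) -> str:
    """Return a content-stream fragment that draws `text` at (x, y)."""
    # Backslashes and parentheses must be escaped inside a literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return "\n".join([
        "BT",                    # begin text object
        f"/{font} {size} Tf",    # select font resource and size
        f"{x} {y} Td",           # move to the drawing position
        f"({escaped}) Tj",       # show the string
        "ET",                    # end text object
    ])
```

The codes inside the parentheses are interpreted through the encoding of the selected font, so the same bytes can produce different glyphs with a different font resource.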

Fonts:

A font object in PDF is a description of a digital typeface. It may either describe the characteristics of a typeface, or it may include an embedded font file. The latter case is called an embedded font while the former is called an unembedded font. The font files that may be embedded are based on widely used standard digital font formats: Type 1, TrueType, and OpenType. Additionally PDF supports the Type 3 variant in which the components of the font are described by PDF graphic operators.  
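
One way to tell the two cases apart, sketched below in Python, is to look at the font descriptor: the PDF specification stores an embedded font file under the key FontFile (Type 1), FontFile2 (TrueType), or FontFile3 (other formats, including OpenType). The font objects here are plain Python dictionaries invented for the example, and is_embedded is a hypothetical helper. Type 3 fonts carry their glyph descriptions in the font object itself, so this check does not apply to them.

```python
FONT_FILE_KEYS = ("FontFile", "FontFile2", "FontFile3")


def is_embedded(font: dict) -> bool:
    """Return True if the font's descriptor carries an embedded font file."""
    descriptor = font.get("FontDescriptor", {})
    return any(key in descriptor for key in FONT_FILE_KEYS)


# Illustrative font dictionaries (values are placeholders, not real font data).
embedded_font = {
    "Subtype": "TrueType",
    "FontDescriptor": {"FontName": "ABCDEF+Example", "FontFile2": b"..."},
}
unembedded_font = {"Subtype": "Type1", "BaseFont": "Helvetica"}
```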

Encodings:

Within text strings, characters are shown using character codes (integers) that map to glyphs in the current font using an encoding. There are a number of built-in encodings, including WinAnsi, MacRoman, and a large number of encodings for East Asian languages.  
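
The effect of the encoding can be seen with Python's built-in codecs. The sketch assumes that the cp1252 and mac_roman codecs are close enough to PDF's WinAnsi and MacRoman encodings for illustration; they are not identical at every code point. The helper names are made up for this example.

```python
def decode_code(code: int, encoding: str) -> str:
    """Return the character that a one-byte character code selects."""
    return bytes([code]).decode(encoding)


def encode_char(char: str, encoding: str) -> int:
    """Return the one-byte character code for a character."""
    return char.encode(encoding)[0]


# The same code selects different characters under different encodings.
win = decode_code(0xE9, "cp1252")      # e with acute accent
mac = decode_code(0xE9, "mac_roman")   # E with grave accent
```

This is why a string cannot be read correctly without knowing which encoding the current font uses: the code 0xE9 is an accented lowercase e in one encoding and an accented uppercase E in the other.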

Vector graphics:

Among its vector graphics features, PDF supports several types of patterns. The simplest is the tiling pattern, in which a piece of artwork is drawn repeatedly. It may be a colored tiling pattern, with the colors specified in the pattern object, or an uncolored tiling pattern, which defers color specification to the time the pattern is drawn.

Raster images:

Raster images in PDF (called Image XObjects) are represented by dictionaries with an associated stream. The dictionary describes properties of the image, and the stream contains the image data. Less commonly, a raster image may be embedded directly in a page description as an inline image. 
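
The dictionary-plus-stream layout can be sketched in Python as follows. The example builds an uncompressed 8-bit grayscale Image XObject; the function name image_xobject is hypothetical, and real files usually compress the stream and name a filter in the dictionary, which this sketch leaves out.

```python
def image_xobject(width: int, height: int, gray_pixels: bytes) -> bytes:
    """Build an uncompressed 8-bit grayscale Image XObject (dictionary + stream)."""
    if len(gray_pixels) != width * height:
        raise ValueError("expected exactly one byte per pixel")
    dictionary = (
        f"<< /Type /XObject /Subtype /Image "
        f"/Width {width} /Height {height} "
        f"/ColorSpace /DeviceGray /BitsPerComponent 8 "
        f"/Length {len(gray_pixels)} >>"
    )
    # The dictionary describes the image; the stream holds the pixel data.
    return dictionary.encode("ascii") + b"\nstream\n" + gray_pixels + b"\nendstream"


# A 2x2 checkerboard: black, white / white, black.
checker = image_xobject(2, 2, bytes([0, 255, 255, 0]))
```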
 

Under the hood

A PDF file has four main components: header, body, cross-reference (xref) table, and trailer.

The header contains just one line that identifies the version of PDF.

Example: %PDF-1.6  
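
Reading the version back out of the header is straightforward. A minimal Python sketch, with the function name pdf_version chosen for this example and the assumption that the file starts directly with the header line:

```python
import re


def pdf_version(data: bytes) -> tuple:
    """Return (major, minor) from a header line such as %PDF-1.6."""
    match = re.match(rb"%PDF-(\d)\.(\d)", data)
    if match is None:
        raise ValueError("data does not start with a PDF header")
    return int(match.group(1)), int(match.group(2))
```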

The trailer contains the trailer dictionary, which points to key objects such as the document catalog, followed by the startxref keyword and the byte offset of the xref table. It ends with %%EOF to mark the end of the file.
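
Because the pointer to the xref table sits at the very end of the file, a reader typically starts from the tail. The Python sketch below looks for the startxref keyword just before the final %%EOF and returns the offset that follows it. The function name find_startxref and the 1024-byte search window are assumptions for this example.

```python
def find_startxref(data: bytes) -> int:
    """Return the xref byte offset recorded just before the final %%EOF."""
    tail = data[-1024:]  # the trailer sits at the end of the file
    eof = tail.rfind(b"%%EOF")
    if eof == -1:
        raise ValueError("no %%EOF marker found")
    keyword = tail.rfind(b"startxref", 0, eof)
    if keyword == -1:
        raise ValueError("no startxref keyword found")
    # The offset is the first token between the keyword and %%EOF.
    return int(tail[keyword + len(b"startxref"):eof].split()[0])


sample_tail = b"trailer\n<< /Size 3 /Root 1 0 R >>\nstartxref\n116\n%%EOF\n"
```

For sample_tail, find_startxref returns 116.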

The xref table contains pointers to all the indirect objects in the PDF file. It identifies how many objects are in the table and, for each object, where it begins (the byte offset from the start of the file), its generation number, and whether the entry is in use or free.
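
In the classic table form, each subsection starts with the first object number and a count, and each entry holds a 10-digit byte offset, a 5-digit generation number, and the letter n (in use) or f (free). The Python sketch below parses such a table; the function name parse_xref is made up for this example, and it does not handle blank lines inside the table or the cross-reference streams that PDF 1.5 and later may use instead.

```python
def parse_xref(data: bytes, offset: int) -> dict:
    """Map object number to (byte offset, generation, in_use) for a classic xref table."""
    lines = data[offset:].splitlines()
    if not lines or lines[0].strip() != b"xref":
        raise ValueError("no xref keyword at the given offset")
    entries = {}
    i = 1
    while i < len(lines) and lines[i].strip() != b"trailer":
        # Subsection header: first object number and number of entries.
        first, count = map(int, lines[i].split())
        for n in range(count):
            byte_offset, generation, flag = lines[i + 1 + n].split()
            entries[first + n] = (int(byte_offset), int(generation), flag == b"n")
        i += 1 + count
    return entries


sample_table = (
    b"xref\n"
    b"0 3\n"
    b"0000000000 65535 f \n"
    b"0000000009 00000 n \n"
    b"0000000045 00000 n \n"
    b"trailer\n<< /Size 3 >>\n"
)
table = parse_xref(sample_table, 0)
```

Here table[1] is (9, 0, True): object 1 starts at byte 9, has generation 0, and is in use.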

The body contains the objects that make up the document: fonts, images, text, bookmarks, form fields, and so on.