ZIP Module

EXPath Candidate Module, 12 October 2010

Phil Fearon (Qutoric Limited), Florent Georges (H2O Consulting)

This proposal defines a set of XPath 2.0 extension functions to handle ZIP files. It defines one function to read the structure of a ZIP file, functions to read the content of individual entries, and functions to create brand-new ZIP files or to create ZIP files based on existing template files. It has been designed to be compatible with XQuery 1.0 and XSLT 2.0, as well as any other XPath 2.0 usage.

This revision is a first draft. It adds proposed function signatures and element descriptors to the 20090526 outline of the spec. The reference to the MarkLogic Zip Module has also been removed.

Introduction

This specification defines a set of functions to read and write the structure and content of ZIP files. It has been designed as a general ZIP tool set for XPath, and it is expected to be particularly useful with document package formats based on XML and ZIP, such as EPUB, Office Open XML, and OpenDocument.

Namespace conventions

The module defined by this document defines functions and elements in the namespace http://expath.org/ns/zip. In this document, the zip prefix, when used, is bound to this namespace URI.

Error codes are defined in the namespace http://expath.org/ns/error. In this document, the err prefix, when used, is bound to this namespace URI.

Error management

Error conditions are identified by a code (a QName). When such an error condition is reached during the evaluation of an expression, a dynamic error is thrown with the corresponding error code, as if the standard XPath function fn:error had been called. TODO: Codes have not been defined yet.

What is a ZIP file?

A ZIP file is a file, identified by a URI, that contains a set of entries. An entry is either a directory (containing other entries) or a file entry (carrying actual content). The entries are organized as a tree in which files are leaf nodes and directories contain other entries. This hierarchy is the structure of the ZIP file. Each entry has a name that is unique among its siblings, and a particular entry can be identified by a path that starts at the ZIP file level and descends to the entry through each intermediate directory, with the names separated by a slash character '/'.

For instance, the following shows the structure of a ZIP file containing one file entry with the name README and one directory with the name dir. This directory contains two files, named content.txt and content.html. The path for the latter entry is dir/content.html:

    README
    dir/
        content.txt
        content.html
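As a non-normative illustration, the relationship between entry paths and the tree they describe can be sketched in Python. The function name build_tree and the nested-dict representation are assumptions made for this sketch; they are not part of the module.

```python
def build_tree(paths):
    """Turn slash-separated entry paths into a nested dict.

    Directories map to dicts; file entries map to None.
    """
    root = {}
    for path in paths:
        is_dir = path.endswith("/")
        parts = [p for p in path.split("/") if p]
        node = root
        # Walk (and create) every intermediate directory.
        for name in parts[:-1]:
            node = node.setdefault(name, {})
        leaf = parts[-1]
        if is_dir:
            node.setdefault(leaf, {})
        else:
            node[leaf] = None
    return root


tree = build_tree(["README", "dir/", "dir/content.txt", "dir/content.html"])
print(tree)
```

Because names are unique among siblings, each path selects exactly one node of the resulting tree.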

Entry extraction

These functions extract a specific entry from a ZIP file. Because ZIP files do not carry the type of their entries, and because we do not want to get all entries as plain binary items, there are several functions, one for each of the following types: binary, HTML, text and XML.

TODO: It has been suggested on the EXPath mailing list that these functions (excluding binary-entry) take an additional $encoding argument.

zip:binary-entry

zip:binary-entry($href as xs:anyURI, $entry as xs:string) as xs:base64Binary

Extracts the binary stream from the file entry at the path $entry within the ZIP file identified by $href and returns it as an xs:base64Binary item.
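The behaviour described above can be approximated with Python's standard zipfile and base64 modules. This is only a sketch under stated assumptions: the function name binary_entry is hypothetical, and an in-memory byte string stands in for the resource identified by $href.

```python
import base64
import io
import zipfile


def binary_entry(zip_bytes, entry):
    """Return the raw bytes of `entry`, Base64-encoded as ASCII text."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        raw = archive.read(entry)
    return base64.b64encode(raw).decode("ascii")


# Build a small in-memory ZIP so the sketch is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("dir/data.bin", b"\x00\x01\x02\xff")

encoded = binary_entry(buffer.getvalue(), "dir/data.bin")
print(encoded)
```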

zip:html-entry

zip:html-entry($href as xs:anyURI, $entry as xs:string) as document-node()

Extracts the HTML file at the path $entry within the ZIP file identified by $href and returns a document node. Because an HTML document is not necessarily a well-formed XML document, an implementation may use a specific parser, like TagSoup or HTML Tidy, in order to produce an XDM document node; the details of this process are implementation-defined.

zip:text-entry

zip:text-entry($href as xs:anyURI, $entry as xs:string) as xs:string

Extracts the contents of the text file at the path $entry within the ZIP file identified by $href and returns it as a string.

zip:xml-entry

zip:xml-entry($href as xs:anyURI, $entry as xs:string) as document-node()

Extracts the content of the XML file at the path $entry within the ZIP file identified by $href and returns it as a document node.
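For comparison, text and XML extraction might look as follows in Python. The function names text_entry and xml_entry are hypothetical, an open zipfile.ZipFile object stands in for $href, and the encoding parameter with its UTF-8 default is an assumption, since this draft does not yet say how the text encoding is determined.

```python
import io
import xml.etree.ElementTree as ET
import zipfile


def text_entry(archive, entry, encoding="utf-8"):
    """Return the entry decoded as a string (the encoding is an assumption)."""
    return archive.read(entry).decode(encoding)


def xml_entry(archive, entry):
    """Parse the entry as XML and return the root element."""
    return ET.fromstring(archive.read(entry))


# Build a small in-memory ZIP so the sketch is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("README", "Hello, ZIP!")
    archive.writestr("dir/content.xml", "<doc><title>Example</title></doc>")

with zipfile.ZipFile(buffer) as archive:
    readme = text_entry(archive, "README")
    root = xml_entry(archive, "dir/content.xml")

print(readme)
print(root.tag, root.findtext("title"))
```

The same bytes yield different results depending on which function is called, which is why the caller, not the ZIP file, chooses the type.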

ZIP File Handling

These functions return information about the structure of a ZIP file, and create or modify ZIP files. They use a set of XML elements to define ZIP file structure and content (see XML representation of ZIP files).

zip:entries

zip:entries($href as xs:anyURI) as element(zip:file)

Returns a zip:file element that describes the hierarchical structure of the ZIP file identified by $href in terms of directories and file entries. This is ZIP file metadata only; entry content must not be returned.
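A rough Python sketch of such a structural description is given below. It walks the entry names of an archive and builds zip:file, zip:dir and zip:entry elements in the http://expath.org/ns/zip namespace. The function name entries and the use of an open zipfile.ZipFile object in place of $href are assumptions of the sketch, and only the name attribute is emitted.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

ZIP_NS = "http://expath.org/ns/zip"
ET.register_namespace("zip", ZIP_NS)


def entries(archive, href):
    """Describe the structure of `archive` as a zip:file element (no content)."""
    root = ET.Element(f"{{{ZIP_NS}}}file", {"href": href})
    dirs = {(): root}  # maps a directory path (tuple of names) to its element

    def dir_element(parts):
        # Create missing zip:dir ancestors on demand.
        if parts not in dirs:
            parent = dir_element(parts[:-1])
            dirs[parts] = ET.SubElement(
                parent, f"{{{ZIP_NS}}}dir", {"name": parts[-1]}
            )
        return dirs[parts]

    for name in archive.namelist():
        parts = tuple(p for p in name.split("/") if p)
        if name.endswith("/"):
            dir_element(parts)
        else:
            ET.SubElement(
                dir_element(parts[:-1]), f"{{{ZIP_NS}}}entry", {"name": parts[-1]}
            )
    return root


# Build a small in-memory ZIP so the sketch is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("README", "read me")
    archive.writestr("dir/content.txt", "text")
    archive.writestr("dir/content.html", "<p>html</p>")

with zipfile.ZipFile(buffer) as archive:
    description = entries(archive, "example.zip")

print(ET.tostring(description, encoding="unicode"))
```

Note that the sketch never reads entry data, in line with the requirement that only metadata is returned.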

zip:zip-file

zip:zip-file($zip as element(zip:file)) as empty-sequence()

Creates a new ZIP file with the characteristics described by the $zip element passed as the argument.

zip:update-entries

zip:update-entries($zip as element(zip:file), $output as xs:anyURI) as empty-sequence()

Creates a modified copy of an existing ZIP file, applying the characteristics described by the elements within the $zip element. The $output argument is the URI to which the modified ZIP file is written.
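The copy-then-modify behaviour can be sketched in Python as follows. The function name update_entries mirrors the function above, but its signature is an assumption: a plain mapping from entry path to bytes stands in for the zip:file description, and byte strings stand in for the source and $output resources.

```python
import io
import zipfile


def update_entries(source_bytes, replacements):
    """Return a modified copy of a ZIP; the source bytes are left unchanged.

    `replacements` maps entry paths to new content (bytes).
    """
    output = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(source_bytes)) as source:
        with zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as target:
            for info in source.infolist():
                if info.filename not in replacements:
                    # Copy untouched entries, keeping their compression type.
                    target.writestr(
                        info.filename,
                        source.read(info.filename),
                        compress_type=info.compress_type,
                    )
            for name, data in replacements.items():
                target.writestr(name, data)
    return output.getvalue()


# A template archive with two entries; only README is replaced.
template = io.BytesIO()
with zipfile.ZipFile(template, "w") as archive:
    archive.writestr("README", "old text")
    archive.writestr("dir/content.txt", "unchanged")

updated = update_entries(template.getvalue(), {"README": b"new text"})
with zipfile.ZipFile(io.BytesIO(updated)) as archive:
    print(sorted(archive.namelist()))
    print(archive.read("README"))
```

Writing to a separate output, rather than editing in place, leaves the template file reusable.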

XML representation of ZIP files

The functions used for ZIP file handling (see ZIP File Handling) all use a top-level XML element, named zip:file. This element, together with two further descendant elements, zip:dir and zip:entry, describes the ZIP file of interest.

In the case of the zip:zip-file and zip:update-entries functions, the elements describe both the ZIP file structure and entry content. The zip:entries function, however, only describes the ZIP file structure, not the content; for this reason, certain element attributes are not permitted with that function. These are highlighted in the descriptions below.

The zip:file Element

This is the container element for further elements describing the directory structure and entry contents of a ZIP file.

<zip:file
  href = uri>
  zip:dir* zip:entry*
</zip:file>

href is the URI of the ZIP file. The base URI of this attribute is also used to resolve any relative URIs provided as src attributes in descendant elements.

The zip:dir Element

This element represents a directory within the ZIP file. Its position within the zip:file element tree corresponds directly to the location of the directory within the hierarchy of the ZIP file.

<zip:dir
  name? = string
  src? = uri>
  zip:dir* zip:entry*
</zip:dir>

name is the name of the directory within the ZIP file. If name is omitted then the directory is named from the basename given in the src attribute.

The src attribute can be used only by the zip:zip-file and zip:update-entries functions. It gives the URI of a directory which will be copied, with all its contents, into the corresponding ZIP file directory.

zip:dir and zip:entry child elements are used within a zip:dir element to define the structure and contents of the corresponding ZIP directory. These elements are used in the case where the zip:dir element has no src attribute.

The zip:entry Element

This element represents a file within the referenced ZIP file. The position of this element within the element tree corresponds to the location of the entry within the directory hierarchy of the ZIP file.

<zip:entry
  name? = string
  src? = uri
  compressed? = "yes" | "no"
  method? = "xml" | "html" | "xhtml" | "text" | "base64" | "hex" | qname-but-not-ncname
  byte-order-mark? = "yes" | "no"
  cdata-section-elements? = qnames
  doctype-public? = string
  doctype-system? = string
  encoding? = string
  escape-uri-attributes? = "yes" | "no"
  indent? = "yes" | "no"
  normalization-form? = "NFC" | "NFD" | "NFKC" | "NFKD" | "fully-normalized" | "none" | nmtoken
  omit-xml-declaration? = "yes" | "no"
  standalone? = "yes" | "no" | "omit"
  suppress-indentation? = qnames
  undeclare-prefixes? = "yes" | "no"
  output-version? = nmtoken>
  any*
</zip:entry>

name is the name of the entry within the ZIP file. If no name is included then the src attribute must be used, and the basename of its URI value is then used as the ZIP entry name.
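The naming rule above might be sketched in Python as follows. The function name entry_name is hypothetical, and two details are assumptions not stated in this draft: that the basename is taken from the path component of the URI, and that percent-escapes in it are decoded.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def entry_name(name, src):
    """Return `name` if given, otherwise the basename of the `src` URI path."""
    if name:
        return name
    if not src:
        raise ValueError("either name or src is required")
    # Ignore a trailing slash so directory URIs also yield a basename.
    path = urlsplit(src).path.rstrip("/")
    return unquote(posixpath.basename(path))


print(entry_name(None, "file:///tmp/pkg/content.xml"))
print(entry_name("renamed.xml", "file:///tmp/pkg/content.xml"))
```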

The src attribute can be used only by the zip:zip-file and zip:update-entries functions. It gives the URI of a file to copy into the ZIP file entry.

compressed indicates whether the entry is compressed within the ZIP file (certain entries, for example JPEG files, are not normally compressed). If the attribute is absent, the entry is compressed.
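In terms of Python's zipfile module, the compressed attribute could map onto the per-entry compression type. The helper name compress_type is hypothetical, and the choice of Deflate as the compression method is an assumption, since this draft does not name one.

```python
import io
import zipfile


def compress_type(compressed):
    """Map the compressed attribute ("yes", "no", or None if absent)
    to a zipfile constant; an absent attribute means compressed."""
    return zipfile.ZIP_STORED if compressed == "no" else zipfile.ZIP_DEFLATED


buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("text.txt", b"a" * 1000, compress_type=compress_type(None))
    archive.writestr("image.jpg", b"a" * 1000, compress_type=compress_type("no"))

with zipfile.ZipFile(buffer) as archive:
    sizes = {i.filename: (i.file_size, i.compress_size) for i in archive.infolist()}

# Each value is (uncompressed size, size stored in the archive).
print(sizes)
```

A stored entry occupies its full size in the archive, which is the sensible choice for data that is already compressed.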

In the case of the zip:zip-file and zip:update-entries functions: if no src attribute is provided, the children of the zip:entry element are serialized as the new ZIP file entry, in accordance with the options declared by the serialization attributes.
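To make the serialization of entry children concrete, here is a much-simplified Python sketch that builds an archive from a zip:file description. It is an illustration only: the function names are hypothetical, the archive is built in memory rather than at the href URI, src is not supported, and only two cases are handled (method="text", and XML serialization of the first child element in UTF-8).

```python
import io
import xml.etree.ElementTree as ET
import zipfile

ZIP_NS = "{http://expath.org/ns/zip}"


def write_children(archive, element, prefix=""):
    """Recursively write zip:dir and zip:entry children of `element`."""
    for child in element:
        name = prefix + child.get("name")
        if child.tag == ZIP_NS + "dir":
            write_children(archive, child, name + "/")
        elif child.tag == ZIP_NS + "entry":
            if child.get("method") == "text":
                data = (child.text or "").encode("utf-8")
            else:
                # Simplification: serialize the first child element as XML.
                data = ET.tostring(child[0], encoding="utf-8")
            archive.writestr(name, data)


def zip_file(description):
    """Build a ZIP archive in memory from a zip:file element."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        write_children(archive, description)
    return buffer.getvalue()


description = ET.fromstring(
    '<zip:file xmlns:zip="http://expath.org/ns/zip" href="example.zip">'
    '<zip:entry name="README" method="text">Hello</zip:entry>'
    '<zip:dir name="dir">'
    '<zip:entry name="content.xml"><doc>body</doc></zip:entry>'
    "</zip:dir>"
    "</zip:file>"
)
data = zip_file(description)
with zipfile.ZipFile(io.BytesIO(data)) as archive:
    print(archive.namelist())
```

The nesting of zip:dir elements in the description becomes the slash-separated path of each entry in the archive.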

All further attributes for zip:entry (i.e. all attributes other than name, src and compressed) are serialization attributes. Each one sets the corresponding serialization parameter defined in XSLT 2.0 and XQuery 1.0 Serialization, as defined for the XPath 2.1 function fn:serialize().

For the method attribute, the EXPath ZIP specification differs from XPath 2.1 in that it defines two additional values, 'base64' and 'hex'. These values indicate that the entry content is binary data encoded in Base64 or hexadecimal respectively, and that it is to be decoded and then saved as a binary file entry.
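The decoding step for these two values could look like this in Python. The function name decode_binary is hypothetical, and ignoring whitespace inside the encoded text is an assumption of the sketch.

```python
import base64
import binascii


def decode_binary(method, text):
    """Decode entry text for the 'base64' or 'hex' method into raw bytes."""
    compact = "".join(text.split())  # assumption: whitespace is ignored
    if method == "base64":
        return base64.b64decode(compact, validate=True)
    if method == "hex":
        return binascii.unhexlify(compact)
    raise ValueError(f"not a binary method: {method}")


print(decode_binary("base64", "AAEC/w=="))
print(decode_binary("hex", "0001 02ff"))
```

Both calls decode to the same four bytes, which would then be written to the entry unchanged.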

References

TagSoup - Just Keep On Truckin'. John Cowan.

HTML Tidy Library Project. SourceForge project.

EPUB set of specifications. International Digital Publishing Forum.

Office Open XML. Microsoft Corporation.

OpenDocument format (ODF). The Organization for the Advancement of Structured Information Standards (OASIS).

XSLT 2.0 and XQuery 1.0 Serialization. Scott Boag, Michael Kay, Joanne Tong, Norman Walsh, and Henry Zongaro, editors. W3C Recommendation. 23 January 2007.

XPath and XQuery Functions and Operators 1.1. Michael Kay, editor. W3C Working Draft. 15 January 2009.