This document is also available in these non-normative formats: XML.
This proposal defines a set of XPath 2.0 extension functions to handle ZIP files. It defines one function to read the structure of a ZIP file, functions to read the content of individual entries, and functions to create brand-new ZIP files or to create ZIP files based on existing template files. It has been designed to be compatible with XQuery 1.0 and XSLT 2.0, as well as any other XPath 2.0 usage.
1 Introduction
1.1 Namespace conventions
1.2 Error management
1.3 What is a ZIP file?
2 Entry extraction
2.1 zip:binary-entry
2.2 zip:html-entry
2.3 zip:text-entry
2.4 zip:xml-entry
3 ZIP File Handling
3.1 zip:entries
3.2 zip:zip-file
3.3 zip:update-entries
4 XML representation of ZIP files
4.1 The zip:file Element
4.2 The zip:dir Element
4.3 The zip:entry Element
This specification defines a set of functions to read and write the structure and content of ZIP files. It has been designed as a general ZIP tool set for XPath, and is expected to be particularly useful with document package formats based on XML and ZIP, such as [EPUB], [Open XML], and [OpenDocument].
The module defined by this document defines functions and elements in the namespace http://expath.org/ns/zip. In this document, the zip prefix, when used, is bound to this namespace URI.
Error codes are defined in the namespace http://expath.org/ns/error. In this document, the err prefix, when used, is bound to this namespace URI.
Error conditions are identified by a code (a QName). When such an error condition is reached in the evaluation of an expression, a dynamic error is thrown, with the corresponding error code (as if the standard XPath function error had been called). TODO: Codes have not been defined yet.
A ZIP file is a file, identified by a URI, that contains a set of entries organized as a tree. An entry is either a directory (containing other entries) or a file entry (carrying actual content). File entries are the leaf nodes of the tree, and directories contain other entries. This hierarchy is the structure of the ZIP file. Each entry has a name that is unique among its siblings, and a particular entry can be identified by a path that starts at the ZIP file level and descends to the entry through each intermediate directory, with the names separated by a slash character '/'.
For instance, the following shows the structure of a ZIP file containing one file entry with the name README and one directory with the name dir. This directory contains two files, named content.txt and content.html. The path for the latter entry is dir/content.html:
README
dir/
    content.txt
    content.html
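As a rough illustration of this structure, the following Python sketch builds such a ZIP file in memory and lists the path of each entry. Python's standard zipfile module is used purely for demonstration and is not part of this specification; the helper names build_example_zip and entry_paths, and the placeholder entry contents, are assumptions made for the example.

```python
import io
import zipfile


def build_example_zip():
    """Build, in memory, a ZIP file with the structure described above."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("README", "placeholder content")
        # A name ending in '/' is stored as an explicit directory entry.
        archive.writestr("dir/", "")
        archive.writestr("dir/content.txt", "placeholder text")
        archive.writestr("dir/content.html", "<p>placeholder</p>")
    return buffer.getvalue()


def entry_paths(zip_bytes):
    """Return the slash-separated path of every entry, in archive order."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return archive.namelist()


if __name__ == "__main__":
    for path in entry_paths(build_example_zip()):
        print(path)
```

Directory entries carry a trailing slash in this listing, which is how the ZIP format itself distinguishes them from file entries.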
These functions extract a specific entry from a ZIP file. Because ZIP files do not record the type of their entries, and because it is undesirable to get every entry as a plain binary item, there are several functions, one for each type among binary, HTML, text and XML.
Note:
TODO: It has been suggested on the EXPath mailing list that these functions (excluding binary-entry) take an additional $encoding argument
zip:binary-entry
zip:binary-entry($href as xs:anyURI, $entry as xs:string) as xs:base64Binary
Extracts the binary stream from the file positioned at $entry within the ZIP file identified by $href and returns it as a Base64 item.
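The behaviour described above can be approximated outside XPath. The following Python sketch is only an analogue, not an implementation of zip:binary-entry: the function name binary_entry is hypothetical, and it takes a file path or file-like object instead of resolving a URI.

```python
import base64
import io
import zipfile


def binary_entry(zip_source, entry):
    """Return the content of `entry` as Base64 text.

    `zip_source` is a path or a binary file-like object (an assumption of
    this sketch; the specification identifies the ZIP file by URI).
    """
    with zipfile.ZipFile(zip_source) as archive:
        raw = archive.read(entry)  # raises KeyError if the entry is absent
    return base64.b64encode(raw).decode("ascii")


def make_sample_zip():
    """Build a small in-memory ZIP file used only for the demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("dir/content.txt", b"hello")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(binary_entry(make_sample_zip(), "dir/content.txt"))
```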
zip:html-entry
zip:html-entry($href as xs:anyURI, $entry as xs:string) as document-node()
Extracts the HTML file positioned at $entry within the ZIP file identified by $href, and returns a document node. Because an HTML document is not necessarily a well-formed XML document, an implementation may use a specific parser in order to produce an XDM document node, like [TagSoup] or [HTML Tidy]; the details of this process are implementation-defined.
These functions return information about ZIP file structure, and create or modify ZIP files. They use a set of XML elements to define ZIP file structure and content (see 4 XML representation of ZIP files).
zip:entries
zip:entries($href as xs:anyURI) as element(zip:file)
Returns a zip:file element that describes the hierarchical structure of the ZIP file identified by $href in terms of directories and ZIP entries. This is ZIP file metadata only; content must not be returned.
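To make the shape of the result more concrete, here is a hedged Python sketch that derives a zip:file element tree from the entry names of a ZIP file. It is an illustration under stated assumptions, not a conforming implementation: the function name entries is hypothetical, the ZIP file is passed as a path or file-like object and href is only recorded as given, and children are emitted in archive order (the ordering of zip:dir and zip:entry children is not settled by this sketch).

```python
import io
import zipfile
import xml.etree.ElementTree as ET

ZIP_NS = "http://expath.org/ns/zip"


def entries(zip_source, href):
    """Describe the structure (not the content) of a ZIP file as XML."""
    ET.register_namespace("zip", ZIP_NS)
    root = ET.Element("{%s}file" % ZIP_NS, {"href": href})
    dirs = {"": root}  # maps a directory path to its element

    def dir_element(path):
        # Create zip:dir elements on demand, including missing ancestors.
        if path not in dirs:
            parent_path, _, name = path.rpartition("/")
            parent = dir_element(parent_path)
            dirs[path] = ET.SubElement(parent, "{%s}dir" % ZIP_NS, {"name": name})
        return dirs[path]

    with zipfile.ZipFile(zip_source) as archive:
        for full_name in archive.namelist():
            if full_name.endswith("/"):
                dir_element(full_name.rstrip("/"))
            else:
                parent_path, _, name = full_name.rpartition("/")
                ET.SubElement(
                    dir_element(parent_path), "{%s}entry" % ZIP_NS, {"name": name}
                )
    return root


if __name__ == "__main__":
    sample = io.BytesIO()
    with zipfile.ZipFile(sample, "w") as archive:
        archive.writestr("README", "placeholder")
        archive.writestr("dir/content.txt", "placeholder")
    print(ET.tostring(entries(sample, "example.zip"), encoding="unicode"))
```

Only names are read from the archive, which mirrors the requirement that the result is metadata and carries no entry content.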
The functions used for ZIP file handling (see 3 ZIP File Handling) all use a top-level XML element, named zip:file. This element, along with two further descendant elements, zip:dir and zip:entry, describes the ZIP file of interest.
In the case of the zip:zip-file and zip:update-entries functions, the elements describe both the ZIP file structure and entry content. The zip:entries function, however, describes only the ZIP file structure, not the content; for this reason, certain element attributes are not permitted with that function. These are highlighted in the descriptions below.
The zip:file Element
This is the container element for further elements describing the directory structure and entry contents of a ZIP file.
<zip:file
  href = uri>
  zip:dir*
  zip:entry*
</zip:file>
href is the URI of the ZIP file. The base URI of this is also used to resolve any relative URIs provided as src attributes in descendant elements.
The zip:dir Element
This element represents a directory within the ZIP file; its position within the zip:file element tree corresponds directly to the location of the directory within the hierarchy of the ZIP file.
<zip:dir
  name? = string
  src? = uri>
  zip:dir*
  zip:entry*
</zip:dir>
name is the name of the directory within the ZIP file. If name is omitted then the directory is named from the basename given in the src attribute.
The src attribute can be used only by the zip:zip-file and zip:update-entries functions. It gives the URI of a directory which will be copied, with all its contents, into the corresponding ZIP file directory.
zip:dir and zip:entry child elements are used within a zip:dir element to define the structure and contents of the corresponding ZIP directory. These elements are used in the case where no src attribute of zip:dir is used.
The zip:entry Element
This element represents a file within the referenced ZIP file; the position of this element within the element tree corresponds to the location of the entry within the directory hierarchy of the ZIP file.
<zip:entry
  name? = string
  src? = uri
  compressed? = "yes" | "no"
  method? = "xml" | "html" | "xhtml" | "text" | "base64" | "hex" | qname-but-not-ncname
  byte-order-mark? = "yes" | "no"
  cdata-section-elements? = qnames
  doctype-public? = string
  doctype-system? = string
  encoding? = string
  escape-uri-attributes? = "yes" | "no"
  indent? = "yes" | "no"
  normalization-form? = "NFC" | "NFD" | "NFKC" | "NFKD" | "fully-normalized" | "none" | nmtoken
  omit-xml-declaration? = "yes" | "no"
  standalone? = "yes" | "no" | "omit"
  suppress-indentation? = qnames
  undeclare-prefixes? = "yes" | "no"
  output-version? = nmtoken>
  any*
</zip:entry>
name is the name of the entry within the ZIP file. If no name is included then the src attribute must be used, and the basename of its URI value is then used as the ZIP entry name.
The src attribute can be used only by the zip:zip-file and zip:update-entries functions. It gives the URI of a file to copy into the ZIP file entry.
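The interplay between name and src can be sketched as follows. This Python fragment is illustrative only: the function name entry_name is hypothetical, and taking the basename as the last segment of the URI path, without percent-decoding, is an assumption of this sketch.

```python
import posixpath
from urllib.parse import urlsplit


def entry_name(name=None, src=None):
    """Pick the ZIP entry name from the name and src attribute values."""
    if name is not None:
        return name
    if src is None:
        # Without a name, the src attribute must be used.
        raise ValueError("src is required when name is omitted")
    # Assumption: the basename is the last segment of the URI path.
    return posixpath.basename(urlsplit(src).path)


if __name__ == "__main__":
    print(entry_name(src="file:/data/images/logo.png"))
    print(entry_name(name="cover.png", src="file:/data/images/logo.png"))
```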
compressed indicates whether the entry is compressed within the ZIP file (certain entries, for example JPEG files, are not normally compressed). A missing attribute indicates that the entry is compressed.
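One way to picture the effect of compressed when writing a ZIP file is the Python sketch below. The helper names compression_for and write_entry are hypothetical, and mapping "compressed" to the DEFLATE method is an assumption of this sketch; the text above does not name a compression method.

```python
import io
import zipfile


def compression_for(compressed=None):
    """Map a compressed attribute value ("yes", "no", or None when absent)."""
    if compressed is None or compressed == "yes":
        # A missing attribute means the entry is compressed.
        return zipfile.ZIP_DEFLATED  # assumption: DEFLATE is the method used
    if compressed == "no":
        return zipfile.ZIP_STORED
    raise ValueError("compressed must be 'yes' or 'no'")


def write_entry(archive, name, data, compressed=None):
    """Write one entry, stored or compressed as requested."""
    archive.writestr(name, data, compress_type=compression_for(compressed))


if __name__ == "__main__":
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        write_entry(archive, "notes.txt", b"a" * 100)
        write_entry(archive, "photo.jpg", b"\xff\xd8\xff", compressed="no")
    with zipfile.ZipFile(buffer) as archive:
        for info in archive.infolist():
            print(info.filename, info.compress_type)
```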
In the case of the zip:zip-file and zip:update-entries functions: if no src attribute is provided, the children of the zip:entry element are serialized as the new ZIP file entry, in accordance with the options declared by the serialization attributes.
[Definition: All further attributes for zip:entry (i.e. all attributes excluding name, src and compressed) are serialization attributes, used to set the corresponding serialization parameter defined in [Serialization], as defined for the XPath 2.1 function fn:serialize() [F&O 1.1].]
Note:
For the method attribute, the EXPath ZIP specification differs from XPath 2.1 in that it defines additional values, 'base64' and 'hex'. These values indicate that content in the respective binary encoding is to be decoded and then saved as a binary file entry.
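The decoding step implied by these two values might look like the following Python sketch. The function name decode_binary_content is hypothetical, and ignoring whitespace in the encoded text is an assumption of this sketch, not a rule stated by this specification.

```python
import base64
import binascii


def decode_binary_content(text, method):
    """Decode the text content of an entry into the bytes to be saved."""
    # Assumption: whitespace (such as line breaks) in the text is ignored.
    compact = "".join(text.split())
    if method == "base64":
        return base64.b64decode(compact, validate=True)
    if method == "hex":
        return binascii.unhexlify(compact)
    raise ValueError("method must be 'base64' or 'hex' for binary content")


if __name__ == "__main__":
    print(decode_binary_content("aGVs\nbG8=", "base64"))
    print(decode_binary_content("68 65 6c 6c 6f", "hex"))
```

Both calls in the demonstration decode to the same five bytes, the ASCII text "hello".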