W3C

Archive Module

EXPath Candidate Module 30 September 2013

This version:
http://expath.org/spec/archive/20130930
Latest version:
http://expath.org/spec/archive
Previous version:
http://expath.org/spec/zip/20101012
Editor:
John Lumley, Saxonica Ltd <john@saxonica.com>
Contributors:
Christian Grün, BaseX GmbH <christian.gruen@gmail.com>
Matthias Brantner, 28msec GmbH <matthias.brantner@28msec.com>
Florent Georges, H2O Consulting

This document is also available in these non-normative formats: XML.


Abstract

This proposal provides an API for XPath 2.0 and XPath 3.0 to handle archive data (i.e. collected and possibly compressed sets of files and directories). It defines extension functions to process data from and to such archive files, including creation, determining and setting properties, listing and extracting contents, and adding and updating entries. It has been designed to be compatible with XQuery 1.0 and XSLT 2.0, as well as any other XPath 2.0 usage. Some additional features for use in XPath 3.0 are also defined.

Table of Contents

1 Status of this document
2 Introduction
    2.1 Namespace conventions
    2.2 Error management
    2.3 Archive representation
    2.4 Archive types
3 Use cases
    3.1 Creating a simple EPUB document
    3.2 Examining a JAR file
    3.3 Extracting a ZIP archive to a file system
4 Describing archives and entries
    4.1 Archive properties and options
    4.2 Entry descriptions
    4.3 Using map types to describe entries and options
5 Loading and saving archives
6 Information about an archive and its contents
    6.1 arch:options
    6.2 arch:entries
7 Extracting entries from an archive
    7.1 arch:extract-binary
    7.2 arch:extract-text
8 Updating entries in an archive
    8.1 arch:delete
    8.2 arch:update
9 Creating an archive
    9.1 arch:create
10 Functions using XPath3.0 map() type
    10.1 Archive property maps
    10.2 Entry property maps
    10.3 archM:options
    10.4 archM:entries
    10.5 archM:entry-names
    10.6 archM:extract
    10.7 archM:extract-binary
    10.8 archM:extract-text
    10.9 archM:create
    10.10 archM:update
    10.11 archM:delete

Appendices

A References
B Summary of error conditions


1 Status of this document

This document is in an interim draft stage. Comments are welcome on the public-expath@w3.org mailing list (archive).

2 Introduction

2.1 Namespace conventions

The module defined by this document defines several functions, all contained in the namespace http://expath.org/ns/archive. In this document, the arch prefix, when used, is bound to this namespace URI.

Alternative versions of these functions using the proposed XPath3.0 map() type (see 4.3 Using map types to describe entries and options) are defined in the namespace http://expath.org/ns/archive-map. In this document, the archM prefix, when used, is bound to this namespace URI.

Error codes are defined in the same namespace (http://expath.org/ns/archive), and in this document are displayed with the same prefix, arch.

Note:

This follows the suggestion (in late August 2013) for a coherent naming standard in EXPath modules.

Binary file I/O, to read and write complete archives to files, uses facilities defined in [EXPath File], which defines functions in the namespace http://expath.org/ns/file. In this document, the file prefix, when used, is bound to this namespace URI.

Manipulation of binary data itself can employ functions from [EXPath Binary], which defines functions in the namespace http://expath.org/ns/binary. In this document, the bin prefix, when used, is bound to this namespace URI.

2.2 Error management

Error conditions are identified by a code (a QName). When such an error condition is reached in the evaluation of an expression, a dynamic error is thrown, with the corresponding error code (as if the standard XPath function error() had been called).

2.3 Archive representation

Archives in this module are represented principally as items of type xs:base64Binary, i.e. in their basic binary (byte sequence) forms.

Archives are treated as being arranged structurally as a description of overall options of the archive and a sequence of named entries. Each entry has:

  • A name, which is treated as a sequence of Unicode characters. In many cases the solidus character (/) is used to imply the entries being logically arranged in positions within a directory tree, but this is not mandatory.

  • A set of properties, including at least the uncompressed size of the entry, together with archive-internal properties such as the compression method used on the stored data, and other indications such as the date of last modification.

  • Data, treated as (possibly null) binary data.

It is most common that archives are considered to be arranged logically as directories, using the entry names to denote paths and file names (e.g. tests/qt3/archive/main.xml). In such circumstances, archives may contain entries to represent the directories themselves (e.g. tests/qt3/archive/), presumably with no data. [This could be used such that full extraction of an archive to a file system generates empty output directories, for example.] This specification makes no distinction between these two cases: if an archive has an empty 'directory' entry it will be treated similarly to any other 'file' entry. Semantic interpretation of entry names as files in directory trees is an application issue.

Note:

Behaviour when entries with duplicate names are detected in an archive is implementation dependent. Nevertheless, if an error is not thrown, only one entry should be returned when reading. Implementations must not write duplicate entries in result archives.

2.4 Archive types

The module is designed to be able to support a number of different types of archive, providing a coherent access mechanism.

The following archive types are required to be supported:

  • [ZIP]: (which also covers derivative archive formats, such as JAR or OpenDocument.)

  • [GZIP] : A compressed archive of a sequence of files

    Note:

    Within GZIP names of entries (original file names) are optional, on a per-file basis, so special measures may need to be taken to handle 'unnamed' sections.

Specific issues arise from i) archives used in streaming situations, where the internal manifests of the archives cannot be completed until all data is written, and ii) archives where the order of entries is important, such as JAR, where the manifest entries need to be first.

Note:

Currently there are no proposals within this module to cover encrypted archives.

3 Use cases

Development of this specification was driven by requirements which some XML developers regularly encounter in examining or generating data which is presented in archival forms. Some typical use cases include:

3.1 Creating a simple EPUB document

An [EPUB] document is a collection of content sections, written in XHTML, with a metadata descriptor (usually the content.opf file) and a navigation description (usually the toc.ncx file), all collected together and potentially compressed in a ZIP format. A simple example of creating such a document in XQuery is:

arch:create(
    (
      { "name" : "mimetype", "compression" : "store" },
      "META-INF/container.xml",
      "OEBPS/content.opf",
      "OEBPS/Text/title.xhtml",
      "OEBPS/Text/chap01.xhtml",
      "OEBPS/toc.ncx"
    ),
    (
      content:mimetype(),
      content:metainf(),
      content:oebps-content(),
      content:title(),
      content:chapter(),
      content:toc()
    )
  )

The user-supplied XQuery function content:mimetype() returns the appropriate mimetype description for the EPUB document as a string ("application/epub+zip"). Each of the other content:*() functions generates a serialized form of the appropriate XML structure, e.g.:

declare function content:title() as xs:string
{
  fn:serialize(
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
      <title>Title Page</title>
    </head>
    
    <body>
      <div>
        <h2 id="heading_id_2">Sample Book</h2>
    
        <h2 id="heading_id_3">A Sample .epub Book</h2>
    
        <h3 id="heading_id_4">Title Page</h3>
      </div>
    </body>
    </html>
  )
};

Using a map structure to define an entry enables properties such as compression to be altered on an entry-by-entry basis. For an EPUB document the mimetype entry must be uncompressed (so effectively it can be read by simple string searching), but other entries may be compressed.
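
The effect of per-entry compression can be illustrated outside XPath. The following is a non-normative sketch in Python, using the standard zipfile module as a stand-in for an implementation of this module. The function name create and its stored parameter are hypothetical, chosen only for this illustration, and duplicate entry names are not handled.

```python
import io
import zipfile

def create(names, contents, stored=frozenset()):
    """Build a ZIP archive in memory; names listed in `stored` are not compressed."""
    if len(names) != len(contents):
        # Stands in for the arch:entry-data-mismatch error condition.
        raise ValueError("arch:entry-data-mismatch")
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w") as zf:
        for name, data in zip(names, contents):
            method = zipfile.ZIP_STORED if name in stored else zipfile.ZIP_DEFLATED
            zf.writestr(name, data, compress_type=method)
    return out.getvalue()

epub = create(
    ["mimetype", "OEBPS/content.opf"],
    ["application/epub+zip", "<package/>" * 50],
    stored={"mimetype"},
)

# Because the first entry is stored, its text can be found by a plain byte search.
print(b"mimetypeapplication/epub+zip" in epub[:80])
```

In this sketch the stored mimetype entry sits directly after its own name in the byte stream, which is what makes the simple string search possible.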

3.2 Examining a JAR file

JAR files contain class code and definitions for Java classes, in entries whose names are path/classname.class. Local classes (classes defined within a class) have separate code entries with a classname outerclass$innerclass. To find all the main package-qualified classes the following XPath should suffice:

for $e in arch:entries(file:read-binary("lib/saxon9-sql.jar"))[ends-with(.,'.class') and not(contains(.,'$'))] 
  return replace(replace($e,'\.class$',''),'/','.')
=> 
   "net.sf.saxon.option.sql.SQLClose", 
   "net.sf.saxon.option.sql.SQLColumn", 
   "net.sf.saxon.option.sql.SQLConnect",
   ....,
   "net.sf.saxon.option.sql.SQLUpdate" 

3.3 Extracting a ZIP archive to a file system

Assuming the ZIP file in question has (empty) entries denoting any directories required, the following XSLT will unzip an archive to the current directory, using the file writing functions of [EXPath File]:

<xsl:variable name="arch" select="file:read-binary($uri)"/>
<xsl:variable name="entries" select="arch:entries($arch)"/>
<xsl:variable name="dirs" select="$entries[ends-with(.,'/')]"/>
<xsl:variable name="required.dirs"
            select="distinct-values(for $r in ($entries except $dirs) return
            replace($r,'/[^/]+$','/'))[ends-with(.,'/')]"/>
<xsl:sequence select="for $d in distinct-values(($required.dirs,$dirs))
         return file:create-dir(replace($d,'/$',''))"/>
<xsl:sequence select="for $f in ($entries except $dirs) 
        return file:write-binary($f,arch:extract-binary($arch,$f))"/>

(file:create-dir() creates necessary intermediate directories, so $dirs does not need to be in a sorted order. If the ZIP archive does not have entries for all directories, further intermediate code is required to identify those missing.)
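
Identifying directories that have no entry of their own amounts to collecting every ancestor path of every entry name. The following is a non-normative sketch of that step in Python; the function name required_dirs is hypothetical and entry names are assumed to be relative paths using the solidus as separator.

```python
from pathlib import PurePosixPath

def required_dirs(entry_names):
    """Return every directory path implied by the entry names, parents first."""
    dirs = set()
    for name in entry_names:
        path = PurePosixPath(name)
        # An entry ending in '/' is itself a directory; otherwise start at its parent.
        start = path if name.endswith("/") else path.parent
        for d in (start, *start.parents):
            if str(d) != ".":
                dirs.add(str(d) + "/")
    # A parent path is a prefix of its children, so plain sorting puts it first.
    return sorted(dirs)

print(required_dirs(["build.xml", "tests/qt3/binary/binary.xml"]))
```

Entries at the top level of the archive, such as build.xml here, contribute no directory.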

4 Describing archives and entries

The properties of overall archives and individual entries at the XDM level are described by small structured elements, with optional information attached. In this proposal this information is attached as attributes.

Note:

Parallels with XPath 3.0 serialization parameters, which are now sets of (element) nodes, become awkward. In arch:entry we would need to add an element arch:name to hold the name of an entry, rather than rely on the string value. The major point in favour of using elements rather than attributes would be where we need to read or set complex structured parameters, such as character maps. This needs discussion.

4.1 Archive properties and options

Archive options and properties are described as a structured element (element(arch:options)) with the following attributes:

  • format: the type of the archive, e.g. "zip". This is mandatory.

  • algorithm: the default compression used in the archive, e.g. "deflate".

Other attributes may be dependent upon the type of the archive and the implementation.

4.2 Entry descriptions

Entries within the archive can be accessed by name (xs:string) or a structured element (element(arch:entry)). In the latter case the entry name is the string value of the element.

When describing an existing entry in an archive, element(arch:entry) may be returned with the following optional attributes:

  • size: the original file size of the entry.

  • compressed-size: the compressed file size of the entry, i.e. the number of bytes it occupies in the archive.

  • last-modified: the date of last modification of this entry, in xs:dateTime notation.

  • compression-level: an indicator of the level of (lossless?) compression.

When used to create or update an entry in an archive, element(arch:entry) may also have the following optional attributes:

  • last-modified: the date of last modification to be written on this entry, in xs:dateTime notation.

  • compression-level: the level of (lossless?) compression to be used in writing the entry into the archive.

  • encoding: the encoding to be used for converting textual items to a byte sequence, prior to possible compression and writing to the archive.

(In writing actions, unknown attributes are ignored.)

4.3 Using map types to describe entries and options

Proposals in XPath 3.0 have been made for a type map(xs:untypedAtomic,item()*), which could be exploited beneficially for manipulating archives, using the entry name as the key and the (xs:base64Binary) value of the entry as the corresponding value in the map. These maps could be used both as output (in arch:entries() and arch:extract-[text|binary]()) or for input (in arch:update() and arch:create()). Equally well such maps can be used, reading keys only in arch:delete().

An attractive alternative would be for each entry itself to be a map($property as xs:string, $value as item()*), with suitable keys, e.g. content -> xs:base64Binary. Thus the entry 'set' can be a map of type map(xs:string, map(xs:string,item()*)).

Note:

Functions using such maps for arguments and results could either have separate names (e.g. arch:entries-as-map()) or be defined in a separate namespace (archM:entries()). In this current draft the second is used. Details are discussed in 10 Functions using XPath3.0 map() type.

Support for similar approaches using other map representations, such as [JSONiq] objects, may be implementation dependent.

5 Loading and saving archives

This module defines no specific functions for reading and writing archives from files, as distinct from their binary data. The EXPath File Module [EXPath File] provides two suitable functions to do this, file:read-binary() and file:write-binary(), both of which are used in the examples in this document.

Note:

There may be some desire for a convenience function arch:write($file as xs:string,....) as empty-sequence() which performs creation and file writing as one action.

6 Information about an archive and its contents

6.1 arch:options

Summary

Returns a description of the type and properties of a given archive.

Signature

arch:options($archive as xs:base64Binary) as element(arch:options)*

Rules

The description is returned as an element <arch:options> with an unordered sequence of child elements describing the details. The following are currently supported:

  • arch:format: format of this archive
  • arch:algorithm: the compression algorithm that was used.

If the archive format supports a compression algorithm varying on a per-entry basis, and more than one algorithm has been used in the archive, mixed is returned for arch:algorithm.

Error Conditions

[arch:read-error] is raised if there is an unspecified problem in reading the archive.

Examples

Finding the properties of the archive stored in a file located at $uri:

arch:options(file:read-binary($uri))
=> <arch:options>
     <arch:format>ZIP</arch:format>
     <arch:algorithm>deflate</arch:algorithm>
   </arch:options>

6.2 arch:entries

Summary

Returns the set of entry descriptors for all the entries found within the archive.

Signature

arch:entries($archive as xs:base64Binary) as element(arch:entry)*

Rules

Each descriptor is an element <arch:entry> whose text value is the path of the file within the archive. For more details of this structure see 4.2 Entry descriptions.

The entries are returned in the order in which they are encountered serially within the archive.

Error Conditions

[arch:read-error] is raised if there is an unspecified problem in reading the archive.

Notes

There may be a case for providing a sorted version, probably using some form of collation.

Examples

Finding the entries of the archive stored in a file located at $uri:

arch:entries(file:read-binary($uri))
=> <arch:entry size="2194" compressed-size="652" last-modified="2013-07-18T11:22:12">build.xml</arch:entry>
   <arch:entry size="84983" compressed-size="84872" last-modified="2009-03-23T11:15:06">lumley.jpg</arch:entry>
   <arch:entry size="10058" compressed-size="1381" last-modified="2013-08-06T13:14:08">tests/qt3/binary/binary.xml</arch:entry>
     

Counting the number of apparent XML files in the previous example:

count(arch:entries(file:read-binary($uri))[ends-with(.,'.xml')])
=> 2
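
For comparison, the same kind of listing can be sketched outside XPath. The following non-normative Python example uses the standard zipfile module as a stand-in for an implementation; the function name entries and the dictionary keys are assumptions that mirror the attribute names above, and the small archive is built in memory only so that the sketch can be run as it stands.

```python
import io
import zipfile
from datetime import datetime

def entries(archive_bytes):
    """Describe each entry (name, sizes, last-modified) in archive order."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return [
            {
                "name": info.filename,
                "size": info.file_size,
                "compressed-size": info.compress_size,
                "last-modified": datetime(*info.date_time).isoformat(),
            }
            for info in zf.infolist()
        ]

# Build a two-entry archive with fixed modification dates.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(zipfile.ZipInfo("build.xml", (2013, 7, 18, 11, 22, 12)), "<project/>")
    zf.writestr(zipfile.ZipInfo("tests/binary.xml", (2013, 8, 6, 13, 14, 8)), "<tests/>")

listing = entries(buf.getvalue())
xml_count = sum(1 for e in listing if e["name"].endswith(".xml"))
print(listing[0]["name"], listing[0]["last-modified"], xml_count)
```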
     

7 Extracting entries from an archive

The module does not attempt to discern the 'type' of an entry (such as 'text', 'XML', 'raw-binary'), leaving that to the programmer. Two forms of reading result are supported: raw binary (xs:base64Binary) and decoded text (xs:string).

7.1 arch:extract-binary

Summary

Returns the sequence of requested entries from the archive as binary data.

Signature

arch:extract-binary($archive as xs:base64Binary,
$entries as xs:string*) as xs:base64Binary*

Rules

Returns as binary data each entry in the archive $archive that corresponds to an entry name in $entries, in sequence.

The entries must be returned in the order corresponding to that of the entries requested in $entries, not in the order in which they may exist in the archive.

Multiple requests for the same entry will be honoured, with copies of the entry appearing in corresponding multiple locations in the output sequence.

Error Conditions

[arch:unknown-entry] is raised if an entry requested does not exist in this archive.

[arch:read-error] is raised if there was an unspecified problem in reading the archive.

Notes

There have been suggestions for a signature arch:extract-binary($archive as xs:base64Binary) returning all the entries. In the absence of maps in the return type, this does not make sense, since the entries are totally unlabelled, and to get anything meaningful, a parallel call on arch:entries() would be required.

Examples

Returning the binary data for an entry in the archive stored in a file located at $uri:

arch:extract-binary(file:read-binary($uri),'build.xml')
=> stuff
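
The ordering and duplication rules above can be illustrated with a non-normative Python sketch, again using the standard zipfile module as a stand-in. The names extract_binary and UnknownEntry are hypothetical; UnknownEntry plays the role of the arch:unknown-entry error condition.

```python
import io
import zipfile

class UnknownEntry(KeyError):
    """Stands in for the arch:unknown-entry error condition."""

def extract_binary(archive_bytes, names):
    """Return entry contents in request order; repeated names yield repeated copies."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        present = set(zf.namelist())
        result = []
        for name in names:
            if name not in present:
                raise UnknownEntry(name)
            result.append(zf.read(name))
        return result

# A two-entry archive, built in memory so the sketch is runnable.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", b"A")
    zf.writestr("b.txt", b"B")
archive = buf.getvalue()

# Request order, not archive order, decides the result order.
print(extract_binary(archive, ["b.txt", "a.txt", "b.txt"]))
```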
     

7.2 arch:extract-text

Summary

Returns the sequence of requested entries from the archive as strings. If $encoding is specified the strings are decoded appropriately, otherwise UTF-8 encoding is assumed.

Signatures

arch:extract-text($archive as xs:base64Binary,
$entries as xs:string*) as xs:string*
arch:extract-text($archive as xs:base64Binary,
$entries as xs:string*,
$encoding as xs:string) as xs:string*

Rules

Returns as a string each entry in the archive $archive that corresponds to an entry name in $entries, in sequence.

If $encoding is specified the strings are decoded appropriately, otherwise UTF-8 encoding is assumed.

The entries must be returned in the order corresponding to that of the entries requested in $entries, not in the order in which they may exist in the archive.

Multiple requests for the same entry will be honoured, with copies of the entry appearing in corresponding multiple locations in the output sequence.

Error Conditions

[arch:unknown-entry] is raised if an entry requested does not exist in this archive.

[arch:unknown-encoding] is raised if the encoding requested is unknown or unsupported.

[arch:decoding-error] is raised if there was an error in decoding the entry.

[arch:read-error] is raised if there was an unspecified problem in reading the archive.

Notes

This function should be equivalent to the use of arch:extract-binary() and the function bin:decode-string() from [EXPath Binary]:

arch:extract-binary($archive,$entries) ! bin:decode-string(.,$encoding)    [XPath 3.0]
for $b in arch:extract-binary($archive,$entries)
  return bin:decode-string($b,$encoding)    [XPath 2.0]
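
The decoding step, together with the two encoding-related error conditions listed above, can be sketched in a non-normative way in Python. The names decode_entries, UnknownEncoding and DecodingError are hypothetical; the two exception classes stand in for arch:unknown-encoding and arch:decoding-error respectively.

```python
import codecs

class UnknownEncoding(ValueError):
    """Stands in for the arch:unknown-encoding error condition."""

class DecodingError(ValueError):
    """Stands in for the arch:decoding-error error condition."""

def decode_entries(binaries, encoding="UTF-8"):
    """Decode each extracted byte string; UTF-8 is assumed when no encoding is given."""
    try:
        codecs.lookup(encoding)
    except LookupError:
        raise UnknownEncoding(encoding) from None
    try:
        return [b.decode(encoding) for b in binaries]
    except UnicodeDecodeError as exc:
        raise DecodingError(str(exc)) from None

# The same text stored under two different encodings.
print(decode_entries([b"caf\xc3\xa9"]) == decode_entries([b"caf\xe9"], "ISO-8859-1"))
```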

Further conversion into XML can be achieved using the XPath 3.0 function fn:parse-xml() on each of the returned strings.

There have been suggestions for a signature arch:extract-text($archive as xs:base64Binary) returning all the entries. In the absence of maps in the return type, this does not make sense, since the entries are totally unlabelled, and to get anything meaningful, a parallel call on arch:entries() would be required.

Examples

Returning the text data for an entry in the archive stored in a file located at $uri:

arch:extract-text(file:read-binary($uri),'build.xml','UTF-8')
=> stuff
     

8 Updating entries in an archive

There are two atomic actions available to change entries within an archive: complete deletion of an entry, or complete updating (overwriting) of that entry. The latter adds new entries when the given name does not already exist in the archive.

8.1 arch:delete

Summary

Returns an archive with the given entries deleted.

Signature

arch:delete($archive as xs:base64Binary,
$entries as xs:string*) as xs:base64Binary

Rules

Returns an archive of the same format as $archive with all the entries named in $entries deleted.

The relative order of the remaining entries within the archive is preserved.

The uncompressed content, size and last-modified date of each remaining entry shall be the same as before the deletion. Compressed sizes may alter.

Duplicate entries in $entries are ignored.

If $entries is the empty sequence, the original archive shall be returned.

Error Conditions

[arch:unknown-entry] is raised if an entry requested for deletion does not exist in this archive.

[arch:read-error] is raised if there was an unspecified problem in reading the archive.

Notes

Whilst the uncompressed entries remaining after deletion should of course be the same size and content as those before deletion, depending upon the (lossless) compression algorithm used, the compressed sizes and content might not be. In the absence of a special check, in these circumstances $archive may not be identical to arch:delete($archive,()). This needs discussion.

Examples

Deleting an entry from the archive stored in a file located at $uri, and listing the entries that remain:

arch:entries(arch:delete(file:read-binary($uri),'lumley.jpg'))
=> <arch:entry size="2194" compressed-size="652" last-modified="2013-07-18T11:22:12">build.xml</arch:entry>
   <arch:entry size="10058" compressed-size="1381" last-modified="2013-08-06T13:14:08">tests/qt3/binary/binary.xml</arch:entry>
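
The deletion rules (order preserved, duplicate requests ignored, unknown names rejected) can be illustrated with a non-normative Python sketch based on the standard zipfile module. The function names delete, build and names_of are hypothetical; KeyError stands in for arch:unknown-entry.

```python
import io
import zipfile

def delete(archive_bytes, names):
    """Return a new archive without the named entries, keeping the rest in order."""
    doomed = set(names)  # duplicate names in the request collapse here
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as src:
        missing = doomed - set(src.namelist())
        if missing:
            raise KeyError(sorted(missing)[0])  # stands in for arch:unknown-entry
        with zipfile.ZipFile(out, "w") as dst:
            for info in src.infolist():
                if info.filename in doomed:
                    continue
                # A fresh ZipInfo carries over name, date and compression method.
                kept = zipfile.ZipInfo(info.filename, info.date_time)
                kept.compress_type = info.compress_type
                dst.writestr(kept, src.read(info.filename))
    return out.getvalue()

def build(pairs):
    """Demonstration helper: make an in-memory ZIP from (name, bytes) pairs."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in pairs:
            zf.writestr(name, data)
    return buf.getvalue()

def names_of(archive_bytes):
    """Demonstration helper: entry names in archive order."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return zf.namelist()

original = build([("build.xml", b"<project/>"),
                  ("lumley.jpg", b"\xff\xd8"),
                  ("tests/binary.xml", b"<t/>")])
remaining = names_of(delete(original, ["lumley.jpg", "lumley.jpg"]))
print(remaining)
```

Because the sketch always rewrites the archive, the bytes returned for an empty $entries are not guaranteed to equal the input bytes, which is the situation described in the note above.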
     

8.2 arch:update

Summary

Returns an archive with each of the given entries in $entries updated to the corresponding values in the sequence $new. If an entry is not found, a new entry is added to the end of the archive.

Signatures

arch:update($archive as xs:base64Binary,
$entries as xs:string*,
$new as xs:base64Binary*) as xs:base64Binary
arch:update($archive as xs:base64Binary,
$entries as xs:string*,
$new as xs:base64Binary*,
$last-modified as xs:dateTime) as xs:base64Binary

Rules

Returns an archive of the same format as $archive with each of the given entries in $entries updated to the corresponding value in the sequence $new. If an entry is not found, a new entry for it is added to the end of the archive.

The relative order of all the existing and replaced entries within the archive is preserved. New entries appear at the end of the archive in the order in which they were specified in the call.

If specified, and the format supports it, the last-modified date for each of the updated entries will be set to $last-modified. In the absence of such a parameter, it is implementation-dependent whether last-modified information will be written on the updated entries. If such default last-modification is written, it should be comparable to the value of fn:current-dateTime() in an XSLT environment.

The uncompressed content, size and last-modified date of the entries that are not updated shall be the same as before the update. Compressed sizes may alter.

The compression methods of the updated entries shall be preserved.

When duplicate names appear in the entry list, the value of the entry in the resulting archive will be that of the value of $new corresponding to the last matching entry name.

Error Conditions

[arch:entry-data-mismatch] is raised if count($entries) ne count($new).

[arch:read-error] is raised if there was an unspecified problem in reading or creating the archive.
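
The update rules can be illustrated with a non-normative Python sketch based on the standard zipfile module. The function names update and contents_of are hypothetical, ValueError stands in for arch:entry-data-mismatch, and the sketch makes two assumptions of its own: replaced entries keep their previous last-modified date, and a new name that is repeated in the request is placed where it first occurs.

```python
import io
import zipfile

def update(archive_bytes, names, new_values):
    """Replace existing entries in place and append unknown names at the end."""
    if len(names) != len(new_values):
        raise ValueError("arch:entry-data-mismatch")
    # For a repeated name the dict keeps the last value supplied.
    replacement = dict(zip(names, new_values))
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as src, \
         zipfile.ZipFile(out, "w") as dst:
        for info in src.infolist():
            data = replacement.pop(info.filename, None)
            if data is None:
                data = src.read(info.filename)  # entry is not being updated
            fresh = zipfile.ZipInfo(info.filename, info.date_time)
            fresh.compress_type = info.compress_type  # compression method preserved
            dst.writestr(fresh, data)
        # Whatever is left was not in the archive: append in request order.
        for name, data in replacement.items():
            dst.writestr(name, data)
    return out.getvalue()

def contents_of(archive_bytes):
    """Demonstration helper: (name, bytes) pairs in archive order."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return [(i.filename, zf.read(i.filename)) for i in zf.infolist()]

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a", b"1")
    zf.writestr("b", b"2")

updated = update(buf.getvalue(), ["b", "c", "b"], [b"x", b"new", b"y"])
print(contents_of(updated))
```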

9 Creating an archive

New archives can be created from a sequence of entry names and a corresponding sequence of entry contents.

9.1 arch:create

Summary

Returns a new archive with each of the given entries in $entries set to the corresponding values in the sequence $new.

Signatures

arch:create($entries as xs:string*, $new as xs:base64Binary*) as xs:base64Binary
arch:create($entries as xs:string*,
$new as xs:base64Binary*,
$options as element(arch:options)) as xs:base64Binary

Rules

Returns an archive of format specified by $options with each of the given entries in $entries set to the corresponding value in the sequence $new.

The relative order of new entries within the archive follows that of the input.

When duplicate names appear in the entry list, the value of the entry in the resulting archive will be that of the value of $new corresponding to the last matching entry name.

Error Conditions

[arch:entry-data-mismatch] is raised if count($entries) ne count($new).

[arch:read-error] is raised if there was an unspecified problem in reading or creating the archive.

10 Functions using XPath3.0 map() type

Maps proposed for XPath3.0 can increase the coherence of the functions in the module, mainly by retaining the structured connection between the entry name and its properties and content. In addition the properties of the overall archive (and its defaults for new entries) can similarly be defined in a single map.

This section proposes parallel functions to those above using maps.

Note:

map:keys($map as map(*)) as xs:anyAtomicType* returns the keys that are present in a map, in unpredictable order. This means that if order within an archive is important (either in extraction or updating) other mechanisms may be needed.

In general when using maps for denoting the entries to be manipulated, the arguments might be considered to be a (possibly empty) sequence of maps that are treated as if concatenated. [THIS NEEDS THOUGHT ABOUT OVERWRITING/MERGING COMMON KEYS]

10.1 Archive property maps

Using a reserved name within the overall map (such as arch:options) would allow the options/properties for an archive to be stored alongside the entries.

10.2 Entry property maps

Entries within the archive can also be accessed or described by entries in a map (map(xs:string,map(xs:string,item()*))). In this case the map key gives the (path)name of the archive entry (e.g. build/build-j.xml) and the value is a map of the properties of that entry.

The following keys are provided when reporting on entries:

  • size: the original file size of the entry as xs:integer

  • compressed-size: the compressed file size of the entry as xs:integer, i.e. the number of bytes it occupies in the archive.

  • last-modified: the date of last modification of this entry, in xs:dateTime notation

  • compression-level: an indicator of the level of (lossless?) compression.

  • content: the value of the entry read from the archive, as xs:base64Binary. This will only be set if $return-content is requested in the call to archM:entries().

When used to extract an entry from an archive, this map may have the following optional key/value pairs:

  • encoding: the encoding to be used for converting textual items from a byte sequence.

When used to create or update an entry in an archive, this map may have the following optional key/value pairs:

  • content: the value of the entry to be written in the archive, either as xs:base64Binary or, when encoding is set, as xs:string.

    Note:

    This is awkward - why not just insist on xs:base64Binary and let the programmer encode?

  • last-modified: the date of last modification to be written on this entry, in xs:dateTime notation

  • compression-level: the level of (lossless?) compression to be used in writing the entry into the archive.

  • encoding: the encoding to be used for converting textual items to a byte sequence, prior to possible compression and writing to the archive.

10.3 archM:options

Summary

Returns a description of the type and properties of a given archive as a map.

Signature

archM:options($archive as xs:base64Binary) as map(xs:string,item()?)

Rules

The description is returned as a map map(xs:string,item()?) with entries describing the details. The following are currently supported:

  • format: format of this archive
  • compression: the compression algorithm that was used.

If the archive format supports a compression algorithm varying on a per-entry basis, and more than one algorithm has been used in the archive, mixed is returned for the compression entry.

Error Conditions

[arch:read-error] is raised if there is an unspecified problem in reading the archive.

Examples

Finding the properties of the archive stored in a file located at $uri:

archM:options(file:read-binary($uri))
=> {'format' :'zip', 'compression' : 'deflate'}

10.4 archM:entries

Summary

Returns the entry descriptors for all the entries found within the archive as a map, optionally each with their content.

Signatures

archM:entries($archive as xs:base64Binary) as map(xs:string,map(xs:string,item()*))
archM:entries($archive as xs:base64Binary,
$return-content as xs:boolean) as map(xs:string,map(xs:string,item()*))

Rules

Keys to the returned map are the entry (path) names.

The value for each map entry is a map describing the properties of that entry. For more details of this structure see 10.2 Entry property maps.

If $return-content is defined and equals true(), then the content for each entry is returned as the content entry in the property map, as an xs:base64Binary item.

Error Conditions

[arch:read-error] is raised if there is an unspecified problem in reading the archive.

Notes

As the returned order of keys from map:keys() is not defined and can be implementation-dependent, there may be a need for a simple function (archM:entry-names(xs:base64Binary) as xs:string*) which returns purely the names in the order in which they appear in the archive.

Using $return-content makes it possible to return a complete archive in a single call. (What about the archive options?)

Examples

Finding the entries of the archive stored in a file located at $uri:

archM:entries(file:read-binary($uri))
=> map{ 
  "build.xml" := map{ "size":=2194, "compressed-size":=652, "last-modified":="2013-07-18T11:22:12"},
  "lumley.jpg" := map{ "size":=84983, "compressed-size":=84872, "last-modified":="2009-03-23T11:15:06"},
  "tests/qt3/binary/binary.xml" := map{ "size":=10058, "compressed-size":=1381, "last-modified":="2013-08-06T13:14:08"}}
     

Counting the number of apparent XML files in the previous example:

count(map:keys(archM:entries(file:read-binary($uri)))[ends-with(.,'.xml')])
=> 2
     

10.5 archM:entry-names

Summary

Returns the entry names for all the entries found within the archive as a sequence of string values in the order in which they appear in the archive.

Signature

archM:entry-names($archive as xs:base64Binary) as xs:string*

Rules

Returns the entry names for all the entries found within the archive as a sequence of string values in the order in which they appear in the archive.

Error Conditions

[arch:read-error] is raised if there is an unspecified problem in reading the archive.

10.6 archM:extract

Summary

Returns a copy of $entries with the content entries set to binary or decoded string data for the appropriate entry in the archive.

Signature

archM:extract($archive as xs:base64Binary,
$entries as map(xs:string,map(xs:string,item()?))) as map(xs:string,map(xs:string,item()?))

Rules

The map entries in $entries define whether binary or decoded string data is to be returned.

The behaviour of this function is defined by equivalent XPath:

map:new(for $k in map:keys($entries) 
   return 
     let $a := $entries($k),
         $text := map:contains($a,'encoding'),
         $encoding := ($a('encoding'),'UTF-8')[1],
         $data := arch:extract-binary($archive,$k) (: error if not found :)
     return 
         map:entry($k,
             map:new(($a,
               map:entry('content',
                 if ($text) then bin:decode-string($data,$encoding) else $data)
               ))
       )
   )
     
Error Conditions

[arch:unknown-entry] is raised if an entry requested does not exist in this archive.

[arch:read-error] is raised if there was an unspecified problem in reading the archive.
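
For readers who want to experiment with these semantics, the following Python sketch models the equivalent XPath given under Rules. It is an assumption-laden illustration, not a definition: dictionaries stand in for XPath maps, the standard zipfile module stands in for the archive, KeyError stands in for [arch:unknown-entry], and the names make_archive and extract are hypothetical.

```python
import io
import zipfile

def make_archive(entries):
    """Build an in-memory ZIP from (name, bytes) pairs."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in entries:
            zf.writestr(name, data)
    return buffer.getvalue()

def extract(archive, entries):
    """Model of archM:extract: copy each property dict and add 'content'."""
    result = {}
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        for name, props in entries.items():
            data = zf.read(name)  # KeyError if the entry is not in the archive
            if "encoding" in props:
                # Presence of 'encoding' requests text; an empty value falls back to UTF-8.
                content = data.decode(props["encoding"] or "UTF-8")
            else:
                content = data  # no 'encoding' key: keep the binary data
            result[name] = {**props, "content": content}
    return result

archive = make_archive([("a.txt", "caf\u00e9".encode("utf-8")), ("img.bin", b"\x00\x01")])
extracted = extract(archive, {"a.txt": {"encoding": "UTF-8"}, "img.bin": {}})
print(extracted["a.txt"]["content"])    # text, decoded as UTF-8
print(extracted["img.bin"]["content"])  # bytes, returned undecoded
```

The other keys of each property dictionary are carried over unchanged, mirroring map:new(($a, map:entry('content', ...))).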

Examples

To collect all the XML entries as XML:

let $archive := file:read-binary($uri),
    $entries := archM:entries($archive),
    $xml-names := map:keys($entries)[ends-with(.,'.xml')],
    $get := map:new($xml-names ! map:entry(.,map:entry('encoding','UTF-8'))),
    $content := archM:extract($archive,$get)
return
    $xml-names ! fn:parse-xml($content(.)('content'))
     

10.7 archM:extract-binary

Summary

Returns the sequence of requested entries from the archive as binary data.

Signatures

archM:extract-binary($archive as xs:base64Binary,
$entries as map(xs:string,map(xs:string,item()?))) as xs:base64Binary*
archM:extract-binary($archive as xs:base64Binary,
$entries as xs:string*) as xs:base64Binary*

Rules

Returns as binary data each entry in the archive $archive whose name is given in $entries (when $entries is a sequence of strings) or in map:keys($entries) (when $entries is a map), in sequence.

When $entries has type xs:string*, the entries must be returned in the order corresponding to that of the entries requested in $entries, not in the order in which they may exist in the archive.

When $entries has type xs:string*, multiple requests for the same entry will be honoured, with copies of the entry appearing in corresponding multiple locations in the output sequence.
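
The two rules for the xs:string* signature (request order wins over archive order, and repeated names yield repeated results) can be sketched in Python as follows. This is only a behavioural model built on the standard zipfile module; make_archive and extract_binary are hypothetical names.

```python
import io
import zipfile

def make_archive(entries):
    """Build an in-memory ZIP from (name, bytes) pairs."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in entries:
            zf.writestr(name, data)
    return buffer.getvalue()

def extract_binary(archive, names):
    """Model of archM:extract-binary with $entries as xs:string*."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        # One result per requested name, in request order; duplicates are kept.
        return [zf.read(name) for name in names]

archive = make_archive([("a", b"A"), ("b", b"B"), ("c", b"C")])
result = extract_binary(archive, ["c", "a", "c"])
print(result)  # [b'C', b'A', b'C']
```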

Error Conditions

[arch:unknown-entry] is raised if an entry requested does not exist in this archive.

[arch:read-error] is raised if there was an unspecified problem in reading the archive.

Notes

Collection of all the entries as binary data can be accomplished using archM:entries($archive,true()) and collecting the 'content' entry from each of the returned maps.

The signature with $entries of type xs:string* is equivalent to arch:extract-binary().

10.8 archM:extract-text

Summary

Returns the sequence of requested entries from the archive as decoded string data.

Signatures

archM:extract-text($archive as xs:base64Binary,
$entries as map(xs:string,map(xs:string,item()?))) as xs:string*
archM:extract-text($archive as xs:base64Binary,
$entries as map(xs:string,map(xs:string,item()?)),
$encoding as xs:string) as xs:string*
archM:extract-text($archive as xs:base64Binary,
$entries as xs:string*) as xs:string*
archM:extract-text($archive as xs:base64Binary,
$entries as xs:string*,
$encoding as xs:string) as xs:string*

Rules

Returns as decoded string data each entry in the archive $archive whose name is given in $entries (when $entries is a sequence of strings) or in map:keys($entries) (when $entries is a map), in sequence.

When $entries has type xs:string*, the entries must be returned in the order corresponding to that of the entries requested in $entries, not in the order in which they may exist in the archive.

When $entries has type xs:string*, multiple requests for the same entry will be honoured, with copies of the entry appearing in corresponding multiple locations in the output sequence.

If $encoding is specified, or the field 'encoding' appears in the entry in $entries, the strings are decoded according to that encoding; otherwise UTF-8 encoding is assumed.
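
The effect of the encoding argument on the xs:string* signatures can be illustrated with a small Python model. It assumes a ZIP-type archive handled through the standard zipfile module, and the helper names make_archive and extract_text are hypothetical. How a per-entry encoding field and $encoding interact when both are present is not modelled here.

```python
import io
import zipfile

def make_archive(entries):
    """Build an in-memory ZIP from (name, bytes) pairs."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in entries:
            zf.writestr(name, data)
    return buffer.getvalue()

def extract_text(archive, names, encoding="UTF-8"):
    """Model of archM:extract-text with $entries as xs:string*; UTF-8 by default."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return [zf.read(name).decode(encoding) for name in names]

archive = make_archive([
    ("utf8.txt", "caf\u00e9".encode("utf-8")),
    ("latin1.txt", "caf\u00e9".encode("iso-8859-1")),
])
default_text = extract_text(archive, ["utf8.txt"])                # decoded as UTF-8
latin1_text = extract_text(archive, ["latin1.txt"], "ISO-8859-1")  # explicit encoding
```

In this model an unsupported encoding name surfaces as Python's LookupError and undecodable bytes as UnicodeDecodeError, which play the roles of [arch:unknown-encoding] and [arch:decoding-error] respectively.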

Error Conditions

[arch:unknown-entry] is raised if an entry requested does not exist in this archive.

[arch:unknown-encoding] is raised if an encoding requested is unknown or unsupported.

[arch:decoding-error] is raised if there was an error in decoding an entry.

[arch:read-error] is raised if there was an unspecified problem in reading the archive.

Notes

The signatures with $entries instance of xs:string* are equivalent to arch:extract-text().

10.9 archM:create

Summary

Returns a new archive with each of the given entries named as a key in $entries set to the corresponding value in $entries($key)('content').

Signatures

archM:create($entries as map(xs:string,map(xs:string,item()?))*) as xs:base64Binary
archM:create($entries as map(xs:string,map(xs:string,item()?))*,
$options as map(xs:string,item())) as xs:base64Binary

Rules

Returns an archive of the format specified by $options with each of the given entries named as a key in $entries set to the corresponding value in $entries($key)('content').

The relative order of new entries within the archive follows that of the input.

If $options is specified, the overall archive properties (and defaults for the entries) are set to those specified in the map.
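
A minimal Python sketch of the creation rules, assuming a ZIP-type archive and the standard zipfile module, is given here for illustration. A single insertion-ordered dictionary stands in for the $entries maps, $options is not modelled, and the function name create is hypothetical.

```python
import io
import zipfile

def create(entries):
    """Model of archM:create: entries maps each name to a dict holding 'content'."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        # Python dicts preserve insertion order, so archive order follows the input.
        for name, props in entries.items():
            zf.writestr(name, props["content"])
    return buffer.getvalue()

new_archive = create({
    "mimetype": {"content": b"application/epub+zip"},
    "META-INF/container.xml": {"content": b"<container/>"},
})

with zipfile.ZipFile(io.BytesIO(new_archive)) as zf:
    created_names = zf.namelist()
    mimetype = zf.read("mimetype")
print(created_names)  # ['mimetype', 'META-INF/container.xml']
```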

Error Conditions

[arch:read-error] is raised if there was an unspecified problem in creating the archive.

10.10 archM:update

Summary

Returns an archive with each of the given entries in the keys of $entries updated to the corresponding values in the $entries($key)('content') and with other properties defined by $entries($key)(*). If an entry is not found, a new entry is added to the end of the archive.

Signatures

archM:update($archive as xs:base64Binary,
$entries as map(xs:string,map(xs:string,item()?))) as xs:base64Binary
archM:update($archive as xs:base64Binary,
$entries as map(xs:string,map(xs:string,item()?)),
$default as map(xs:string,item())) as xs:base64Binary

Rules

Returns an archive with each of the given entries in the keys of $entries updated to the corresponding values in the $entries($key)('content') and with other properties defined by $entries($key)(*). If an entry is not found, a new entry is added to the end of the archive.

If $default is specified, its values will be used as the default properties for each entry, which may be overridden by the property map for each individual entry.

The relative order of all the existing and replaced entries within the archive is preserved. New entries appear at the end of the archive in the order in which they were specified in the call.

The uncompressed content, size and last-modified date of the remaining entries shall be the same as those for those entries before the update. Compressed sizes may alter.

The compression methods of the updated entries shall be preserved.
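
The ordering and preservation rules above can be modelled in Python as a sketch. The assumptions are: a ZIP-type archive handled with the standard zipfile module, a dictionary in place of the $entries map, no $default map, and hypothetical helper names make_archive and update. The last-modified date given to replaced and new entries (the current time) is a choice made by this model, not something stated by the rules.

```python
import io
import zipfile

def make_archive(entries, date_time=(2013, 9, 30, 12, 0, 0)):
    """Build an in-memory ZIP from (name, bytes) pairs with a fixed timestamp."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        for name, data in entries:
            info = zipfile.ZipInfo(name, date_time=date_time)
            info.compress_type = zipfile.ZIP_DEFLATED
            zf.writestr(info, data)
    return buffer.getvalue()

def update(archive, entries):
    """Model of archM:update: replace in place, append unknown names at the end."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(archive)) as zin, zipfile.ZipFile(out, "w") as zout:
        existing = set()
        for info in zin.infolist():
            existing.add(info.filename)
            if info.filename in entries:
                # Replaced entry keeps its position and compression method.
                zout.writestr(info.filename, entries[info.filename]["content"],
                              compress_type=info.compress_type)
            else:
                # Untouched entry: content and last-modified date are carried over.
                zout.writestr(info, zin.read(info.filename))
        for name, props in entries.items():
            if name not in existing:
                zout.writestr(name, props["content"])  # new entries, in call order
    return out.getvalue()

old = make_archive([("a.txt", b"A"), ("b.txt", b"B"), ("c.txt", b"C")])
new = update(old, {"z.txt": {"content": b"Z"}, "b.txt": {"content": b"new B"}})
with zipfile.ZipFile(io.BytesIO(new)) as zf:
    updated_names = zf.namelist()
    updated_b = zf.read("b.txt")
    kept_date = zf.getinfo("a.txt").date_time
    b_method = zf.getinfo("b.txt").compress_type
print(updated_names)  # ['a.txt', 'b.txt', 'c.txt', 'z.txt']
```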

Error Conditions

[arch:read-error] is raised if there was an unspecified problem in reading or creating the archive.

Notes

Using the $default map, a common compression method, last-modification date and similar properties can be set for a set of entries, whose minimal map entries are map{"content":=$content}.

10.11 archM:delete

Summary

Returns an archive with the given entries deleted.

Signatures

archM:delete($archive as xs:base64Binary,
$entries as xs:string*) as xs:base64Binary
archM:delete($archive as xs:base64Binary,
$entries as map(xs:string,map(xs:string,item()))*) as xs:base64Binary

Rules

Returns an archive of the same format as $archive with all the entries named in $entries or $entries!map:keys(.) deleted.

The relative order of the remaining entries within the archive is preserved.

The uncompressed content, size and last-modified date of the remaining entries shall be the same as those for those entries before deletion. Compressed sizes may alter.

Duplicate entries in $entries are ignored.

If $entries is the empty sequence, or an empty map, the original archive shall be returned.
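
The deletion rules can be sketched in Python for experimentation. The sketch assumes a ZIP-type archive and the standard zipfile module; make_archive and delete are hypothetical names, and KeyError stands in for the error raised when a requested entry does not exist.

```python
import io
import zipfile

def make_archive(entries):
    """Build an in-memory ZIP from (name, bytes) pairs."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in entries:
            zf.writestr(name, data)
    return buffer.getvalue()

def delete(archive, names):
    """Model of archM:delete with $entries as xs:string*."""
    wanted = set(names)  # duplicate names are ignored
    if not wanted:
        return archive   # nothing requested: the original archive is returned
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(archive)) as zin:
        missing = wanted - set(zin.namelist())
        if missing:
            raise KeyError(sorted(missing))  # a requested entry does not exist
        with zipfile.ZipFile(out, "w") as zout:
            for info in zin.infolist():
                if info.filename not in wanted:
                    # Remaining entries keep their relative order, content and date.
                    zout.writestr(info, zin.read(info.filename))
    return out.getvalue()

archive = make_archive([("a.txt", b"A"), ("b.txt", b"B"), ("c.txt", b"C")])
smaller = delete(archive, ["b.txt", "b.txt"])
with zipfile.ZipFile(io.BytesIO(smaller)) as zf:
    remaining = zf.namelist()
print(remaining)  # ['a.txt', 'c.txt']
```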

Error Conditions

[arch:unknown-entry] is raised if an entry requested for deletion does not exist in this archive.

[arch:read-error] is raised if there was an unspecified problem in reading the archive.

Notes

Whilst the uncompressed entries remaining after deletion should of course be the same size and content as those before deletion, depending upon the (lossless) compression algorithm used, the compressed sizes and content might not be. In the absence of a special check, in these circumstances $archive may not be identical to archM:delete($archive,()). This needs discussion.

The signature with $entries as xs:string* is defined as a convenience, to avoid the creation of a simple map. Otherwise it is completely analogous to arch:delete(xs:base64Binary,xs:string*).

A References

EPUB
EPUB 3 Overview. International Digital Publishing Forum. Recommended Specification 11 October 2011.
EXPath File
File Module. Christian Grün and Matthias Brantner, editors. EXPath Candidate Module. 14 June 2012.
EXPath Binary
Binary Module. Jirka Kosek and John Lumley, editors. EXPath Candidate Module. 6 August 2013.
F&O 3.0
XPath and XQuery Functions and Operators 3.0. Michael Kay, editor. W3C Candidate Recommendation 21 May 2013.
GZIP
GZIP file format specification version 4.3. L. Peter Deutsch, 1996.
JSONiq
JSONiq - The JSON Query Language. FLWOR Foundation. 2013.
XML Schema 1.1 Part 2
W3C XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes. David Peterson et al, editors. W3C Recommendation 5 April 2012.
ZIP
ZIP File Format Specification. PKWare, Version 6.3.3, 1 September 2012.

B Summary of error conditions

arch:read-error
There was a general error in reading the archive.
arch:unknown-entry
The specified entry does not exist in this archive.
arch:entry-data-mismatch
The sequence of entry names is not the same length as the sequence of updated values.
arch:unknown-encoding
The specified encoding is not supported.
arch:decoding-error
Error in decoding a string.