Binary Module

EXPath Candidate Module, 12 March 2013

Jirka Kosek <jirka@kosek.cz>

Copyright ©2013 Jirka Kosek, published by the EXPath Community Group under the W3C Community Contributor License Agreement (CLA). A human-readable summary is available.

This specification was published by the EXPath Community Group. It is not a W3C Standard nor is it on the W3C Standards Track. Please note that under the W3C Community Contributor License Agreement (CLA) there is a limited opt-out and other conditions apply. Learn more about W3C Community and Business Groups.

This proposal provides an API for XPath 2.0 to handle binary data. It defines extension functions to read binary files and to perform basic binary operations on data in memory, and it also defines a new serialization method. It has been designed to be compatible with XQuery 1.0 and XSLT 2.0, as well as with any other use of XPath 2.0.


Status of this document

This document is in an early draft stage. Comments are welcome on the public-expath@w3.org mailing list (archive).

Introduction

Namespace Conventions

The module defined by this document defines several functions and a serialization method, all contained in the namespace http://expath.org/ns/binary. In this document, the bin prefix, when used, is bound to this namespace URI.

Error codes are defined in the namespace http://expath.org/ns/error. In this document, the err prefix, when used, is bound to this namespace URI.

Use cases

Development of this specification was driven strictly by requirements that XML developers regularly face.

Some typical use cases:

Getting dimensions of an image file.

Extracting image metadata.

Processing images that are embedded and base64-encoded inside a SOAP message.

Processing a legacy text file that uses several different encodings in different places.

Loading binary data

bin:unparsed-binary

The bin:unparsed-binary function reads an external resource (for example, an image file) and returns the binary data of the resource.

The $href argument must be a string in the form of a URI reference, which must contain no fragment identifier, and must identify a resource for which a string representation is available. If the URI is a relative URI reference, then it is resolved relative to the Static Base URI property from the static context.

If the value of the $href argument is an empty sequence, the function returns an empty sequence.

The result of the function is a binary value of the resource retrieved using the URI.

Converting external images in an HTML document into inline data: URIs:

<xsl:template match="img[ends-with(@src, '.jpg')]">
  <xsl:copy>
    <xsl:copy-of select="@* except @src"/>
    <xsl:attribute name="src">
      <xsl:text>data:image/jpeg;base64,</xsl:text>
      <xsl:value-of select="xs:base64Binary(bin:unparsed-binary(resolve-uri(@src)))"/>
    </xsl:attribute>
  </xsl:copy>
</xsl:template>

Basic operations

bin:binary-subsequence

The bin:binary-subsequence function returns a specified part of binary data.

Returns the part of the original binary data that starts at $offset. The size of the returned data is $size octets.

The $offset is zero-based.

The value of the $offset argument must be a non-negative integer.

It is a dynamic error if $offset + $size is larger than the size of the binary data passed in the $in argument.

Testing whether the $data variable contains the content of a PDF file:

bin:binary-subsequence($data, 0, 4) eq xs:hexBinary("25504446")

25504446 is the magic number for PDF files; it is the US-ASCII encoding of %PDF.
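The offset and size rules of bin:binary-subsequence can also be modelled outside XPath. The following Python sketch only illustrates the semantics described above and is not part of the module. The names binary_subsequence and looks_like_pdf are hypothetical, ValueError stands in for the dynamic error (whose code is not yet defined), and the rejection of a negative size is an assumption that the text does not state.

```python
PDF_MAGIC = bytes.fromhex("25504446")  # US-ASCII encoding of "%PDF"


def binary_subsequence(data: bytes, offset: int, size: int) -> bytes:
    """Model of bin:binary-subsequence: `size` octets from zero-based `offset`."""
    if offset < 0:
        raise ValueError("offset must be a non-negative integer")
    if size < 0:
        # Assumption: the draft does not say what a negative size means.
        raise ValueError("size must be a non-negative integer")
    if offset + size > len(data):
        # Stands in for the dynamic error described in the text.
        raise ValueError("offset + size exceeds the size of the binary data")
    return data[offset:offset + size]


def looks_like_pdf(data: bytes) -> bool:
    """Check the first four octets against the PDF magic number."""
    return len(data) >= 4 and binary_subsequence(data, 0, 4) == PDF_MAGIC
```

The length guard in looks_like_pdf avoids triggering the error for inputs shorter than four octets.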

bin:binary-length

The bin:binary-length function returns the size of binary data.

Returns the size of the binary data in octets.

bin:binary-join

Returns binary data created by concatenating the binary data items in a sequence.

The function returns an xs:hexBinary created by concatenating the items in the sequence $in, in order.

If the value of $in is the empty sequence, the function returns the zero-length binary data.
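As an illustration of the concatenation rule, here is a minimal Python model. The name binary_join is hypothetical, and Python bytes objects stand in for xs:hexBinary values.

```python
from typing import Iterable


def binary_join(items: Iterable[bytes]) -> bytes:
    """Model of bin:binary-join: concatenate the items in order.

    An empty sequence yields zero-length binary data.
    """
    return b"".join(items)
```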

bin:binary-to-octets

Returns binary data as a sequence of octets.

If $in is zero-length binary data, an empty sequence is returned.

Octets are returned as integers from 0 to 255.

bin:octets-to-binary

Converts a sequence of octets into binary data.

Octets are integers from 0 to 255.
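The two octet conversions are inverses of each other. A small Python model may make this concrete. The function names are hypothetical, and the ValueError raised for out-of-range values is Python's own behaviour, not an error defined by this draft.

```python
from typing import Iterable, List


def binary_to_octets(data: bytes) -> List[int]:
    """Model of bin:binary-to-octets: integers from 0 to 255, empty for empty input."""
    return list(data)


def octets_to_binary(octets: Iterable[int]) -> bytes:
    """Model of bin:octets-to-binary.

    Python's bytes() raises ValueError for values outside 0..255.
    """
    return bytes(octets)
```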

Text decoding and encoding

bin:decode-string

Decodes binary data as a string in a given encoding.

The $encoding argument is the name of an encoding. Its values follow the same rules as the encoding attribute in an XML declaration. The only values that every implementation is required to recognize are utf-8 and utf-16.

bin:unpack-string

Decodes a chunk of binary data at a specified offset as a string in a given encoding.

If $size is greater than 0, this function is identical to calling bin:decode-string(bin:binary-subsequence($in, $offset, $size), $encoding).

If $size is zero, all non-zero octets from $offset up to the first zero octet are extracted and then decoded. This makes it easy to decode zero-terminated strings.

The $encoding argument is the name of an encoding. Its values follow the same rules as the encoding attribute in an XML declaration. The only values that every implementation is required to recognize are utf-8 and utf-16.
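The two modes of bin:unpack-string (fixed size and zero-terminated) can be sketched in Python as follows. This is only a model of the behaviour described above: the function names are hypothetical, bounds errors are not modelled, and reading to the end of the data when no zero octet is found is an assumption, since the draft does not cover that case.

```python
def decode_string(data: bytes, encoding: str) -> str:
    """Model of bin:decode-string."""
    return data.decode(encoding)


def unpack_string(data: bytes, offset: int, size: int, encoding: str) -> str:
    """Model of bin:unpack-string."""
    if size > 0:
        # Fixed-size mode: decode exactly `size` octets starting at `offset`.
        return decode_string(data[offset:offset + size], encoding)
    # Zero-terminated mode: take octets up to, but not including, the first zero octet.
    end = data.find(b"\x00", offset)
    if end == -1:
        # Assumption: with no terminator, decode to the end of the data.
        end = len(data)
    return decode_string(data[offset:end], encoding)
```

For example, unpack_string(b"AB\x00CD", 0, 0, "utf-8") yields "AB", because decoding stops at the zero octet.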

bin:encode-string

Encodes a string into binary data using a given encoding.

The $encoding argument is the name of an encoding. Its values follow the same rules as the encoding attribute in an XML declaration. The only values that every implementation is required to recognize are utf-8 and utf-16.

Packing and unpacking of encoded numeric values

bin:unpack-double

Extracts the double value stored at a given offset in binary data.

Little-endian number representation is assumed unless the $bigendian parameter is specified and has the value true().

The $offset is zero-based.

bin:unpack-float

Extracts the float value stored at a given offset in binary data.

Little-endian number representation is assumed unless the $bigendian parameter is specified and has the value true().

The $offset is zero-based.
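The effect of the $bigendian flag on the two floating-point functions can be illustrated with Python's struct module. This is a sketch of the described behaviour, not the module's API. The function names and argument order are hypothetical, and it assumes the stored values use the IEEE 754 binary64 and binary32 formats, which the draft does not state explicitly.

```python
import struct


def unpack_double(data: bytes, offset: int, bigendian: bool = False) -> float:
    """Model of bin:unpack-double: read 8 octets at zero-based `offset`."""
    fmt = ">d" if bigendian else "<d"  # little-endian unless bigendian is true
    return struct.unpack_from(fmt, data, offset)[0]


def unpack_float(data: bytes, offset: int, bigendian: bool = False) -> float:
    """Model of bin:unpack-float: read 4 octets at zero-based `offset`."""
    fmt = ">f" if bigendian else "<f"
    return struct.unpack_from(fmt, data, offset)[0]
```

struct.unpack_from raises struct.error when fewer octets than required remain after the offset.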

bin:unpack-long

Extracts the long (64-bit signed integer) value stored at a given offset in binary data.

Little-endian number representation is assumed unless the $bigendian parameter is specified and has the value true().

The $offset is zero-based.

bin:unpack-unsignedLong

Extracts the unsignedLong (64-bit unsigned integer) value stored at a given offset in binary data.

Little-endian number representation is assumed unless the $bigendian parameter is specified and has the value true().

The $offset is zero-based.

bin:unpack-int

Extracts the int (32-bit signed integer) value stored at a given offset in binary data.

Little-endian number representation is assumed unless the $bigendian parameter is specified and has the value true().

The $offset is zero-based.

bin:unpack-unsignedInt

Extracts the unsignedInt (32-bit unsigned integer) value stored at a given offset in binary data.

Little-endian number representation is assumed unless the $bigendian parameter is specified and has the value true().

The $offset is zero-based.

bin:unpack-short

Extracts the short (16-bit signed integer) value stored at a given offset in binary data.

Little-endian number representation is assumed unless the $bigendian parameter is specified and has the value true().

The $offset is zero-based.

bin:unpack-unsignedShort

Extracts the unsignedShort (16-bit unsigned integer) value stored at a given offset in binary data.

Little-endian number representation is assumed unless the $bigendian parameter is specified and has the value true().

The $offset is zero-based.

bin:unpack-byte

Extracts the byte (8-bit signed integer) value stored at a given offset in binary data.

The $offset is zero-based.

bin:unpack-unsignedByte

Extracts the unsignedByte (8-bit unsigned integer) value stored at a given offset in binary data.

The $offset is zero-based.
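All of the integer unpacking functions differ only in width and signedness, so a single Python helper can model them. The helper unpack_integer and its width and signed parameters are hypothetical, signed values are assumed to be stored in two's complement, and ValueError stands in for an error condition that the draft has not yet defined.

```python
def unpack_integer(data: bytes, offset: int, width: int, signed: bool,
                   bigendian: bool = False) -> int:
    """Read a `width`-octet integer at zero-based `offset`.

    width=8 models long/unsignedLong, 4 int/unsignedInt,
    2 short/unsignedShort and 1 byte/unsignedByte.
    """
    chunk = data[offset:offset + width]
    if offset < 0 or len(chunk) != width:
        raise ValueError("not enough octets at the given offset")
    byteorder = "big" if bigendian else "little"
    return int.from_bytes(chunk, byteorder, signed=signed)
```

For a single octet the byte order is irrelevant, which matches the absence of a $bigendian parameter in the descriptions of bin:unpack-byte and bin:unpack-unsignedByte.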

bin:pack-double

Returns the binary representation of a double value.

Little-endian number representation is assumed unless the $bigendian parameter is specified and has the value true().

bin:pack-float

Returns the binary representation of a float value.

Little-endian number representation is assumed unless the $bigendian parameter is specified and has the value true().

bin:pack-long

Returns the binary representation of a long (64-bit signed integer) value.

Little-endian number representation is assumed unless the $bigendian parameter is specified and has the value true().

bin:pack-unsignedLong

Returns the binary representation of an unsignedLong (64-bit unsigned integer) value.

Little-endian number representation is assumed unless the $bigendian parameter is specified and has the value true().

bin:pack-int

Returns the binary representation of an int (32-bit signed integer) value.

Little-endian number representation is assumed unless the $bigendian parameter is specified and has the value true().

bin:pack-unsignedInt

Returns the binary representation of an unsignedInt (32-bit unsigned integer) value.

Little-endian number representation is assumed unless the $bigendian parameter is specified and has the value true().

bin:pack-short

Returns the binary representation of a short (16-bit signed integer) value.

Little-endian number representation is assumed unless the $bigendian parameter is specified and has the value true().

bin:pack-unsignedShort

Returns the binary representation of an unsignedShort (16-bit unsigned integer) value.

Little-endian number representation is assumed unless the $bigendian parameter is specified and has the value true().

bin:pack-byte

Returns the binary representation of a byte (8-bit signed integer) value.

bin:pack-unsignedByte

Returns the binary representation of an unsignedByte (8-bit unsigned integer) value.
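The packing functions mirror the unpacking ones. The Python sketch below models them under the same assumptions: the names pack_integer and pack_double are hypothetical, integers use two's complement, and doubles use IEEE 754 binary64. The OverflowError raised for out-of-range values is Python's behaviour, not an error defined by this draft.

```python
import struct


def pack_integer(value: int, width: int, signed: bool,
                 bigendian: bool = False) -> bytes:
    """Encode `value` in `width` octets (8, 4, 2 or 1 for long, int, short, byte).

    int.to_bytes raises OverflowError if the value does not fit.
    """
    byteorder = "big" if bigendian else "little"
    return value.to_bytes(width, byteorder, signed=signed)


def pack_double(value: float, bigendian: bool = False) -> bytes:
    """Encode `value` as 8 octets, little-endian unless bigendian is true."""
    return struct.pack(">d" if bigendian else "<d", value)
```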

Bitwise operations

bin:binary-or

Returns the "bitwise or" of the arguments.

Returns the "bitwise or" of $a and $b.

If $a and $b do not have the same length, the shorter one is padded with zero octets to match the size of the longer argument.

bin:binary-xor

Returns the "bitwise xor" of the arguments.

Returns the "bitwise exclusive or" of $a and $b.

If $a and $b do not have the same length, the shorter one is padded with zero octets to match the size of the longer argument.

bin:binary-and

Returns the "bitwise and" of the arguments.

Returns the "bitwise and" of $a and $b.

If $a and $b do not have the same length, the shorter one is padded with zero octets to match the size of the longer argument.
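The padding rule shared by bin:binary-or, bin:binary-xor and bin:binary-and can be sketched in Python. The draft does not say at which end the zero octets are added; the model below assumes they are appended after the last octet of the shorter argument, and that choice is only an assumption. All function names are hypothetical.

```python
import operator
from typing import Callable


def _bitwise(a: bytes, b: bytes, op: Callable[[int, int], int]) -> bytes:
    """Apply `op` octet by octet after padding the shorter argument."""
    size = max(len(a), len(b))
    # Assumption: zero octets are appended at the end of the shorter argument.
    a = a.ljust(size, b"\x00")
    b = b.ljust(size, b"\x00")
    return bytes(op(x, y) for x, y in zip(a, b))


def binary_or(a: bytes, b: bytes) -> bytes:
    return _bitwise(a, b, operator.or_)


def binary_xor(a: bytes, b: bytes) -> bytes:
    return _bitwise(a, b, operator.xor)


def binary_and(a: bytes, b: bytes) -> bytes:
    return _bitwise(a, b, operator.and_)
```

With leading padding instead, the results for arguments of different lengths would differ, so an implementation has to settle on one interpretation.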

bin:binary-not

Returns the "bitwise not" of the argument.

Returns the "bitwise not" of the $in argument.
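A one-line Python model shows what inverting every bit means at the octet level. The name binary_not is hypothetical.

```python
def binary_not(data: bytes) -> bytes:
    """Model of bin:binary-not: invert every bit, keeping the length unchanged."""
    # The mask keeps each result in the range 0..255, since ~x is negative in Python.
    return bytes(~octet & 0xFF for octet in data)
```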

bin:binary-shift

Shifts bits in binary data.

If $by is positive, bits are shifted $by times to the left.

If $by is negative, bits are shifted -$by times to the right.

If $by is zero, the result is identical to the $in argument.

The result always has the same size as the $in argument.

The shift is logical: zeros are placed into the vacated bit positions.
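The shift rules can be modelled by treating the whole binary value as one unsigned integer. The sketch below assumes that the first octet holds the most significant bits, which the draft does not state; the name binary_shift is hypothetical.

```python
def binary_shift(data: bytes, by: int) -> bytes:
    """Model of bin:binary-shift: logical shift with a result of the same size."""
    nbits = len(data) * 8
    if nbits == 0 or by == 0:
        return data
    # Assumption: the first octet holds the most significant bits.
    value = int.from_bytes(data, "big")
    if by > 0:
        # Left shift; bits moved past the top are discarded by the mask.
        value = (value << by) & ((1 << nbits) - 1)
    else:
        # Right shift; zeros enter from the left because the value is unsigned.
        value >>= -by
    return value.to_bytes(len(data), "big")
```

Shifting by at least the total number of bits, in either direction, yields all-zero data of the original size.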

Serialization

A new serialization method, bin:binary, is defined. It can serialize only a sequence whose items are all of type xs:hexBinary or xs:base64Binary. Such a sequence is turned into one block of binary data using bin:binary-join and written out to the specified location.
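In procedural terms, the serialization method amounts to a type check, a join and a write. The Python sketch below models this with bytes objects standing in for xs:hexBinary and xs:base64Binary items. The name serialize_binary is hypothetical, and TypeError stands in for an error that the draft has not yet defined.

```python
import io
from typing import BinaryIO, Iterable


def serialize_binary(items: Iterable[object], out: BinaryIO) -> None:
    """Model of the bin:binary method: join binary items and write one block."""
    chunks = []
    for item in items:
        if not isinstance(item, (bytes, bytearray)):
            # Only binary items may appear in the sequence.
            raise TypeError("bin:binary can serialize only binary items")
        chunks.append(bytes(item))
    out.write(b"".join(chunks))


# Usage: two blobs end up as one contiguous block.
buffer = io.BytesIO()
serialize_binary([b"\x89PNG", b"\r\n\x1a\n"], buffer)
```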

Joining several blobs of data into a single file:

<xsl:result-document href="image.png" method="bin:binary">
  <xsl:sequence select="$image-header"/>
  <xsl:sequence select="$image-data"/>
</xsl:result-document>

Template for extracting HTML images embedded as data: URIs into separate external image files:

<xsl:template match="img[starts-with(@src, 'data:image/png;base64,')]">
  <xsl:copy>
    <xsl:copy-of select="@* except @src"/>
    <xsl:attribute name="src" select="concat(generate-id(), '.png')"/>
    <xsl:result-document href="{generate-id()}.png" method="bin:binary">
      <xsl:sequence select="xs:base64Binary(substring-after(@src, 'data:image/png;base64,'))"/>
    </xsl:result-document>
  </xsl:copy>
</xsl:template>

References

XSLT 2.0 and XQuery 1.0 Serialization. Scott Boag, Michael Kay, Joanne Tong, Norman Walsh, and Henry Zongaro, editors. W3C Recommendation. 23 January 2007.

XPath and XQuery Functions and Operators 1.1. Michael Kay, editor. W3C Working Draft. 15 January 2009.

XSL Transformations (XSLT) Version 2.0. Michael Kay, editor. W3C Recommendation. 23 January 2007.

Summary of Error Conditions

Proper error codes and conditions will be defined in the next version of this draft.