EXPath

HTTP Client Module

EXPath Candidate Module 9 January 2010

This version:
Latest version:
http://expath.org/spec/http-client
Previous versions:

http://expath.org/spec/http-client/20090302
http://expath.org/spec/http-client/20090112
Editor:
Florent Georges, H2O Consulting

This document is also available in these non-normative formats: XML and Revision markup.


Abstract

This proposal provides an HTTP client interface for XPath 2.0. It defines one extension function to perform HTTP requests, and has been designed to be compatible with XQuery 1.0 and XSLT 2.0, as well as any other XPath 2.0 usage.

Table of Contents

1 Introduction
    1.1 Namespace conventions
    1.2 Error management
2 The http:send-request function
3 Sending a request
    3.1 The request elements
    3.2 Serializing the request content
    3.3 Authentication
4 Dealing with the response
    4.1 The result element
    4.2 Representing the result content
5 Content types handling

Appendices

A References
B Summary of Error Conditions


1 Introduction

TODO: Add @href to http:response, taking redirects into account.

1.1 Namespace conventions

The module defined by this document defines one function in the namespace http://expath.org/ns/http-client. In this document, the http prefix, when used, is bound to this namespace URI.

Error codes are defined in the namespace http://expath.org/ns/error. In this document, the err prefix, when used, is bound to this namespace URI.

1.2 Error management

Error conditions are identified by a code (a QName). When such an error condition is reached during the execution of the function, a dynamic error is thrown, with the corresponding error code (as if the standard XPath function error had been called). TODO: Have not been defined yet.

Error codes are defined throughout this specification. The HTTP protocol layer can raise an error for more reasons than can be enumerated here. In that case, if the error condition is not mentioned explicitly in the spec, the implementation must raise an error with an appropriate message [err:HC001].

2 The http:send-request function

This module defines an XPath extension function that sends an HTTP request and returns the corresponding response. It supports HTTP multi-part messages. Here is the signature of this function:

http:send-request($request as element(http:request)?,
                  $href as xs:string?,
                  $bodies as item()*) as item()+

Besides the three-parameter signature above, there are two other signatures that are convenient shortcuts (each corresponds to the full version in which the omitted parameters are set to the empty sequence). They are:

http:send-request($request as element(http:request)) as item()+
http:send-request($request as element(http:request)?,
                  $href as xs:string?) as item()+

3 Sending a request

The function defined in this module makes it possible to send a request to an HTTP server and receive the corresponding response. This section describes how the request is represented by the parameters of this function, and how they are used to generate the actual HTTP request to send.

3.1 The request elements

The http:request element represents all the information needed to send the HTTP request, so it is always possible to create such an element that carries everything needed for a particular request. For some of those values, though, an additional parameter can be used instead. For instance, some signatures define the parameter $href. If the value of this parameter is not the empty sequence, it is used instead of the value of the href attribute on the http:request element.

<http:request method = ncname
              href? = uri
              status-only? = boolean
              username? = string
              password? = string
              auth-method? = string
              send-authorization? = boolean
              override-media-type? = string
              follow-redirect? = boolean
              timeout? = integer>
   (http:header*,
     (http:multipart|
      http:body)?)
</http:request>
  • method is the HTTP verb to use, such as GET or POST. It is case-insensitive.

  • href is the URI the request has to be sent to. It can be overridden by the parameter $href

  • status-only controls what the response looks like; if it is true, only the status code and the headers are returned, and the content is not (no http:body nor http:multipart, nor the interpreted additional value in the returned sequence, see hereafter).

  • username, password, auth-method and send-authorization are used for authentication (see section 3.3).

  • override-media-type is a MIME type. It can be used only with http:request, and will override the Content-Type header returned by the server.

  • follow-redirect controls whether an HTTP redirect is automatically followed or not. If it is false, the HTTP redirect is returned as the response. If it is true (the default), the function tries to follow the redirect by sending the same request to the new address (including body, headers, and authentication credentials). At most one redirect is followed (there is no attempt to follow a redirect received in response to following a first redirect).

  • timeout is the maximum number of seconds to wait for the server to respond. If this time duration is reached, an error is thrown [err:HC006]. (TODO: Allow one to ask for an empty sequence instead?)

  • http:header represents an HTTP header, either in the http:request or in the http:response elements, as defined below.

  • http:multipart represents a multi-part body, either in a request or a response, as defined below.

  • http:body represents the body, either of a request or a response, as defined below. In a request, when it is empty, its content is given by the parameter $bodies (see section 3.2 for details).

The http:header element represents an HTTP header, either in a request or in a response:

<http:header name = string
             value = string/>

The http:body element represents the body of either an HTTP request or an HTTP response (in multipart requests and responses, it represents the body of a single part):

<http:body id? = string
           description? = string
           media-type = string
           src? = uri
           method? = "xml" | "html" | "xhtml" | "text" | "binary"
             | qname-but-not-ncname
           byte-order-mark? = "yes" | "no"
           cdata-section-elements? = qnames
           doctype-public? = string
           doctype-system? = string
           encoding? = string
           escape-uri-attributes? = "yes" | "no"
           indent? = "yes" | "no"
           normalization-form? = "NFC" | "NFD" | "NFKC" | "NFKD"
             | "fully-normalized" | "none" | nmtoken
           omit-xml-declaration? = "yes" | "no"
           standalone? = "yes" | "no" | "omit"
           suppress-indentation? = qnames
           undeclare-prefixes? = "yes" | "no"
           output-version? = nmtoken>
   any*
</http:body>

The media-type attribute is the MIME media type of the body part. It is mandatory. In a request, it is given by the user and is the default value of the Content-Type header if that header is not set explicitly. In a response, it is set by the implementation from the Content-Type header returned by the server. The src attribute can be used in a request to set the body content to the content of the linked resource, instead of using the children of the http:body element. When this attribute is used, media-type is the only other attribute allowed (and it must be present): the http:body element can have neither content nor any other attribute, otherwise this is an error [err:HC004].
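
As a non-normative illustration, the constraint on src can be sketched as follows in Python. The function name and the representation of an http:body as a dictionary of attributes plus a flag are assumptions made for this sketch only; the case of a missing media-type is not covered, because this document does not name an error code for it.

```python
def check_src_usage(attributes, has_children):
    """Return "err:HC004" if src is misused on an http:body, else None.

    attributes   -- dict mapping attribute names to values (hypothetical
                    representation of the attributes of an http:body)
    has_children -- True if the http:body element has any child node
    """
    if "src" not in attributes:
        # Without src, content and serialization attributes are allowed.
        return None
    # With src, media-type is the only other attribute allowed.
    extra = set(attributes) - {"src", "media-type"}
    if extra or has_children:
        return "err:HC004"
    return None
```

For instance, a body with src, media-type and indent is rejected, while the same body without indent is accepted.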

All the attributes, except src, are used to set the corresponding serialization parameter defined in [Serialization], as defined for the XPath 2.1 function fn:serialize() [F&O 1.1]. One difference is that the serialization parameter include-content-type does not make sense here, so it is not available on the http:body element (its value is always "yes"). The user can set those attributes on a request to control the way a body part is serialized. In a response, the implementation can provide some of them, but is not required to, if it has the corresponding information (some of them make no sense in a response and therefore never appear on a response element, for instance output-version).

The media-type and encoding attributes control how the content of this element is used to create the HTTP request (how it is serialized to the request content); see section 3.2 for details. The id attribute specifies the value of the HTTP header Content-ID, and description the value of the HTTP header Content-Description. The src attribute can be used in a request to set the body content to the content of the linked resource instead of using the children of the http:body element (children of this element and the src attribute are mutually exclusive).

The http:multipart element represents an HTTP multi-part request or response:

<http:multipart media-type = string
                boundary? = string>
   (http:header*,
    http:body)+
</http:multipart>

The media-type attribute is the media type of the whole request or response, and has to be a multipart media type (that is, its main type must be multipart). The boundary attribute is the boundary marker used to separate the parts in the message (the value of the attribute is prefixed with "--" to form the actual boundary marker in the request; conversely, this prefix is removed from the boundary marker in the response to set the value of the attribute).
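
As a non-normative illustration, the relation between the boundary attribute and the marker on the wire can be sketched as follows in Python. The function names and the representation of parts as (headers, body) pairs are assumptions of this sketch; the closing delimiter (the marker followed by "--") comes from the MIME multipart syntax, not from this document.

```python
CRLF = "\r\n"

def frame_multipart(boundary, parts):
    """Frame already-serialized parts into a multipart entity body.

    boundary -- value of the boundary attribute (without the "--" prefix)
    parts    -- list of (headers, body) pairs; headers is a list of
                (name, value) pairs and body is a string
    """
    marker = "--" + boundary          # the attribute value, prefixed with "--"
    lines = []
    for headers, body in parts:
        lines.append(marker)
        for name, value in headers:
            lines.append(f"{name}: {value}")
        lines.append("")              # blank line between headers and body
        lines.append(body)
    lines.append(marker + "--")       # closing delimiter, per MIME
    return CRLF.join(lines) + CRLF

def marker_to_boundary(marker):
    """Recover the boundary attribute value from a marker in a response."""
    if not marker.startswith("--"):
        raise ValueError("not a boundary marker: " + marker)
    return marker[2:]
```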

3.2 Serializing the request content

If the request can have content (one body or several body parts), it can be specified by the http:multipart element, the http:body element, and/or the parameter $bodies. For each body, the content of the HTTP body is generated as follows.

Except when its src attribute is present, an http:body element can have several attributes representing serialization parameters, as defined in [Serialization]. This spec additionally defines the method 'binary'; in this case the body content must be either an xs:hexBinary or an xs:base64Binary item, and no serialization parameter other than media-type can be set.

The default value of the serialization method depends on the media-type: it is 'xml' if it is an XML media type, 'html' if it is an HTML media type, 'xhtml' if it is application/xhtml+xml, 'text' if it is a textual media type, and 'binary' for any other case.
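
As a non-normative illustration, the choice of the default method can be sketched as follows in Python. The exact membership of each media type category is an assumption of this sketch (XML types taken from [RFC 3023] plus any "+xml" suffix, text/html as the only HTML type, and text/* as textual); the function name is hypothetical.

```python
# Assumed XML media types, after RFC 3023; "+xml" suffixes are handled below.
XML_TYPES = {
    "text/xml",
    "application/xml",
    "text/xml-external-parsed-entity",
    "application/xml-external-parsed-entity",
}

def default_method(media_type):
    """Return the default serialization method for a media type string."""
    # Drop parameters such as "; charset=utf-8" and normalize the case.
    mt = media_type.split(";", 1)[0].strip().lower()
    # XHTML is tested first: it ends with "+xml" and would otherwise
    # be classified as plain XML.
    if mt == "application/xhtml+xml":
        return "xhtml"
    if mt in XML_TYPES or mt.endswith("+xml"):
        return "xml"
    if mt == "text/html":
        return "html"
    if mt.startswith("text/"):
        return "text"
    return "binary"
```

The order of the tests matters here: text/xml is both textual and XML, and application/xhtml+xml is both XML and XHTML, so the most specific category is tested first.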

When a body element has empty content (i.e. it has no child node at all), its content is given by the parameter $bodies. In a single-part request, this parameter must have at most one item. If the body is empty, the parameter cannot be the empty sequence. In a multipart request, $bodies must have as many items as there are empty body elements. If there are three empty body elements, the content of the first of them is $bodies[1], and so on. The number of empty body elements must be equal to the number of items in $bodies.
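
As a non-normative illustration, the pairing of empty body elements with the items of $bodies can be sketched as follows in Python. The function name, the use of None to mark an empty body element, and the use of ValueError are assumptions of this sketch; this document does not name an error code for a count mismatch.

```python
def assign_bodies(body_contents, bodies):
    """Fill each empty body with the next item of $bodies, in order.

    body_contents -- one entry per http:body element, in document order;
                     None marks an element with no child node
    bodies        -- the items of the $bodies parameter, as a list
    """
    empty_count = sum(1 for content in body_contents if content is None)
    if empty_count != len(bodies):
        raise ValueError(
            f"{empty_count} empty body element(s) but {len(bodies)} item(s) in $bodies"
        )
    remaining = iter(bodies)
    # The n-th empty body element receives the n-th item of $bodies.
    return [next(remaining) if content is None else content
            for content in body_contents]
```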

3.3 Authentication

HTTP authentication when sending a request is controlled by the attributes username, password, auth-method and send-authorization on the http:request element. If username has a value, password and auth-method must have a value too. Conversely, if any one of the three other attributes has been set, username must be set too.

The attribute auth-method can be either "Basic" or "Digest", but other values can also be used, in an implementation-defined way. The handling of those attributes must conform to [RFC 2617]. If send-authorization is true (the default value is false) and the authentication method supports generating the Authorization header without a challenge, the request contains this header. By default, a non-authenticated request is sent first, and the credentials are sent in a second message only if the response is an authentication challenge.
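
As a non-normative illustration, the attribute constraints and the generation of the Authorization header without challenge can be sketched as follows in Python. The function names and the dictionary representation of the attributes are assumptions of this sketch, as is the use of UTF-8 to encode the credentials; only the Basic method is handled, since Digest needs values from the server challenge.

```python
import base64

OTHER_AUTH_ATTRS = ("password", "auth-method", "send-authorization")

def auth_attributes_valid(attrs):
    """Check the co-occurrence rules of the authentication attributes."""
    if "username" in attrs:
        return "password" in attrs and "auth-method" in attrs
    # Without username, none of the three other attributes may be set.
    return not any(name in attrs for name in OTHER_AUTH_ATTRS)

def basic_credentials(username, password):
    """Build the value of the Authorization header for the Basic method."""
    raw = f"{username}:{password}".encode("utf-8")  # encoding is an assumption
    return "Basic " + base64.b64encode(raw).decode("ascii")

def preemptive_authorization(attrs):
    """Return the header value to send without challenge, or None."""
    wanted = attrs.get("send-authorization", "false") in ("true", "1")
    if wanted and attrs.get("auth-method", "").lower() == "basic":
        return basic_credentials(attrs["username"], attrs["password"])
    # Default: send a non-authenticated request and wait for a challenge.
    return None
```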

4 Dealing with the response

After having sent the request to the HTTP server, the function waits for the response. It analyses it and returns a sequence representing this response. This sequence has an http:response element as its first item, which is followed by an additional item for each body or body part in the response.

4.1 The result element

<http:response status = integer
                  message = string>
   (http:header*,
     (http:multipart |
      http:body)?)
</http:response>

This is the first item returned by the function defined in this module. The status attribute is the HTTP status code returned by the server, and message is the message that comes with the status on the status line. The http:header elements are as defined for the request, but represent the response headers instead. The http:body and http:multipart elements are also as in the request, but http:body elements must be empty.

4.2 Representing the result content

Instead of being inserted within the http:response element, the content of each body is returned as a single item in the return sequence. The items appear (after the http:response element) in the same order as the http:body elements. For each body, this item is built from the HTTP response as follows.

If the status-only attribute has the value true (the default is false), the returned sequence contains only the http:response element (with the headers, and also the empty http:body or http:multipart elements, as if status-only were false); the following items, which represent the content of the bodies, are not generated from the HTTP response.

For each body that has to be interpreted, the following rules apply in order to build the corresponding item. If the body media type is a text media type, the item is a string containing the body content. If the media type is an XML media type, the content is parsed and the item is the resulting document node. If the media type is an HTML type, the content is tidied up and parsed (this process is implementation-dependent) and the item is the resulting document node. If it is a binary media type, the content is returned as a base64Binary item. From the previous rules, a result item can therefore be a document node (from XML or HTML), a string, or a base64Binary.

When the type of a part is either XML or HTML, its body has to be parsed into a document node. If there is any error when parsing the content, an error is raised with an appropriate message [err:HC002].
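
As a non-normative illustration, the interpretation of a response body can be sketched as follows in Python. The names, the media type categories and the default charset are assumptions of this sketch. An ElementTree element stands in for the document node, and HTML is left out because tidying is implementation-dependent.

```python
import base64
import xml.etree.ElementTree as ET

# Assumed XML media types; "+xml" suffixes are handled in the function.
XML_TYPES = {"text/xml", "application/xml"}

class HttpClientError(Exception):
    """Hypothetical carrier for the error codes of this document."""
    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code

def interpret_body(media_type, raw, charset="utf-8"):
    """Build the result item for one body from its raw bytes."""
    mt = media_type.split(";", 1)[0].strip().lower()
    # XML is tested before text, because text/xml belongs to both.
    if mt in XML_TYPES or mt.endswith("+xml"):
        try:
            return ET.fromstring(raw)
        except ET.ParseError as exc:
            raise HttpClientError("err:HC002", str(exc)) from exc
    if mt == "text/html":
        raise NotImplementedError("HTML tidying is implementation-dependent")
    if mt.startswith("text/"):
        return raw.decode(charset)
    # Any other type is treated as binary and returned base64-encoded.
    return base64.b64encode(raw).decode("ascii")
```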

If the attribute override-media-type is set on the request, its value is used instead of the Content-Type returned by the HTTP server. If the Content-Type of the response is a multipart type, the value of override-media-type can only be a multipart type or application/octet-stream (to get the raw entity as a binary item). If it is not, this is an error [err:HC003]. (TODO: how does it fit with multipart responses?)
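
As a non-normative illustration, the override-media-type check can be sketched as follows in Python. The function name and the use of ValueError to carry the error code are assumptions of this sketch.

```python
def effective_media_type(response_content_type, override=None):
    """Return the media type used to interpret the response.

    response_content_type -- Content-Type returned by the server
    override              -- value of override-media-type, or None
    """
    if override is None:
        return response_content_type

    def main_type(value):
        # Drop parameters (e.g. the boundary) and normalize the case.
        return value.split(";", 1)[0].strip().lower()

    if main_type(response_content_type).startswith("multipart/"):
        wanted = main_type(override)
        if not (wanted.startswith("multipart/")
                or wanted == "application/octet-stream"):
            raise ValueError("err:HC003: invalid override for a multipart response")
    return override
```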

5 Content types handling

In both requests and responses, MIME type strings are used to choose the way the entity content has to be serialized or parsed, respectively. Four different kinds of type are defined here, which are used in the above text about sending requests and receiving responses. The intent is to convey the spirit of how the entity content is handled depending on its content type, but an implementation is encouraged to deviate from those rules if it is obvious that a particular type should be treated in a specific way (normally, that would be the case only to treat a binary type as another type).

A References

The structure of most of the elements and most of the attributes used in this candidate is inspired by the corresponding step in [XProc].

HTML Tidy
HTML Tidy Library Project. SourceForge project.
RFC 1521
RFC 1521: MIME (Multipurpose Internet Mail Extensions) Part One: Mechanisms for Specifying and Describing the Format of Internet Message Bodies. N. Borenstein, N. Freed, editors. Internet Engineering Task Force. September, 1993.
RFC 2616
RFC 2616: Hypertext Transfer Protocol -- HTTP/1.1. R. Fielding, J. Gettys, J. Mogul, et al., editors. Internet Engineering Task Force. June, 1999.
RFC 2617
RFC 2617: HTTP Authentication: Basic and Digest Access Authentication. J. Franks, P. Hallam-Baker, J. Hostetler, S. Lawrence, P. Leach, A. Luotonen, L. Stewart. June, 1999.
RFC 3023
RFC 3023: XML Media Types. M. Murata, S. St. Laurent, and D. Kohn, editors. Internet Engineering Task Force. January, 2001.
RFC 4918
RFC 4918: HTTP Extensions for Web Distributed Authoring and Versioning (WebDAV). L. Dusseault, editor. Internet Engineering Task Force. June, 2007.
Serialization
XSLT 2.0 and XQuery 1.0 Serialization. Scott Boag, Michael Kay, Joanne Tong, Norman Walsh, and Henry Zongaro, editors. W3C Recommendation. 23 January 2007.
F&O 1.1
XPath and XQuery Functions and Operators 1.1. Michael Kay, editor. W3C Working Draft. 15 January 2009.
TagSoup
TagSoup - Just Keep On Truckin'. John Cowan.
XProc
XProc: An XML Pipeline Language. N. Walsh, A. Milowski, and H. S. Thompson, editors. W3C Candidate Recommendation. 26 November 2008.
XSLT 2.0
XSL Transformations (XSLT) Version 2.0. Michael Kay, editor. W3C Recommendation. 23 January 2007.

B Summary of Error Conditions

err:HC001
An HTTP error occurred.
err:HC002
Error parsing the entity content as XML or HTML.
err:HC003
With a multipart response, the override-media-type must be either a multipart media type or application/octet-stream.
err:HC004
The src attribute on the body element is mutually exclusive with all other attributes (except media-type).
err:HC005
The request element is not valid.
err:HC006
A timeout occurred waiting for the response.