compare.txt   compare.txt 
Network Working Group C. Newman Network Working Group C. Newman
Internet-Draft Sun Microsystems Internet-Draft Sun Microsystems
Expires: April 26, 2004 October 27, 2003 Expires: February 2, 2007 M. Duerst
AGU
A. Gulbrandsen
Oryx
August 1, 2006
Internet Application Protocol Collation Registry Internet Application Protocol Collation Registry
draft-newman-i18n-comparator-01.txt draft-newman-i18n-comparator-13.txt
Status of this Memo Status of this Memo
This document is an Internet-Draft and is in full conformance with By submitting this Internet-Draft, each author represents that any
all provisions of Section 10 of RFC2026. applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that other Task Force (IETF), its areas, and its working groups. Note that
groups may also distribute working documents as Internet-Drafts. other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at http:// The list of current Internet-Drafts can be accessed at
www.ietf.org/ietf/1id-abstracts.txt. http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
This Internet-Draft will expire on April 26, 2004. This Internet-Draft will expire on February 2, 2007.
Copyright Notice Copyright Notice
Copyright (C) The Internet Society (2003). All Rights Reserved. Copyright (C) The Internet Society (2006).
Abstract Abstract
Many Internet application protocols include string-based lookup, Many Internet application protocols include string-based lookup,
searching, or sorting operations. However the problem space for searching, or sorting operations. However the problem space for
searching and sorting international strings is large, not fully searching and sorting international strings is large, not fully
explored, and is outside the area of expertise for the Internet explored, and is outside the area of expertise for the Internet
Engineering Task Force (IETF). Rather than attempt to solve such a Engineering Task Force (IETF). Rather than attempt to solve such a
large problem, this specification creates an abstraction framework so large problem, this specification creates an abstraction framework so
that application protocols can precisely identify a comparison that application protocols can precisely identify a comparison
function and the repertoire of comparison functions can be extended function and the repertoire of comparison functions can be extended
in the future. in the future.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1 Conventions Used in this Document . . . . . . . . . . . . . 3 1.1. Conventions Used in this Document . . . . . . . . . . . . 4
2. Collation Definition and Purpose . . . . . . . . . . . . . . 3 2. Collation Definition and Purpose . . . . . . . . . . . . . . . 4
3. Collation Name Syntax . . . . . . . . . . . . . . . . . . . 4 2.1. Definition . . . . . . . . . . . . . . . . . . . . . . . 4
4. Collation Specification Requirements . . . . . . . . . . . . 6 2.2. Purpose . . . . . . . . . . . . . . . . . . . . . . . . . 4
5. Application Protocol Requirements . . . . . . . . . . . . . 8 2.3. Some Other Terms Used in this Document . . . . . . . . . 5
6. Initial Collations . . . . . . . . . . . . . . . . . . . . . 9 2.4. Sort Keys . . . . . . . . . . . . . . . . . . . . . . . . 5
6.1 Octet Collation . . . . . . . . . . . . . . . . . . . . . . 9 3. Collation Identifier Syntax . . . . . . . . . . . . . . . . . 6
6.2 ASCII Numeric Collation . . . . . . . . . . . . . . . . . . 10 3.1. Basic Syntax . . . . . . . . . . . . . . . . . . . . . . 6
6.3 ASCII Casemap Collation . . . . . . . . . . . . . . . . . . 10 3.2. Wildcards . . . . . . . . . . . . . . . . . . . . . . . . 6
6.4 Nameprep Collation . . . . . . . . . . . . . . . . . . . . . 11 3.3. Ordering Direction . . . . . . . . . . . . . . . . . . . 6
6.5 Basic Collation . . . . . . . . . . . . . . . . . . . . . . 12 3.4. URIs . . . . . . . . . . . . . . . . . . . . . . . . . . 7
7. Use by ACAP and Sieve . . . . . . . . . . . . . . . . . . . 14 3.5. Naming Guidelines . . . . . . . . . . . . . . . . . . . . 7
8. IANA Considerations . . . . . . . . . . . . . . . . . . . . 14 4. Collation Specification Requirements . . . . . . . . . . . . . 8
8.1 Collation Registration Procedure . . . . . . . . . . . . . . 14 4.1. Collation/Server Interface . . . . . . . . . . . . . . . 8
8.2 Collation Registration Template . . . . . . . . . . . . . . 15 4.2. Operations Supported . . . . . . . . . . . . . . . . . . 8
8.3 Octet Collation Registration . . . . . . . . . . . . . . . . 16 4.2.1. Validity . . . . . . . . . . . . . . . . . . . . . . . 8
8.4 ASCII Numeric Collation Registration . . . . . . . . . . . . 16 4.2.2. Equality . . . . . . . . . . . . . . . . . . . . . . . 9
8.5 Legacy English Casemap Collation Registration . . . . . . . 16 4.2.3. Substring . . . . . . . . . . . . . . . . . . . . . . 9
8.6 English Casemap Collation Registration . . . . . . . . . . . 16 4.2.4. Ordering . . . . . . . . . . . . . . . . . . . . . . . 10
8.7 Nameprep Collation Registration . . . . . . . . . . . . . . 17 4.3. Sort Keys . . . . . . . . . . . . . . . . . . . . . . . . 10
8.8 Basic Collation Registration . . . . . . . . . . . . . . . . 17 4.4. Use of Lookup Tables . . . . . . . . . . . . . . . . . . 10
8.9 Basic Accent Sensitive Match Collation Registration . . . . 17 5. Application Protocol Requirements . . . . . . . . . . . . . . 11
8.10 Basic Case Sensitive Match Collation Registration . . . . . 18 5.1. Character Encoding . . . . . . . . . . . . . . . . . . . 11
8.11 Structure of Collation Registry . . . . . . . . . . . . . . 18 5.2. Operations . . . . . . . . . . . . . . . . . . . . . . . 11
8.12 Example Initial Registry Summary . . . . . . . . . . . . . . 19 5.3. Wildcards . . . . . . . . . . . . . . . . . . . . . . . . 12
9. DTD for Collation Registration . . . . . . . . . . . . . . . 19 5.4. Canonicalization Function . . . . . . . . . . . . . . . . 12
10. Guidelines for Expert Reviewer . . . . . . . . . . . . . . . 20 5.5. Disconnected Clients . . . . . . . . . . . . . . . . . . 12
11. Security Considerations . . . . . . . . . . . . . . . . . . 21 5.6. Error Codes . . . . . . . . . . . . . . . . . . . . . . . 12
12. Open Issues . . . . . . . . . . . . . . . . . . . . . . . . 21 5.7. Octet Collation . . . . . . . . . . . . . . . . . . . . . 13
13. Changes From -00 . . . . . . . . . . . . . . . . . . . . . . 22 6. Use by Existing Protocols . . . . . . . . . . . . . . . . . . 13
Normative References . . . . . . . . . . . . . . . . . . . . 22 7. Collation Registration . . . . . . . . . . . . . . . . . . . . 13
Informative References . . . . . . . . . . . . . . . . . . . 23 7.1. Collation Registration Procedure . . . . . . . . . . . . 13
Author's Address . . . . . . . . . . . . . . . . . . . . . . 24 7.2. Collation Registration Format . . . . . . . . . . . . . . 14
Intellectual Property and Copyright Statements . . . . . . . 25 7.2.1. Registration Template . . . . . . . . . . . . . . . . 14
7.2.2. The collation Element . . . . . . . . . . . . . . . . 15
7.2.3. The identifier Element . . . . . . . . . . . . . . . . 15
7.2.4. The title Element . . . . . . . . . . . . . . . . . . 15
7.2.5. The operations Element . . . . . . . . . . . . . . . . 15
7.2.6. The specification Element . . . . . . . . . . . . . . 15
7.2.7. The submitter Element . . . . . . . . . . . . . . . . 16
7.2.8. The owner Element . . . . . . . . . . . . . . . . . . 16
7.2.9. The version Element . . . . . . . . . . . . . . . . . 16
7.2.10. The variable Element . . . . . . . . . . . . . . . . . 16
7.2.11. The name Element . . . . . . . . . . . . . . . . . . . 16
7.2.12. The default Element . . . . . . . . . . . . . . . . . 16
7.2.13. The value Element . . . . . . . . . . . . . . . . . . 17
7.3. Structure of Collation Registry . . . . . . . . . . . . . 17
7.4. Example Initial Registry Summary . . . . . . . . . . . . 18
8. Guidelines for Expert Reviewer . . . . . . . . . . . . . . . . 18
9. Initial Collations . . . . . . . . . . . . . . . . . . . . . . 19
9.1. ASCII Numeric Collation . . . . . . . . . . . . . . . . . 19
9.1.1. ASCII Numeric Collation Description . . . . . . . . . 19
9.1.2. ASCII Numeric Collation Registration . . . . . . . . . 20
9.2. ASCII Casemap Collation . . . . . . . . . . . . . . . . . 20
9.2.1. ASCII Casemap Collation Description . . . . . . . . . 20
9.2.2. ASCII Casemap Collation Registration . . . . . . . . . 21
9.3. Nameprep Collation . . . . . . . . . . . . . . . . . . . 21
9.3.1. Nameprep Collation Description . . . . . . . . . . . . 21
9.3.2. Nameprep Collation Registration . . . . . . . . . . . 22
9.4. Octet Collation . . . . . . . . . . . . . . . . . . . . . 22
9.4.1. Octet Collation Description . . . . . . . . . . . . . 22
9.4.2. Octet Collation Registration . . . . . . . . . . . . . 23
10. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 23
11. Security Considerations . . . . . . . . . . . . . . . . . . . 23
12. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 23
13. Open Issues . . . . . . . . . . . . . . . . . . . . . . . . . 23
14. Change Log . . . . . . . . . . . . . . . . . . . . . . . . . . 24
14.1. Changes From -12 . . . . . . . . . . . . . . . . . . . . 24
14.2. Changes From -11 . . . . . . . . . . . . . . . . . . . . 24
14.3. Changes From -10 . . . . . . . . . . . . . . . . . . . . 24
14.4. Changes From -09 . . . . . . . . . . . . . . . . . . . . 24
14.5. Changes From -08 . . . . . . . . . . . . . . . . . . . . 25
14.6. Changes From -06 . . . . . . . . . . . . . . . . . . . . 26
14.7. Changes From -05 . . . . . . . . . . . . . . . . . . . . 26
14.8. Changes From -04 . . . . . . . . . . . . . . . . . . . . 26
14.9. Changes From -03 . . . . . . . . . . . . . . . . . . . . 26
14.10. Changes From -02 . . . . . . . . . . . . . . . . . . . . 27
14.11. Changes From -01 . . . . . . . . . . . . . . . . . . . . 27
14.12. Changes From -00 . . . . . . . . . . . . . . . . . . . . 27
15. References . . . . . . . . . . . . . . . . . . . . . . . . . . 28
15.1. Normative References . . . . . . . . . . . . . . . . . . 28
15.2. Informative References . . . . . . . . . . . . . . . . . 28
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 30
Intellectual Property and Copyright Statements . . . . . . . . . . 31
1. Introduction 1. Introduction
The ACAP [11] specification introduced the concept of a comparator The ACAP [12] specification introduced the concept of a comparator
(which we call collation in this document), but failed to create an (which we call collation in this document), but failed to create an
IANA registry. With the introduction of stringprep [6] and the IANA registry. With the introduction of stringprep [6] and the
Unicode Collation Algorithm [8], it is now time to create that Unicode Collation Algorithm [8], it is now time to create that
registry and populate it with some initial values appropriate for an registry and populate it with some initial values appropriate for an
international community. This specification replaces and generalizes international community. This specification replaces and generalizes
the definition of a comparator in ACAP and creates a collation the definition of a comparator in ACAP and creates a collation
registry. registry.
1.1 Conventions Used in this Document 1.1. Conventions Used in this Document
The key words "MUST", "MUST NOT", "SHOULD", "SHOULD NOT", and "MAY" The key words "MUST", "MUST NOT", "SHOULD", "SHOULD NOT", and "MAY"
in this document are to be interpreted as defined in "Key words for in this document are to be interpreted as defined in "Key words for
use in RFCs to Indicate Requirement Levels" [1]. use in RFCs to Indicate Requirement Levels" [1].
The attribute syntax specifications use the Augmented Backus-Naur The attribute syntax specifications use the Augmented Backus-Naur
Form (ABNF) [2] notation including the core rules defined in Appendix Form (ABNF) [2] notation including the core rules defined in Appendix
A. This also inherits ABNF rules from Language Tags [5]. A. This also inherits ABNF rules from Language Tags [5].
2. Collation Definition and Purpose
2. Collation Definition and Purpose 2.1. Definition
A collation is a named function which takes two arbitrary length A collation is a named function which takes two arbitrary length
octet strings (encoded in UTF-8 [3] for collations which operate on strings as input and can be used to perform one or more of three
characters) as input and can be used to perform one or more of three basic comparison operations: equality test, substring match, and
basic comparison operations: equality test, substring match and
ordering test. ordering test.
Collations provide a multi-protocol abstraction layer for comparison 2.2. Purpose
functions so the details of a particular comparison operation can be
specified by someone with appropriate expertise independent of the
application protocol that consumes that collation. This is similar
to the way a charset [14] separates the details of octet to character
mapping from a protocol specification such as MIME [9] or the way
SASL [10] separates the details of an authentication mechanism from a
protocol specification such as ACAP [11].
Here a small diagram to help illustrate the value of this abstraction Collations provide an abstraction layer for comparison functions so that these
layer: comparison functions can be used in multiple protocols. The details
of a particular comparison operation can be specified by someone with
+-----------------+ appropriate expertise independent of the application protocols that
| Octet | use that collation. This is similar to the way a charset [14]
+-------------------+ +--| Collation Spec | separates the details of octet to character mapping from a protocol
| IMAP i18n SEARCH |--+ | +-----------------+ specification such as MIME [10] or the way SASL [11] separates the
details of an authentication mechanism from a protocol specification
such as ACAP [12].
Here is a small diagram to help illustrate the value of this
abstraction layer:
+-------------------+ +-----------------+
| IMAP i18n SEARCH |--+ | Basic |
+-------------------+ | +--| Collation Spec |
| | +-----------------+
+-------------------+ | +-------------+ | +-----------------+ +-------------------+ | +-------------+ | +-----------------+
+--| Collation |--+--| A stringprep | | ACAP i18n SEARCH |--+--| Collation |--+--| A stringprep |
+-------------------+ | | Registry | | | Collation Spec | +-------------------+ | | Registry | | | Collation Spec |
| ACAP i18n SEARCH |--+ +-------------+ | +-----------------+ | +-------------+ | +-----------------+
+-------------------+ | +-----------------+ +-------------------+ | | +-----------------+
| | locale-specific | | ...other protocol |--+ | | locale-specific |
+--| Collation Spec | +-------------------+ +--| Collation Spec |
+-----------------+ +-----------------+
Thus IMAP, ACAP and future application protocols with international Thus IMAP, ACAP and future application protocols with international
search capability simply specify how to interface to the collation search capability simply specify how to interface to the collation
registry instead of each protocol spec having to specify all the registry instead of each protocol specification having to specify all
collations it supports. the collations it supports.
2.3. Some Other Terms Used in this Document
The terms client, server and protocol are used in somewhat unusual
senses.
Client means a user, or a program acting directly on behalf of a
user. This may be a mail reader acting as an IMAP client, or it may
be an interactive shell where the user can type protocol directly, or
it may be a script or program written by the user.
Server means a program that performs services requested by the
client. This may be a traditional server such as an HTTP server, or
it may be a Sieve [15] interpreter running a Sieve script written by
a user. A server needs to use the operations provided by collations
in order to fulfil the client's requests.
The protocol describes how the client tells the server what it wants
done, and (if applicable) how the server tells the client about the
results. IMAP is a protocol by this definition, and so is the Sieve
language.
2.4. Sort Keys
One component of a collation is a transformation which turns a string
into a sort key, which is then used while sorting.
The transformation can range from an identity mapping (e.g., the
i;octet collation, Section 9.4) to a mapping which makes the string
unreadable to a human.
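
The range of transformations described above can be sketched in Python. This is only an illustration: the function names are hypothetical, and the case-folding key is a toy mapping chosen for the example, not the algorithm of any registered collation.

```python
def identity_sort_key(s: bytes) -> bytes:
    """Identity mapping: the sort key is the input octet string itself."""
    return s


def toy_casefold_sort_key(s: bytes) -> bytes:
    """Hypothetical transformation: map ASCII a-z to A-Z, leave other octets alone."""
    return bytes(b - 32 if 0x61 <= b <= 0x7A else b for b in s)


words = [b"pear", b"apple", b"Banana"]

# Sorting on the identity key compares raw octets, so "B" (0x42) sorts first.
by_octets = sorted(words, key=identity_sort_key)

# Sorting on the folded key ignores ASCII case.
by_folded = sorted(words, key=toy_casefold_sort_key)
```

The sort keys themselves stay inside the implementation; only the resulting order is visible to a caller.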
One component of a collation is a canonicalization function which can This is an implementation detail of collations or servers. A
be pre-applied to single strings and may enhance the performance of protocol SHOULD NOT expose it, since some collations leave the sort
subsequent comparison operations. Normally, this is an key's format up to the implementation, and current conformant
implementation detail of collations, but at times it may be useful implementations are known to use different formats.
for an application protocol to expose collation canonicalization over
protocol. Collation canonicalization can range from an identity 3. Collation Identifier Syntax
mapping (e.g., the i;octet collation) to a mapping which makes the
string unreadable to a human (e.g., the basic collation). 3.1. Basic Syntax
3. Collation Name Syntax The collation identifier itself is a single US-ASCII string beginning
with a letter and made up of letters, digits, and one of the
The collation name itself is a single US-ASCII string beginning with following 4 symbols: "-", ";", "=" and ".". The identifier MUST NOT
a letter and made up of letters, digits, or one of the following 4 be longer than 254 characters.
symbols: "-", ";", "=" or ".". The name MUST NOT be longer than 254
characters.
collation-char = ALPHA / DIGIT / "-" / ";" / "=" / "." collation-char = ALPHA / DIGIT / "-" / ";" / "=" / "."
collation-name = ALPHA *253collation-char collation-id = ALPHA *253collation-char
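
As a rough sketch, the ABNF above can be checked with a regular expression. The function name is_collation_id is an assumption made for this example; only the grammar itself comes from the text.

```python
import re

# ALPHA followed by at most 253 collation-chars
# (ALPHA / DIGIT / "-" / ";" / "=" / "."), i.e. at most 254 characters.
COLLATION_ID = re.compile(r"[A-Za-z][A-Za-z0-9;=.\-]{0,253}")


def is_collation_id(s: str) -> bool:
    """Return True if s is syntactically a collation identifier."""
    return COLLATION_ID.fullmatch(s) is not None
```

A syntactically valid identifier need not name a registered collation; this check covers syntax only.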
The string a client uses to select a collation MAY contain a wildcard The identifier "default" is reserved. For protocols which have a
("*") character which matches zero or more collation-chars. Wildcard default collation, "default" refers to that collation. For other
characters MUST NOT be adjacent. Clients which support disconnected protocols, the identifier "default" matches no collations, and
operation SHOULD NOT use wildcards to select a collation, but clients servers SHOULD treat it in the same way as they treat nonexistent
which provide collation operations only when connected to the server collations.
MAY use wildcards. If the wildcard string matches multiple
collations, the server SHOULD select the collation with the broadest 3.2. Wildcards
scope (preferably international scope), the most recent table
versions and the greatest number of supported operations. A single The string a client uses to select a collation MAY contain one or
wildcard character ("*") refers to the application protocol collation more wildcard ("*") characters, each of which matches zero or more
behavior that would occur if no explicit negotiation were used. collation-chars. Wildcard characters MUST NOT be adjacent. If the wildcard
string matches multiple collations, the server SHOULD select the
When used as a protocol element for ordering, the collation name MAY collation with the broadest scope (preferably international scope),
be prefixed by either "+" or "-" to explicitly specify an ordering the most recent table versions and the greatest number of supported
direction. As mentioned previously, "+" has no effect on the operations.
ordering function, while "-" negates the result of the ordering
function. In general, collation-order is used when a client requests
a collation, and collation-sel is used with the server informs the
client of the selected collation.
collation-wild = ("*" / (ALPHA ["*"])) *(collation-char ["*"]) collation-wild = ("*" / (ALPHA ["*"])) *(collation-char ["*"])
; MUST NOT exceed 255 characters total ; MUST NOT exceed 254 characters total
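
One possible way to evaluate a wildcard string is to translate it into a regular expression, as sketched below. The helper names and the list of known identifiers are assumptions for illustration. The sketch only lists the candidates and does not implement the server's choice among multiple matches (scope, table versions, number of operations).

```python
import re

# One collation-char, as defined by the ABNF.
CHAR_CLASS = r"[A-Za-z0-9;=.\-]"


def wildcard_to_regex(pattern: str):
    """Translate a collation-wild string into a compiled regular expression."""
    if "**" in pattern:
        raise ValueError("wildcard characters must not be adjacent")
    # "*" matches zero or more collation-chars; everything else is literal.
    parts = [CHAR_CLASS + "*" if ch == "*" else re.escape(ch) for ch in pattern]
    return re.compile("".join(parts))


def matching_collations(pattern: str, known):
    """Return the identifiers in known that the wildcard string matches."""
    rx = wildcard_to_regex(pattern)
    return [c for c in known if rx.fullmatch(c)]


known_ids = ["i;octet", "i;ascii-casemap", "i;ascii-numeric"]
ascii_matches = matching_collations("i;ascii-*", known_ids)
```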
collation-sel = ["+" / "-"] collation-name 3.3. Ordering Direction
When used as a protocol element for ordering, the collation
identifier MAY be prefixed by either "+" or "-" to explicitly specify
an ordering direction. "+" has no effect on the ordering operation,
while "-" inverts the result of the ordering operation. In general,
collation-order is used when a client requests a collation, and
collation-selected is used when the server informs the client of the
selected collation.
collation-selected = ["+" / "-"] collation-id
collation-order = ["+" / "-"] collation-wild collation-order = ["+" / "-"] collation-wild
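
The effect of the direction prefix can be sketched as follows. The sketch assumes the ordering operation is modelled as returning -1, 0 or +1, and uses a plain octet comparison as a stand-in ordering; the function names are hypothetical.

```python
from functools import cmp_to_key


def parse_collation_order(s: str):
    """Split an optional "+" or "-" prefix from the rest of the string."""
    if s[:1] in ("+", "-"):
        return s[0], s[1:]
    return "+", s  # no prefix behaves like "+"


def apply_direction(direction: str, result: int) -> int:
    """ "+" leaves the ordering result alone, "-" inverts it."""
    return -result if direction == "-" else result


def octet_compare(a: bytes, b: bytes) -> int:
    """Stand-in ordering operation: -1, 0 or +1 by raw octet comparison."""
    return (a > b) - (a < b)


direction, collation_id = parse_collation_order("-i;octet")
items = [b"b", b"a", b"c"]
descending = sorted(
    items,
    key=cmp_to_key(lambda x, y: apply_direction(direction, octet_compare(x, y))),
)
```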
Some protocols are designed to use URIs to refer to collations rather 3.4. URIs
than simple tokens. A special section of the IANA web page is
Some protocols are designed to use URIs [4] to refer to collations
rather than simple tokens. A special section of the IANA web page is
reserved for such usage. The "collation-uri" form is used to refer reserved for such usage. The "collation-uri" form is used to refer
to a specific IANA registry entry for a specific named collation (the to a specific IANA registry entry for a specific named collation (the
collation registration may not actually be present if it is collation registration may not actually be present if it is
experimental). The "collation-auri" form is an abstract name for an experimental). The "collation-auri" form is an abstract name for an
ordering, a comparator pattern or a vendor private comparator. ordering, a collation pattern or a vendor private collator.
collation-uri = "http://www.iana.org/assignments/collation/" collation-uri = "http://www.iana.org/assignments/collation/"
collation-name ".xml" collation-id ".xml"
collation-auri = ( "http://www.iana.org/assignments/collation/" collation-auri = ( "http://www.iana.org/assignments/collation/"
collation-order [".xml"]) / other-uri collation-order ".xml" ) / other-uri
other-uri = absoluteURI other-uri = <absoluteURI>
; excluding the IANA collation namespace. ; excluding the IANA collation namespace.
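
A small sketch of the collation-uri form, assuming hypothetical helper names. Only the base URL and the ".xml" suffix come from the grammar above.

```python
IANA_COLLATION_BASE = "http://www.iana.org/assignments/collation/"


def collation_uri(collation_id: str) -> str:
    """Build the collation-uri for a collation identifier."""
    return IANA_COLLATION_BASE + collation_id + ".xml"


def collation_id_from_uri(uri: str):
    """Recover the identifier from a collation-uri, or None for any other URI."""
    if uri.startswith(IANA_COLLATION_BASE) and uri.endswith(".xml"):
        return uri[len(IANA_COLLATION_BASE):-len(".xml")]
    return None


octet_uri = collation_uri("i;octet")
```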
3.5. Naming Guidelines
While this specification makes no absolute requirements on the While this specification makes no absolute requirements on the
structure of collation names, naming consistency is important, so the structure of collation identifiers, naming consistency is important,
following initial guidelines are provided. so the following initial guidelines are provided.
Collation names with an international audience typically begin with Collation identifiers with an international audience typically begin
"i;". Collation names intended for a particular language or locale with "i;". Collation identifiers intended for a particular language
typically begin with a language tag [5] followed by a ";". After the or locale typically begin with a language tag [5] followed by a ";".
first ";" is normally the name of the general collation algorithm After the first ";" is normally the name of the general collation
followed by a series of algorithm modifications separated by the ";" algorithm, followed by a series of algorithm modifications separated
delimiter. Parameterized modifications will use "=" to delimit the by the ";" delimiter. Parameterized modifications will use "=" to
parameter from the value. The version numbers of any lookup tables delimit the parameter from the value. The version numbers of any
used by the algorithm SHOULD be present as parameterized lookup tables used by the algorithm SHOULD be present as
modifications. parameterized modifications.
Collation names of the form *;vnd-domain.com;* are reserved for Collation identifiers of the form *;vnd-domain.com;* are reserved for
vendor-specific collations created by the owner of the domain name vendor-specific collations created by the owner of the domain name
following the "vnd-" prefix. Registration of such collations (or the following the "vnd-" prefix (e.g. vnd-example.com for the vendor
name space as a whole) with intended use of "Vendor" is encouraged example.com). Registration of such collations (or the name space as
when a public specification or open-source implementation is a whole) with intended use of "Vendor" is encouraged when a public
available, but is not required. specification or open-source implementation is available, but is not
required.
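
Because these are guidelines rather than requirements, any parser for the structure is best-effort. The sketch below splits an identifier along the suggested lines. The function name and the identifier "en;example-alg;strength=2;tables=1.0" are invented for illustration and do not refer to a registered collation.

```python
def split_collation_id(collation_id: str):
    """Split an identifier into (scope, algorithm, modifications).

    Follows the naming guidelines only; identifiers are not required
    to have this structure.
    """
    scope, _, rest = collation_id.partition(";")
    parts = rest.split(";") if rest else []
    algorithm = parts[0] if parts else ""
    modifications = {}
    for part in parts[1:]:
        name, sep, value = part.partition("=")
        # Parameterized modifications use "=", plain ones carry no value.
        modifications[name] = value if sep else None
    return scope, algorithm, modifications


example = split_collation_id("en;example-alg;strength=2;tables=1.0")
```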
4. Collation Specification Requirements
4.1. Collation/Server Interface
The collation itself defines what it operates on. Most collations
are expected to operate on character strings. The i;octet
(Section 9.4) collation operates on octet strings. The
i;ascii-numeric (Section 9.1) collation operates on numbers.
This specification defines the collation interface in terms of octet
strings. However, implementations may choose to use character
strings instead. Such implementations may not be able to implement
e.g. i;octet. Since i;octet is not currently mandatory to implement
for any protocol, this should not be a problem.
4. Collation Specification Requirements 4.2. Operations Supported
A collation specification MUST state which of the three basic A collation specification MUST state which of the three basic
functions are supported (equality, substring, ordering) and how to operations are supported (equality, substring, ordering) and how to
perform each of the supported functions on any two input perform each of the supported operations on any two input character
octet-strings including empty strings. Given a collation with a strings including empty strings. Collations must be deterministic,
specific name, and any two fixed input strings, the result MUST be i.e. given a collation with a specific identifier, and any two fixed
the same. The collation specification MUST state whether the input strings, the result MUST be the same for the same operation.
collation operates on raw octets or on characters (in which case the
UTF-8 charset is presumed). Collations MUST be transitive. In general, collation operations should behave as their names
suggest. While a collation may be new, the operations are not, so
A collation specification MUST describe the internal canonicalization the new collation's operations should be similar to those of older
algorithm. This algorithm can be applied to individual strings and collations. For example, a date/time collation should not provide a
the result strings can be stored to potentially optimize future "substring" operation that would morph IMAP substring SEARCH into
comparison operations. A collation MAY specify that the e.g. a date-range search.
canonicalization algorithm is the identity function. The output of
the canonicalization algorithm MAY have no meaning to a human. A nonobvious consequence of the rules for each collation operation is
that for any single collation, either none or all of the operations
Collations which use more than one customizable lookup table in a can return "undefined". For example, it is not possible to have an
documented format MUST assign numbers to the tables they use. This equality operation that never returns "undefined" and a substring
permits an application protocol command to access the tables used by operation that occasionally does.
a server collation.
4.2.1. Validity
o The equality function always returns "match" or "no-match" when
supplied valid input and MAY return "error" if the input strings The validity test takes one string as argument and returns valid if its
are not valid UTF-8 strings or violate other collation input string is valid input to the collation's other operations, and
constraints. invalid if not. (In other words, a string is valid if it is equal to
itself according to the collation's equality operation.)
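
The relationship between validity and equality can be sketched with a toy collation. The collation here is hypothetical: it accepts only 7-bit ASCII octets and ignores ASCII case. The result strings "match", "no-match" and "undefined" are a modelling choice for the example.

```python
def toy_equal(a: bytes, b: bytes) -> str:
    """Equality for a hypothetical collation that accepts only 7-bit ASCII."""
    if any(octet > 0x7F for octet in a + b):
        return "undefined"  # at least one input is not valid for this collation
    return "match" if a.lower() == b.lower() else "no-match"


def is_valid(s: bytes) -> bool:
    """A string is valid if it is equal to itself under the equality operation."""
    return toy_equal(s, s) != "undefined"
```

Deriving validity from equality in this way means the two operations cannot disagree about which inputs are acceptable.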
o The substring matching function determines if the first string is
a substring of the second string. A collation which supports The validity test is provided by all collations. It MUST NOT be
substring matching will automatically support the two special listed separately in the collation registration.
cases of substring matching: prefix and suffix matching if those
special cases are supported by the application protocol. It 4.2.2. Equality
returns "match" or "no-match" when supplied valid input and
returns "error" when supplied invalid input. The equality test always returns "match" or "no-match" when supplied
valid input, and MAY return "undefined" if one or both input strings
o The ordering function determines how two octet strings are are not valid.
ordered. It returns "-1" if the first string is listed before the
second string according to the collation, "+1" if the second The equality test MUST be reflexive and symmetric. For valid input,
string is listed before the first string, and "0" if the two it MUST be transitive.
strings are equal. If the order of the two strings is reversed,
the result of the ordering function of the collation MUST be If a collation provides either a substring or an ordering test, it
negated. In general, collations SHOULD NOT return "0" unless the MUST also provide an equality test. The substring and/or ordering
two octet sequences are identical. tests MUST be consistent with the equality test.
Since ordering is normally used to sort a list of items, "error" In this specification, the return values of the equality test are
is not a useful return value from the ordering function. Strings called "match", "no-match" and "undefined". This is not a
with errors that prevent the sorting algorithm from functioning specification, merely a choice of phrasing.
correctly should sort to the end of the list. Thus if the first
string is invalid UTF-8 while the second string is valid, the 4.2.3. Substring
result will be "+1". If the second string is invalid UTF-8 while
the first string is valid, the result will be "-1". If the The substring matching operation determines if the first string is a
collation is character-based, and both strings are invalid UTF-8, substring of the second string, i.e. if one or more substrings of the
the result SHOULD match the result from the "i;octet" collation. second string is equal to the first, as defined by the collation's
equality operation.
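
Defining substring matching through the equality operation can be sketched by brute force: test every slice of the second string against the first. This is an illustration, not an efficient or normative algorithm. The function names and the ASCII case-folding equality are assumptions made for the example.

```python
def ascii_fold_equal(a: bytes, b: bytes) -> str:
    """Toy equality operation: compare while ignoring ASCII case."""
    return "match" if a.lower() == b.lower() else "no-match"


def find_substring_matches(equal, needle: bytes, haystack: bytes):
    """Return (start, end) offsets of every slice of haystack equal to needle.

    Every slice is tested, so matches whose length differs from the
    needle's length, and overlapping matches, are found as well.
    """
    spans = []
    for start in range(len(haystack) + 1):
        for end in range(start, len(haystack) + 1):
            if equal(needle, haystack[start:end]) == "match":
                spans.append((start, end))
    return spans


spans = find_substring_matches(ascii_fold_equal, b"ana", b"Banana")
is_substring = bool(spans)
```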
When the collation is used with a "+" prefix, the behavior is the
same as when used with no prefix. When the collation is used with A collation which supports substring matching will automatically
a "-" prefix, results which would be "+1" are instead "-1" and support two special cases of substring matching: prefix and suffix
results which would be "-1" are instead "+1". matching if those special cases are supported by the application
protocol. It returns "match" or "no-match" when supplied valid input
Unless otherwise specified by the collation or application protocol, and returns "undefined" when supplied invalid input.
a NULL string (as opposed to an empty string) is equal only to
another NULL string, a NULL string is not a substring of any other
string, and a NULL string sorts to a position after all non-NULL
strings, but before strings which generate errors.
Some application protocols will permit the use of multi-value
attributes with a collation. This paragraph describes the rules that
apply unless otherwise specified by the collation or application
protocol. The equality and substring collation algorithms will be
iterated over each pair of single values from the two inputs. If any
combination produces an error, the result is an error. Otherwise, if
any combination produces a "match", the result is a match. Otherwise
the result is "no-match". For the ordering function, the smallest
ordinal octet string from the first set of values is compared to the
smallest ordinal octet string from the second set of values.
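The default multi-value rules above can be sketched as follows. This is a minimal illustration, not part of either draft: the function names are hypothetical, the single-value equality test is a stand-in based on plain octet comparison, and "smallest ordinal octet string" is assumed to mean the minimum under octet ordering.

```python
from typing import Callable, Iterable

MATCH, NO_MATCH, ERROR = "match", "no-match", "error"


def octet_equal(a: bytes, b: bytes) -> str:
    """Stand-in single-value equality test (octet comparison, never errors)."""
    return MATCH if a == b else NO_MATCH


def multi_value_test(test: Callable[[bytes, bytes], str],
                     first: Iterable[bytes],
                     second: Iterable[bytes]) -> str:
    """Iterate an equality or substring test over every pair of single values."""
    second = list(second)
    results = [test(a, b) for a in first for b in second]
    if ERROR in results:      # any error makes the whole result an error
        return ERROR
    if MATCH in results:      # otherwise any match is a match
        return MATCH
    return NO_MATCH


def multi_value_order(first: Iterable[bytes], second: Iterable[bytes]) -> int:
    """Compare the smallest value of each set (assumption: octet ordering).

    Both sets are assumed to be non-empty.
    """
    a, b = min(first), min(second)
    return (a > b) - (a < b)  # -1, 0 or +1
```

For example, multi_value_test(octet_equal, [b"a", b"b"], [b"c", b"b"]) yields "match" because one pair of values is equal.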
Application protocols MAY return position information for substring Application protocols MAY return position information for substring
matches. If this is done, the position information MUST include both matches. If this is done, the position information SHOULD include
the starting offset and the ending offset in the string. This is both the starting offset and the ending offset for each match. This
important because more sophisticated collations can match strings of is important because more sophisticated collations can match strings
unequal length (for example, a pre-composed accented character will of unequal length (for example, a pre-composed accented character can
match a decomposed accented character). match a decomposed accented character). In general, overlapping
matches SHOULD be reported (as when "ana" occurs twice within
Collation specifications intended for common use are expected to "banana") although there are cases where a collation may decide not
reference standards from standards bodies with significant experience to. For example, in a collation which treats all whitespace
dealing with the details of international character sets. sequences as identical, the substring operation could be defined such
that " 1 " (SP "1" SP) is reported just once within " 1 " (SP SP "1"
5. Application Protocol Requirements SP SP), not four times (SP SP 1 SP, SP 1 SP, SP 1 SP SP and SP SP 1
SP SP).
An application protocol which offers searching, substring matching
and/or sorting and permits the use of characters outside the US-ASCII A string is a substring of itself. The empty string is a substring
charset needs to consider the following requirements and issues: of all strings.
Note that the substring operation of some collations can match
strings of unequal length. For example, a pre-composed accented
character can match a decomposed accented character. Unicode
Collation Algorithm [8] discusses this in more detail.
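As an illustration of reporting overlapping matches with both offsets, here is a minimal sketch for plain octet comparison. The function name and the choice of an exclusive ending offset are assumptions made for the example. For a collation that can match strings of unequal length, the ending offset would have to come from the collation itself rather than from the length of the search string.

```python
from typing import List, Tuple


def find_all(needle: bytes, haystack: bytes) -> List[Tuple[int, int]]:
    """Return (start, end) offsets of every occurrence, overlaps included.

    The end offset is exclusive. Octet comparison is assumed, so every
    match has the same length as the needle.
    """
    matches = []
    start = haystack.find(needle)
    while start != -1:
        matches.append((start, start + len(needle)))
        # Resume one octet later so that overlapping matches are found.
        start = haystack.find(needle, start + 1)
    return matches
```

With this sketch, find_all(b"ana", b"banana") reports two matches, at offsets (1, 4) and (3, 6).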
In this specification, the return values of the substring operation
are called "match", "no-match" and "undefined". This is not a
specification, merely a choice of phrasing.
4.2.4. Ordering
The ordering operation determines how two strings are ordered. It
MUST be trichotomous and reflexive. For valid input, it MUST be
transitive.
Ordering returns "less" if the first string is listed before the
second string according to the collation, "greater" if the second
string is listed before the first string, and "equal" if the two
strings are equal as defined by the collation's equality operation.
If one or both strings are invalid, the result of ordering is
"undefined".
When the collation is used with a "+" prefix, the behavior is the
same as when used with no prefix. When the collation is used with a
"-" prefix, the result of the ordering operation of the collation
MUST be reversed.
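The effect of the "+" and "-" prefixes can be sketched as below. This is only an illustration: the helper names are hypothetical, octet ordering stands in for a real collation, and "reversed" is assumed to mean that "less" and "greater" are swapped while "equal" and "undefined" are left unchanged.

```python
from typing import Callable, Tuple

Order = Callable[[bytes, bytes], str]


def octet_order(a: bytes, b: bytes) -> str:
    """Stand-in ordering operation based on octet comparison."""
    if a < b:
        return "less"
    if a > b:
        return "greater"
    return "equal"


def apply_prefix(identifier: str, order: Order) -> Tuple[str, Order]:
    """Strip a leading "+" or "-" and return the matching ordering function."""
    if identifier.startswith("-"):
        swap = {"less": "greater", "greater": "less"}

        def reversed_order(a: bytes, b: bytes) -> str:
            result = order(a, b)
            # Assumption: "equal" and "undefined" are unaffected by reversal.
            return swap.get(result, result)

        return identifier[1:], reversed_order
    if identifier.startswith("+"):
        return identifier[1:], order   # same behavior as no prefix
    return identifier, order
```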
In this specification, the return values of the ordering operation
are called "less", "equal", "greater" and "undefined". This is not a
specification, merely a choice of phrasing.
4.3. Sort Keys
A collation specification SHOULD describe the internal transformation
algorithm to generate sort keys. This algorithm can be applied to
individual strings and the result can be stored to potentially
optimize future comparison operations. A collation MAY specify that
the sort key is generated by the identity function. The sort key may
have no meaning to a human. The sort key may not be valid input to
the collation.
4.4. Use of Lookup Tables
Some collations use customizable lookup tables, e.g. because the
tables depend on locale and may be modified after shipping the
software. Collations which use more than one customizable lookup
table in a documented format MUST assign numbers to the tables they
use. This permits an application protocol command to access the
tables used by a server collation, so that clients and servers use
the same tables.
5. Application Protocol Requirements
This section describes the requirements and issues that an
application protocol needs to consider if it offers searching,
substring matching and/or sorting, and permits the use of characters
outside the US-ASCII charset.
5.1. Character Encoding
The protocol specification has to make clear which characters (rather
than just octets) the collations operate on. This
can be done by specifying the protocol itself in terms of characters
(e.g. in the case of a query language), by specifying a single
character encoding for the protocol (e.g. UTF-8 [3]), or by
carefully describing the relevant issues of character encoding
labeling and conversion. In the latter case, details to consider
include how to handle unknown charsets, any charsets which are
mandatory-to-implement, any issues with byte-order that might apply,
and any transfer encodings which need to be supported.
5.2. Operations
The protocol must specify which of the operations defined in this
specification (equality matching, substring matching and ordering)
can be invoked in the protocol, and how they are invoked. There may
be more than one way to invoke an operation.
The protocol MUST provide a mechanism for the client to select the The protocol MUST provide a mechanism for the client to select the
collation to use with equality matching, substring matching and collation to use with equality matching, substring matching and
ordering. ordering.
The protocol MUST specify how comparisons behave in the absence of an If a protocol needs a total ordering and the collation chosen does
explicit collation negotiation or when a collation negotiation of "*" not provide it because the ordering operation returns "undefined" at
is used. The protocol MAY specify that the default collation used in least once, the recommended fallback is to sort all invalid strings
such circumstances is sensitive to server configuration. after the valid ones, and use i;octet to order the invalid strings.
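The recommended fallback for obtaining a total ordering can be sketched as follows. Several things here are assumptions made for illustration: "invalid" is taken to mean "not valid UTF-8", the chosen collation is assumed to be expressible as a sort key function for valid strings, and the function names are hypothetical.

```python
from typing import Callable, Iterable, List


def is_valid_utf8(s: bytes) -> bool:
    """Assumed validity test: the octets decode as UTF-8."""
    try:
        s.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def fallback_sort(strings: Iterable[bytes],
                  valid_key: Callable[[bytes], bytes]) -> List[bytes]:
    """Sort valid strings by the collation, then invalid ones by octets."""
    def key(s: bytes):
        if is_valid_utf8(s):
            return (0, valid_key(s))   # valid strings come first
        return (1, s)                  # invalid strings: octet order
    return sorted(strings, key=key)
```

For example, using bytes.upper as a stand-in sort key, fallback_sort([b"\xff", b"b", b"\xfe", b"A"], bytes.upper) places b"A" and b"b" first, followed by b"\xfe" and b"\xff".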
The protocol SHOULD provide a way to list available collations Although the collation's substring function provides a list of
matching a given wildcard pattern or patterns. matches, a protocol need not provide all that to the client. It may
provide only the first matching substring, or even just the
information that the substring search matched.
If the protocol provides positional information for the results of a If the protocol provides positional information for the results of a
substring match, that positional information MUST fully specify the substring match, that positional information SHOULD fully specify the
substring in the result that matches independent of the length of the substring(s) in the result that matches independent of the length of
search string. For example, returning both the starting and ending the search string. For example, returning both the starting and
offset of the match would suffice, as would the starting offset and a ending offset of the match would suffice, as would the starting
length. Returning just the starting offset is not acceptable. This offset and a length. Returning just the starting offset is not
rule is necessary because advanced collations can treat strings of acceptable. This rule is necessary because advanced collations can
different lengths as equal (for example, pre-composed and decomposed treat strings of different lengths as equal (for example, pre-
accented characters). composed and decomposed accented characters).
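The following sketch shows why a starting offset alone is insufficient. It is not a registered collation: equality after Unicode NFC normalization is used as a hypothetical stand-in for an "advanced" collation, offsets are counted in code points rather than octets, and the brute-force search is chosen for clarity, not efficiency.

```python
import unicodedata
from typing import Optional, Tuple


def nfc_equal(a: str, b: str) -> bool:
    """Hypothetical equality test: compare strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def find_first(needle: str, haystack: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the first matching slice, end exclusive."""
    for start in range(len(haystack) + 1):
        for end in range(start, len(haystack) + 1):
            if nfc_equal(needle, haystack[start:end]):
                return (start, end)
    return None
```

Searching for the pre-composed "\u00e9" (one code point) in "cafe\u0301 au lait" returns (3, 5): the matched span is two code points long, so its end cannot be derived from the length of the search string.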
If the protocol permits the use of collations on stored character 5.3. Wildcards
data which is not encoded with the UTF-8 charset, then the protocol
specification has to describe relevant issues of the conversion. The protocol MUST specify whether it allows the use of wildcards in
Details to consider include how to handle unknown charsets, any collation identifiers or not. If the protocol allows wildcards,
charsets which are mandatory-to-implement, any issues with byte-order then:
that might apply, and any transfer encodings which need to be The protocol MUST specify how comparisons behave in the absence of
supported. explicit collation negotiation or when a collation of "*" is
requested. The protocol MAY specify that the default collation
used in such circumstances is sensitive to server configuration.
The protocol SHOULD provide a way to list available collations
matching a given wildcard pattern or patterns.
5.4. Canonicalization Function
If the protocol uses a canonicalization function for strings, then
use of collations MAY be appropriate for that function. As an
example, many protocols use case independent strings. In most cases,
a simple ASCII mapping to upper/lower case works well, as i;ascii-
casemap offers. However, in some cases another collation may be
better, e.g. to handle Turkish dotted/dotless i. Protocol designers
should consider in each case whether to use a specifiable collation.
If the protocol provides a canonicalization function for strings, 5.5. Disconnected Clients
then use of collations MAY be appropriate for that function.
If the protocol supports disconnected clients, then a mechanism for If the protocol supports disconnected clients, then a mechanism for
the client to precisely replicate the server's collation algorithm is the client to precisely replicate the server's collation algorithm is
likely desirable. Thus the protocol MAY wish to provide a command to likely desirable. Thus the protocol MAY wish to provide a command to
fetch lookup tables used by charset conversions and collations. fetch lookup tables used by charset conversions and collations.
5.6. Error Codes
The protocol specification should consider assigning protocol error The protocol specification should consider assigning protocol error
codes for the following circumstances: codes for the following circumstances:
o The client requests the use of a collation by identifier or
o The client requests the use of a collation by name or pattern, but pattern, but no implemented collation matches that pattern.
no implemented collation matches that pattern. o The client attempts to use a collation for an operation that is
not supported by that collation. For example, attempting to use
o The client attempts to use a collation for a function that is not the "i;ascii-numeric" collation for substring matching.
supported by that collation. For example, attempting to use the
"i;ascii-numeric" collation for a substring matching function.
o The client uses an equality or substring matching collation and o The client uses an equality or substring matching collation and
the result is an error. It may be appropriate to distinguish the result is an error. It may be appropriate to distinguish
between the two input strings, particularly when one is supplied between the two input strings, particularly when one is supplied
by the client and one is stored by the server. It might also be by the client and one is stored by the server. It might also be
appropriate to distinguish the specific case of an invalid UTF-8 appropriate to distinguish the specific case of an invalid UTF-8
string. string.
If the protocol permits the use of a collation with data structures 5.7. Octet Collation
beyond those described in this specification (octet strings, NULL
string, array of octet strings), the protocol MUST describe the
default behavior for a collation with that data structure.
6. Initial Collations The i;octet (Section 9.4) collation is only usable with protocols
based on octet-strings. Clients and servers MUST NOT use i;octet
with other protocols.
This section describes an initial set of collations for the collation If the protocol permits the use of collations with data structures
registry. other than strings, the protocol MUST describe the default behavior
for a collation with those data structures.
6.1 Octet Collation
The "i;octet" collation is a simple and fast collation intended for
use on binary octet strings rather than on character data. It never
returns an "error" result. It provides equality, substring and
ordering functions. The ordering algorithm is as follows:
1. If both strings are the empty string, return the result "0".
2. If the first string is empty and the second is not, return the
result "-1".
3. If the second string is empty and the first is not, return the 6. Use by Existing Protocols
result "+1".
4. If both strings begin with the same octet value, remove the first
octet from both strings and repeat this algorithm from step 1.
5. If the unsigned value (0 to 255) of the first octet of the first Both ACAP [12] and Sieve [15] are standards track specifications
string is less than the unsigned value of the first octet of the which used collations prior to the creation of this specification and
second string, then return "-1". registry. Those standards do not meet all the application protocol
requirements described in Section 5.
6. If this step is reached, return "+1". These protocols allow the use of the i;octet (Section 9.4) collation
working directly on UTF-8 data as used in these protocols.
This algorithm is roughly equivalent to the C library function memcmp In Sieve, all matches are either true or false. Accordingly, Sieve
with appropriate length checks added. servers must treat "undefined" and "no-match" results of the equality
and substring operations as false, and only "match" as true.
The matching function returns "match" if the sorting algorithm would In ACAP and Sieve, there are no invalid strings. In this document's
return "0". Otherwise the matching function returns "no-match". terms, invalid strings sort after valid strings.
The substring function returns "match" if the first string is the IMAP [16] also collates, although that is explicit only when the
empty string, or if there exists a substring of the second string of COMPARATOR [18] extension is used. The built-in IMAP substring
length equal to the length of the first string which would result in operation and the ordering provided by the SORT [17] extension may
a "match" result from the equality function. Otherwise the substring not meet the requirements made in this document.
function returns "no-match".
The associated canonicalization algorithm is the identity function. Other protocols may be in a similar position.
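The six-step "i;octet" ordering algorithm of the earlier draft, together with its matching and substring functions, can be sketched as below. The function names are hypothetical, and the integers -1, 0 and 1 stand for the draft's "-1", "0" and "+1" results. An index is advanced instead of removing octets from the strings, which is equivalent.

```python
def octet_order(a: bytes, b: bytes) -> int:
    """Ordering function following steps 1 to 6 of the text."""
    i = 0
    while True:
        a_empty, b_empty = i >= len(a), i >= len(b)
        if a_empty and b_empty:
            return 0                       # step 1
        if a_empty:
            return -1                      # step 2
        if b_empty:
            return 1                       # step 3
        if a[i] == b[i]:
            i += 1                         # step 4: drop first octet, repeat
            continue
        return -1 if a[i] < b[i] else 1    # steps 5 and 6 (unsigned values)


def octet_equal(a: bytes, b: bytes) -> str:
    """Matching function: "match" exactly when the ordering result is 0."""
    return "match" if octet_order(a, b) == 0 else "no-match"


def octet_substring(needle: bytes, haystack: bytes) -> str:
    """Substring function: some slice of equal length must match."""
    n = len(needle)
    for start in range(len(haystack) - n + 1):
        if octet_equal(needle, haystack[start:start + n]) == "match":
            return "match"
    return "no-match"
```

Python indexes bytes objects as integers from 0 to 255, so the comparison in steps 5 and 6 is unsigned, as the text requires.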
6.2 ASCII Numeric Collation In IMAP, the default collation is i;ascii-casemap, because its
operations most closely resemble IMAP's built-in operations.
The "i;ascii-numeric" collation is a simple collation intended for 7. Collation Registration
use with arbitrary sized decimal numbers stored as octet strings of
US-ASCII digits (0x30 to 0x39). It supports equality and ordering,
but does not support the substring function. The algorithm is as
follows:
1. If neither string begins with a digit, return "error" if
matching, or the result of the "i;octet" collation for ordering.
2. If the first string begins with a digit and the second string
does not, return "error" if matching and "-1" for ordering.
3. If the second string begins with a digit and the first string
does not, return "error" if matching and "+1" for ordering.
4. Let "n" be the number of digits at the beginning of the first
string, and "m" be the number of digits at the beginning of the
second string.
5. If n is equal to m, return the result of the "i;octet" collation.
6. If n is greater than m, prepend a string of "n - m" zeros to the
second string and return the result of the "i;octet" collation.
7. If m is greater than n, prepend a string of "m - n" zeros to the
first string and return the result of the "i;octet" collation.
The associated canonicalization algorithm is to truncate the input
string at the first non-digit character.
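The "i;ascii-numeric" steps above can be sketched as follows. The function names are hypothetical. One point is an assumption: in steps 5 to 7 the "i;octet" comparison is applied to the leading digit runs only (that is, to the canonicalized strings), which the text does not state explicitly.

```python
def leading_digits(s: bytes) -> bytes:
    """Canonicalization: truncate at the first non-digit octet."""
    end = 0
    while end < len(s) and 0x30 <= s[end] <= 0x39:
        end += 1
    return s[:end]


def _cmp(a: bytes, b: bytes) -> int:
    """Octet comparison returning -1, 0 or 1."""
    return (a > b) - (a < b)


def ascii_numeric_order(a: bytes, b: bytes) -> int:
    da, db = leading_digits(a), leading_digits(b)
    if not da and not db:
        return _cmp(a, b)              # step 1: fall back to i;octet
    if not db:
        return -1                      # step 2
    if not da:
        return 1                       # step 3
    width = max(len(da), len(db))      # steps 4 to 7: pad with zeros
    return _cmp(da.rjust(width, b"0"), db.rjust(width, b"0"))


def ascii_numeric_equal(a: bytes, b: bytes) -> str:
    if not leading_digits(a) or not leading_digits(b):
        return "error"                 # steps 1 to 3 when matching
    return "match" if ascii_numeric_order(a, b) == 0 else "no-match"
```

Under these assumptions, ascii_numeric_order(b"9", b"10") is -1 even though b"9" sorts after b"10" in octet order, and b"007" compares equal to b"7".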
6.3 ASCII Casemap Collation
The "en;ascii-casemap" collation is a simple collation intended for
use with English language text in pure US-ASCII. It provides
equality, substring and ordering functions. The algorithm first
applies a canonicalization algorithm to both input strings which
subtracts 32 (0x20) from all octet values between 97 (0x61) and 122
(0x7A) inclusive. The result of the collation is then the same as
the result of the "i;octet" collation for the canonicalized strings.
Care should be taken when using OS-supplied functions to implement
this collation as this is not locale sensitive, but functions such as
strcasecmp and toupper can be locale sensitive.
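A locale-independent way to implement the canonicalization described above is an explicit 256-entry translation table. The sketch below is illustrative only and its names are hypothetical. The canonicalized string can also serve as a stored sort key.

```python
# Map octets 97..122 ("a".."z") to 65..90 by subtracting 32; leave all
# other octet values, including those above 127, unchanged.
_CASEMAP_TABLE = bytes(o - 32 if 97 <= o <= 122 else o for o in range(256))


def casemap_key(s: bytes) -> bytes:
    """Canonicalize a string; the result is usable as a sort key."""
    return s.translate(_CASEMAP_TABLE)


def casemap_order(a: bytes, b: bytes) -> int:
    """Octet ordering of the canonicalized strings: -1, 0 or 1."""
    ka, kb = casemap_key(a), casemap_key(b)
    return (ka > kb) - (ka < kb)
```

Because the table is built from fixed octet values, the result does not depend on the process locale, unlike some OS-supplied case-folding functions.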
For historical reasons, in the context of ACAP and Sieve, the name 7.1. Collation Registration Procedure
"i;ascii-casemap" is a synonym for this collation.
6.4 Nameprep Collation The IETF will create a mailing list, collation@ietf.org, which can be
used for public discussion of collation proposals prior to
registration. Use of the mailing list is strongly encouraged. The
IESG will appoint a designated expert who will monitor the
collation@ietf.org mailing list and review registrations.
The "i;nameprep;v=1;uv=3.2" collation is an implementation of the The registration procedure begins when a completed registration
nameprep [7] specification based on normalization tables from Unicode template is sent to iana@iana.org and collation@ietf.org. The
version 3.2. This collation applies the nameprep canonicalization
function to both input strings and then returns the result of the
i;octet collation on the canonicalized strings. While this collation
offers all three functions, the ordering function it provides is
inadequate for use by the majority of the world.
Version number 1 is applied to nameprep as specified in RFC 3491. If
the nameprep specification is revised without any changes that would
produce different results when given the same pair of input octet
strings, then the version number will remain unchanged.
The table numbers for tables used by nameprep are as follows:
+--------------+-----------------------+
| Table Number | Table Name |
+--------------+-----------------------+
| 1 | UnicodeData-3.2.0.txt |
| 2 | Table B.1 |
| 3 | Table B.2 |
| 4 | Table C.1.2 |
| 5 | Table C.2.2 |
| 6 | Table C.3 |
| 7 | Table C.4 |
| 8 | Table C.5 |
| 9 | Table C.6 |
| 10 | Table C.7 |
| 11 | Table C.8 |
| 12 | Table C.9 |
+--------------+-----------------------+
6.5 Basic Collation
The basic collation is intended to provide tolerable results for a
number of languages for all three functions (equality, substring and
ordering) so it is suitable as a mandatory-to-implement collation for
protocols which include ordering support. The ordering function of
the basic collation is the Unicode Collation Algorithm [8] version 9
(UCAv9).
The equality and substring functions are created as described in
UCAv9 section 8. While that section is informative to UCAv9, it is
normative to this collation specification.
This collation is based on Unicode version 3.2, with the following
tables relevant:
1. For the normalization step, UnicodeData-3.2.0.txt [16] is used.
Column 5 is used to determine the canonical decomposition, while
column 3 contains the canonical combining classes necessary to
attain canonical order.
2. The table of characters which require a logical order exception
is a subset of the table in PropList-3.2.0.txt [17] and is
included here:
0E40..0E44 ; Logical_Order_Exception
# Lo [5] THAI CHARACTER SARA E..THAI CHARACTER SARA AI MAIMALAI
0EC0..0EC4 ; Logical_Order_Exception
# Lo [5] LAO VOWEL SIGN E..LAO VOWEL SIGN AI
# Total code points: 10
3. The table used to translate normalized code points to a sort key
is allkeys-3.1.1.txt [18].
UCAv9 includes a number of configurable parameters and steps labelled
as potentially optional. The following list summarizes the defaults
used by this collation:
o The logical order exception step is mandatory by default to
support the largest number of languages.
o Steps 2.1.1 to 2.1.3 are mandatory as the repertoire of the basic
collation is intended to be large.
o The second level in the sort key is evaluated forwards by default.
o The variable weighting uses the "non-ignorable" option by default.
o The semi-stable option is not used by default.
o Support for exactly three levels of collation is the default
behavior.
o No preprocessing step is used by the basic collation prior to
applying the UCAv9 algorithm. Note that an application protocol
specification MAY require pre-processing prior to the use of any
collations.
o The equality and substring algorithms exclude differences at level
2 and 3 by default (thus it is case-insensitive and ignores
accentual distinctions).
o The equality and substring algorithms use the "Whole Characters
Only" feature described in UCAv9 section 8 by default.
The exact collation name with these defaults is
"i;basic;uca=3.1.1;uv=3.2". When a specification states that the
basic collation is mandatory-to-implement, only this specific name is
mandatory-to-implement.
In order to allow modification of the optional behaviors, the
following ABNF is used for variations of the basic collation:
basic-collation = ("i" / Language-Tag) ";basic;uca=3.1.1;uv=3.2"
[";match=accent" / ";match=case"]
[";tailor=" 1*collation-char ]
If multiple modifiers appear, they MUST appear in the order described
above. The modifiers have the following meanings:
match=accent Both the first and second levels of the sort keys are
considered relevant to the equality and substring
operations (rather than the default of first level
only). This makes the matching functions sensitive to
accentual distinctions.
match=case The first three levels of sort keys are considered
relevant to the equality and substring operations.
This makes the matching functions sensitive to both
case and accentual distinctions.
The default weighting option is "non-ignorable". The "semi-stable"
sort key option is not used by default.
The canonicalization algorithm associated with this collation is the
output of step 3 of the UCAv9 algorithm (described in section 4.3 of
the UCA specification). This canonicalization is not suitable for
human consumption.
Finally, the UCAv9 algorithm permits the "allkeys" table to be
tailored to a language. People who make quality tailorings are
encouraged to register those tailorings using the collation registry.
Tailoring names beginning with "x" are reserved for experimental use,
are treated as "Limited use" and MUST NOT match wildcards if any
registered collation is available that does match.
7. Use by ACAP and Sieve
Both ACAP [11] and Sieve [15] are standards track specifications
which used collations prior to the creation of this specification and
registry. Those standards do not meet all the application protocol
requirements described in Section 5. For backwards compatibility,
those protocols use the "i;ascii-casemap" instead of
"en;ascii-casemap".
8. IANA Considerations
8.1 Collation Registration Procedure
IANA will create a mailing list collation@iana.org which can be used
for public discussion of collation proposals prior to registration.
Use of the mailing list is encouraged but not required. The actual
registration procedure will not begin until the completed
registration template is sent to iana@iana.org. The IESG will
appoint a designated expert who will monitor the collation@iana.org
mailing list and review registrations forwarded from IANA. The
designated expert is expected to tell IANA and the submitter of the designated expert is expected to tell IANA and the submitter of the
registration within two weeks whether the registration is approved, registration within two weeks whether the registration is approved,
approved with minor changes, or rejected with cause. When a approved with minor changes, or rejected with cause. When a
registration is rejected with cause, it can be re-submitted if the registration is rejected with cause, it can be re-submitted if the
concerns listed in the cause are addressed. Decisions made by the concerns listed in the cause are addressed. Decisions made by the
designated expert can be appealed to the IESG and subsequently follow designated expert can be appealed to the IESG Applications Area Director,
the normal appeals procedure for IESG decisions. then to the IESG. They follow the normal appeals procedure for IESG
decisions.
Collation registrations in a standards track, BCP or IESG-approved Collation registrations in a standards track, BCP or IESG-approved
experimental RFC are owned by the IESG and changes to the experimental RFC are owned by the IETF, and changes to the
registration follow normal procedures for updating such documents. registration follow normal procedures for updating such documents.
Collation registrations in other RFCs are owned by the RFC author(s). Collation registrations in other RFCs are owned by the RFC author(s).
Other collation registrations are owned by the individual(s) listed Other collation registrations are owned by the individual(s) listed
in the contact field of the registration and IANA will preserve this in the contact field of the registration and IANA will preserve this
information. Changes to a registration MUST be approved by the information. Changes to a registration MUST be approved by the
owner. In the event the owner can't be contacted for a period of one owner. In the event the owner cannot be contacted for a period of
month and a change is deemed necessary, the IESG MAY re-assign one month and a change is deemed necessary, the IESG MAY re-assign
ownership to an appropriate party. ownership to an appropriate party.
8.2 Collation Registration Template 7.2. Collation Registration Format
Registration of a collation is done by sending a well-formed XML Registration of a collation is done by sending a well-formed XML
document that validates with collationreg.dtd (Section 9). The document to collation@ietf.org and iana@iana.org.
registration MUST include a collation element that MAY include an
"rfc=" attribute if the specification is in an RFC and MUST include a
scope attribute of "i18n", "local" or "other" and an intendedUse
attribute of "common", "limited", "vendor", or "deprecated".
The collation element contains the other elements in the 7.2.1. Registration Template
registration. The mandatory name element gives the precise name of
the comparator. The mandatory title element gives the title of the
comparator. The mandatory functions element lists which of the three
functions the comparator provides. The mandatory specification
element describes where to find the specification, and MAY have a URI
attribute. The submittor element provides an RFC 2822 email address
for the person who submitted the registration. It is optional if the
owner element contains an email address. The mandatory owner element
contains either the four letters "IETF" or an email address of the
owner of the registration. The optional version element is included
when the registration is likely to be revised or has been revised in
such a way that the results change for certain input strings. The
optional UnicodeVersion element indicates the version number of the
UnicodeData file on which the collation is based. The optional
UCAVersion element specifies the version of the Unicode Collation
Algorithm on which the collation is based. The optional
UCAMatchLevel element specifies the number of Unicode Collation
Algorithm sort key levels used for the equality and substring
operations.
Here is a template for the registration: Here is a template for the registration:
<?xml verison='1.0'?> <?xml version='1.0'?>
<!DOCTYPE rfc SYSTEM 'collationreg.dtd'> <!DOCTYPE collation SYSTEM 'collationreg.dtd'>
<collation rfc="XXXX" scope="i18n" intendedUse="common"> <collation rfc="YYYY" scope="i18n" intendedUse="common">
<name>collation name</name> <identifier>collation identifier</identifier>
<title>technical title for collation</title> <title>technical title for collation</title>
<functions>equality order substring</functions> <operations>equality order substring</operations>
<specification>specification reference</specification> <specification>specification reference</specification>
<owner>email address of owner or IETF</owner> <owner>email address of owner or IETF</owner>
<submittor>email address of submittor<submittor> <submitter>email address of submitter</submitter>
<version>1</version> <version>1</version>
<UnicodeVersion>3.2</UnicodeVersion>
<UCAVersion>3.1.1</UCAVersion>
</collation> </collation>
7.2.2. The collation Element
The root of the registration document MUST be a <collation> element.
The collation element contains the other elements in the
registration, which are described in the following sub-subsections,
in the order given here.
The <collation> element MAY include an "rfc=" attribute if the
specification is in an RFC. The "rfc=" attribute gives only the
number of the RFC, without any prefix, such as "RFC", or suffix, such
as ".txt".
The <collation> element MUST include a "scope=" attribute, which MUST
have one of the values "i18n", "local" or "other".
The <collation> element MUST include an "intendedUse=" attribute,
which must have one of the values "common", "limited", "vendor", or
"deprecated". Collation specifications intended for "common" use are
expected to reference standards from standards bodies with
significant experience dealing with the details of international
character sets.
Be aware that future revisions of this specification may add Be aware that future revisions of this specification may add
additional function types, as well as additional XML attributes and additional function types, as well as additional XML attributes,
values. Any system which automatically parses these XML documents values and elements. Any system which automatically parses these XML
MUST take this into account to preserve future compatibility. documents MUST take this into account to preserve future
compatibility.
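A parser that tolerates future additions might look like the sketch below. It is only an illustration under several assumptions: it uses the element names of the newer template (identifier, operations), the function name and the returned dictionary layout are hypothetical, and the sample document omits the XML declaration and DOCTYPE for brevity. The attribute "futureAttr", the element "futureElement" and the operation "fuzzy" are invented solely to show that unknown items are not treated as errors.

```python
import xml.etree.ElementTree as ET

KNOWN_OPERATIONS = {"equality", "order", "substring"}


def parse_registration(xml_text: str) -> dict:
    """Extract the known fields and ignore anything unrecognized."""
    root = ET.fromstring(xml_text)
    if root.tag != "collation":
        raise ValueError("root element must be <collation>")
    ops = (root.findtext("operations") or "").split()
    return {
        "identifier": root.findtext("identifier"),
        "title": root.findtext("title"),
        "scope": root.get("scope"),
        "intendedUse": root.get("intendedUse"),
        "operations": [op for op in ops if op in KNOWN_OPERATIONS],
        # Unknown operation types are kept aside instead of rejected.
        "unknown_operations": [op for op in ops if op not in KNOWN_OPERATIONS],
    }


SAMPLE = """<collation rfc="YYYY" scope="i18n" intendedUse="common" futureAttr="x">
  <identifier>i;octet</identifier>
  <title>Octet</title>
  <operations>equality order substring fuzzy</operations>
  <specification>specification reference</specification>
  <owner>IETF</owner>
  <futureElement>not understood by this parser</futureElement>
</collation>"""
```

Calling parse_registration(SAMPLE) returns the identifier "i;octet" with the three known operations, while the unknown attribute, element and operation do not cause a failure.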
8.3 Octet Collation Registration 7.2.3. The identifier Element
<?xml verison='1.0'?> The <identifier> element gives the precise identifier of the
<!DOCTYPE rfc SYSTEM 'collationreg.dtd'> collation, e.g. i;ascii-casemap. The <identifier> element is
<collation rfc="XXXX" scope="i18n" intendedUse="common"> mandatory.
<name>i;octet</name>
<title>Octet</title>
<functions>equality order substring</functions>
<specification>RFC XXXX</specification>
<owner>IETF</owner>
<submittor>chris.newman@sun.com<submittor>
</collation>
8.4 ASCII Numeric Collation Registration 7.2.4. The title Element
<?xml verison='1.0'?> The <title> element gives the title of the collation. The <title>
<!DOCTYPE rfc SYSTEM 'collationreg.dtd'> element is mandatory.
<collation rfc="XXXX" scope="other" intendedUse="limited">
<name>i;ascii-numeric</name>
<title>ASCII Numeric</title>
<functions>equality order</functions>
<specification>RFC XXXX</specification>
<owner>IETF</owner>
<submittor>chris.newman@sun.com<submittor>
</collation>
8.5 Legacy English Casemap Collation Registration 7.2.5. The operations Element
<?xml verison='1.0'?> The <operations> element lists which of the three operations
<!DOCTYPE rfc SYSTEM 'collationreg.dtd'> ("equality", "order" or "substring") the collation provides,
<collation rfc="XXXX" scope="local" intendedUse="deprecated"> separated by single spaces. The <operations> element is mandatory.
<name>i;ascii-casemap</name>
<title>Legacy English Casemap</title>
<functions>equality order substring</functions>
<specification>RFC XXXX</specification>
<owner>IETF</owner>
<submittor>chris.newman@sun.com<submittor>
</collation>
8.6 English Casemap Collation Registration 7.2.6. The specification Element
<?xml verison='1.0'?> The <specification> element describes where to find the
<!DOCTYPE rfc SYSTEM 'collationreg.dtd'> specification. The <specification> element is mandatory. It MAY
<collation rfc="XXXX" scope="local" intendedUse="common"> have a URI attribute. There may be more than one <specification>
<name>en;ascii-casemap</name> element, in which case they together form the specification.
<title>English Casemap</title>
<functions>equality order substring</functions>
<specification>RFC XXXX</specification>
<owner>IETF</owner>
<submittor>chris.newman@sun.com<submittor>
</collation>
8.7 Nameprep Collation Registration If it is discovered that parts of a collation specification conflict,
a new revision of the collation is necessary, and the
collation@ietf.org mailing list should be notified.
<?xml verison='1.0'?> 7.2.7. The submitter Element
<!DOCTYPE rfc SYSTEM 'collationreg.dtd'>
<collation rfc="XXXX" scope="i18n" intendedUse="common">
<name>i;nameprep;v=1;uv=3.2</name>
<title>Nameprep</title>
<functions>equality order substring</functions>
<specification>RFC XXXX</specification>
<owner>IETF</owner>
<submittor>chris.newman@sun.com<submittor>
<version>1</version>
<UnicodeVersion>3.2</UnicodeVersion>
</collation>
8.8 Basic Collation Registration The <submitter> element provides an RFC 2822 [13] email address for
the person who submitted the registration. It is optional if the
<owner> element contains an email address.
<?xml verison='1.0'?> There may be more than one <submitter> element.
<!DOCTYPE rfc SYSTEM 'collationreg.dtd'>
<collation rfc="XXXX" scope="i18n" intendedUse="common">
<name>i;basic;uca=3.1.1;uv=3.2</name>
<title>Basic</title>
<functions>equality order substring</functions>
<specification>RFC XXXX</specification>
<owner>IETF</owner>
<submittor>chris.newman@sun.com<submittor>
<UnicodeVersion>3.2</UnicodeVersion>
<UCAVersion>3.1.1</UCAVersion>
<UCAMatchLevel>1</UCAMatchLevel>
</collation>
8.9 Basic Accent Sensitive Match Collation Registration 7.2.8. The owner Element
<?xml verison='1.0'?> The <owner> element contains either the four letters "IETF" or an
<!DOCTYPE rfc SYSTEM 'collationreg.dtd'> email address of the owner of the registration. The <owner> element
<collation rfc="XXXX" scope="i18n" intendedUse="common"> is mandatory. There may be more than one <owner> element. If so,
<name>i;basic;uca=3.1.1;uv=3.2;match=accent</name> all owners are equal. Each owner can speak for all.
<title>Basic Accent Sensitive Match</title>
<functions>equality order substring</functions>
<specification>RFC XXXX</specification>
<owner>IETF</owner>
<submittor>chris.newman@sun.com<submittor>
<UnicodeVersion>3.2</UnicodeVersion>
<UCAVersion>3.1.1</UCAVersion>
<UCAMatchLevel>2</UCAMatchLevel>
</collation>
8.10 Basic Case Sensitive Match Collation Registration 7.2.9. The version Element
<?xml verison='1.0'?> The <version> element is included when the registration is likely to
<!DOCTYPE rfc SYSTEM 'collationreg.dtd'> be revised or has been revised in such a way that the results change
<collation rfc="XXXX" scope="i18n" intendedUse="common"> for certain input strings. The <version> element is optional.
<name>i;basic;uca=3.1.1;uv=3.2;match=case</name>
<title>Basic Case Sensitive Match</title> 7.2.10. The variable Element
<functions>equality order substring</functions>
<specification>RFC XXXX</specification> The <variable> element specifies an optional variable with which the
<owner>IETF</owner> collation's behaviour can be tailored. The <variable> element is
<submittor>chris.newman@sun.com<submittor> optional. When it is used, it must contain <name> and <default>
<UnicodeVersion>3.2</UnicodeVersion> elements and may contain one or more <value> elements.
<UCAVersion>3.1.1</UCAVersion>
<UCAMatchLevel>3</UCAMatchLevel>
</collation>
8.11 Structure of Collation Registry 7.2.11. The name Element
The <name> element specifies the name value of a variable. The
<name> element is mandatory.
7.2.12. The default Element
The <default> element specifies the default value of a variable. The
<default> element is mandatory.
7.2.13. The value Element
The <value> element specifies a legal value of a variable. The
<value> element is optional. If one or more <value> elements are
present, only those values are legal. If none is, then the
variable's legal values do not form an enumerated set, and the rules
MUST be specified in an RFC accompanying the registration.
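The element rules above can be sketched as a small reader for a registration document. This is only an illustration under stated assumptions: the helper name parse_registration, the returned dictionary layout, and the sample document are hypothetical and not part of the registry. Unrecognized elements and operation names are reported and not rejected, since future revisions may add more.

```python
import xml.etree.ElementTree as ET

# Operations named in this document; later revisions may add more.
KNOWN_OPERATIONS = {"equality", "order", "substring"}
# Child elements described as mandatory in Section 7.2.
MANDATORY = ("identifier", "title", "operations", "specification", "owner")
# Optional child elements described in Section 7.2.
OPTIONAL = ("submitter", "version", "variable")


def parse_registration(xml_text):
    """Parse one registration into a dict (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    if root.tag != "collation":
        raise ValueError("root element is not <collation>")
    missing = [tag for tag in MANDATORY if root.find(tag) is None]
    if missing:
        raise ValueError("missing mandatory elements: " + ", ".join(missing))
    operations = (root.findtext("operations") or "").split()
    known_tags = set(MANDATORY) | set(OPTIONAL)
    return {
        "identifier": root.findtext("identifier").strip(),
        "title": root.findtext("title").strip(),
        "operations": operations,
        # Unknown operation names are reported, not rejected.
        "unrecognized_operations": [
            op for op in operations if op not in KNOWN_OPERATIONS
        ],
        "specifications": [e.text or "" for e in root.findall("specification")],
        "owners": [e.text or "" for e in root.findall("owner")],
        "submitters": [e.text or "" for e in root.findall("submitter")],
        "version": root.findtext("version"),  # None when absent
        # Elements unknown to this sketch are listed and otherwise ignored.
        "other_elements": sorted({e.tag for e in root} - known_tags),
    }


# Hypothetical sample modelled on the registrations in this document.
SAMPLE = """<collation rfc="XXXX" scope="other" intendedUse="limited">
  <identifier>i;ascii-numeric</identifier>
  <title>ASCII Numeric</title>
  <operations>equality order</operations>
  <specification>RFC XXXX</specification>
  <owner>IETF</owner>
  <submitter>chris.newman@sun.com</submitter>
</collation>"""

registration = parse_registration(SAMPLE)
```

The sketch checks only well-formedness and the presence of mandatory elements; it does not check that a <submitter> is present when <owner> holds no email address.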
7.3. Structure of Collation Registry
Once the registration is approved, IANA will store each XML Once the registration is approved, IANA will store each XML
registration document in a URL of the form http://www.iana.org/ registration document in a URL of the form
assignments/collation/collation-name.xml where collation-name is the http://www.iana.org/assignments/collation/collation-id.xml where
contents of the name element in the registration. Both the submittor collation-id is the contents of the identifier element in the
and the designated expert is responsible for verifying that the XML registration. Both the submitter and the designated expert are
is well-formed and complies with the DTD. In the future, it is hoped responsible for verifying that the XML is well-formed. The
IANA will take over XML verification responsibility from the registration document should avoid using new elements. If any are
designated expert. necessary, it is important to be consistent with other registrations.
IANA will also maintain a text summary of the registry under the name IANA will also maintain a text summary of the registry under the name
http://www.iana.org/assignments/collation/summary.txt. This summary http://www.iana.org/assignments/collation/summary.txt. This summary
is divided into four sections. The first section is for collations is divided into four sections. The first section is for collations
intended for common use. This section is intended for collation intended for common use. This section is intended for collation
registrations published in IESG approved RFCs or for locally scoped registrations published in IESG approved RFCs or for locally scoped
collations from the primary standards body for that locale. The collations from the primary standards body for that locale. The
designated expert is encouraged to reject collation registrations designated expert is encouraged to reject collation registrations
with an intended use of "common" if the expert believes it should be with an intended use of "common" if the expert believes it should be
"limited", as it is desirable to keep the number of "common" "limited", as it is desirable to keep the number of "common"
registrations small and high quality. The second section is reserved registrations small and high quality. The second section is reserved
for limited use collations. The third section is reserved for for limited use collations. The third section is reserved for
registered vendor specific collations. The final section is reserved registered vendor specific collations. The final section is reserved
for deprecated collations. for deprecated collations.
8.12 Example Initial Registry Summary 7.4. Example Initial Registry Summary
The following is an example of how IANA might structure the initial The following is an example of how IANA might structure the initial
registry summary.txt file: registry summary.txt file:
Collation Functions Scope Reference Collation Functions Scope Reference
--------- --------- ----- --------- --------- --------- ----- ---------
Common Use Collations: Common Use Collations:
i;octet e, o, s Other [RFC XXXX]
i;nameprep;v=1;uv=3.2 e, o, s i18n [RFC XXXX] i;nameprep;v=1;uv=3.2 e, o, s i18n [RFC XXXX]
i;basic;uca=3.1.1;uv=3.2 e, o, s i18n [RFC XXXX] i;ascii-casemap e, o, s Local [RFC XXXX]
i;basic;uca=3.1.1;uv=3.2;match=accent e, o, s i18n [RFC XXXX]
i;basic;uca=3.1.1;uv=3.2;match=case e, o, s i18n [RFC XXXX]
en;ascii-casemap e, o, s Local [RFC XXXX]
Limited Use Collations: Limited Use Collations:
i;octet e, o, s Other [RFC XXXX]
i;ascii-numeric e, o Other [RFC XXXX] i;ascii-numeric e, o Other [RFC XXXX]
Vendor Collations: Vendor Collations:
Deprecated Collations: Deprecated Collations:
i;ascii-casemap e, o, s Local [RFC XXXX]
References References
---------- ----------
[RFC XXXX] Newman, C., "Internet Application Protocol Collation [RFC XXXX] Newman, C., Duerst, M., Gulbrandsen, A., "Internet
Registry", RFC XXXX, Sun Microsystems, October 2003. Application Protocol Collation Registry", RFC XXXX,
Sun Microsystems, October 2013.
9. DTD for Collation Registration
<!- 8. Guidelines for Expert Reviewer
DTD for Collation Registration Document
Data types:
entity description
====== ===========
NUMBER [0-9]+
URI As defined in RFC 2396
CTEXT printable ASCII text (no line-terminators)
TEXT character data
->
<!ENTITY % NUMBER "CDATA">
<!ENTITY % URI "CDATA">
<!ENTITY % CTEXT "#PCDATA">
<!ENTITY % TEXT "#PCDATA">
<!ELEMENT collation (name,title,functions,specification+,owner+,
submittor*,version?,UnicodeVersion?,
UCAVersion?,UCAMatchLevel?)>
<!ATTLIST collation
rfc %NUMBER; "0"
scope (i18n|local|other) #IMPLIED
intendedUse (common|limited|vendor|deprecated) #IMPLIED>
<!ELEMENT name (%CTEXT;)>
<!ELEMENT title (%CTEXT;)>
<!ELEMENT functions (%CTEXT;)>
<!ELEMENT specification (%TEXT;)>
<!ATTLIST specification
uri %URI; "">
<!ELEMENT owner (%CTEXT;)>
<!ELEMENT submittor (%CTEXT;)>
<!ELEMENT version (%CTEXT;)>
<!ELEMENT UnicodeVersion (%CTEXT;)>
<!ELEMENT UCAVersion (%CTEXT;)>
<!ELEMENT UCAMatchLevel (%CTEXT;)>
10. Guidelines for Expert Reviewer
The expert reviewer appointed by the IESG has fairly broad latitude The expert reviewer appointed by the IESG has fairly broad latitude
for this registry. While a number of collations are expected for this registry. While a number of collations are expected
(particularly customizations of the basic collation for localized (particularly customizations of the basic collation for localized
use), an explosion of collations (particularly common use collations) use), an explosion of collations (particularly common use collations)
is not desirable for widespread interoperability. However, it is is not desirable for widespread interoperability. However, it is
important for the expert reviewer to provide cause when rejecting a important for the expert reviewer to provide cause when rejecting a
registration, and when possible to describe corrective action to registration, and when possible to describe corrective action to
permit the registration to proceed. The following table includes permit the registration to proceed. The following table includes
some example reasons to reject a registration with cause: some example reasons to reject a registration with cause:
o The registration is not a well-formed XML document.
o The registration is not a well-formed XML document that follows o The registration has an intended use of "common", but there is no
the DTD. evidence the collation will be widely deployed, so it should be
o The registration has intended use of "common", but there is no
evidence the collation will be widely deployed so it should be
listed as "limited". listed as "limited".
o The registration has an intended use of "common", but it is
redundant with the functionality of a previously registered
"common" collation.
o The registration has an intended use of "common", but the
specification is not detailed enough to allow interoperable
implementations by others.
o The registration has intended use of "common", but is redundant o The collation identifier fails to precisely identify the version
with the functionality of a previously registered "common" numbers of relevant tables to use.
collation.
o The collation name fails to precisely identify the version numbers
of relevant tables to use.
o The registration fails to meet one of the "MUST" requirements in o The registration fails to meet one of the "MUST" requirements in
Section 4. Section 4.
o The collation identifier fails to meet the syntax in Section 3.
o The collation name fails to meet the syntax in Section 3.
o The collation specification referenced in the registration is o The collation specification referenced in the registration is
vague or has optional features without a clear behavior specified. vague or has optional features without a clear behavior specified.
o The referenced specification does not adequately address security o The referenced specification does not adequately address security
considerations specific to that collation. considerations specific to that collation.
o The registration's operations are needlessly different from those
of traditional operations.
o The registration's XML is needlessly different from that of
already registered collations.
9. Initial Collations
This section describes an initial set of collations for the collation
registry.
9.1. ASCII Numeric Collation
9.1.1. ASCII Numeric Collation Description
The "i;ascii-numeric" collation is a simple collation intended for
use with arbitrary sized unsigned decimal integer numbers stored as
octet strings. US-ASCII digits (0x30 to 0x39) represent digits of
the numbers. Before converting from string to integer, the input
string is truncated at the first non-digit character. All input is
valid; strings which do not start with a digit represent positive
infinity.
The collation supports equality and ordering, but does not support
the substring operation.
The equality operation returns "match" if the two strings represent
the same number (i.e. leading zeroes and trailing nondigits are
disregarded) and "no-match" if the two strings represent different
numbers.
The ordering operation returns "less" if the first string represents
a smaller number than the second, "equal" if they represent the same
number, and "greater" if the first string represents a larger number
than the second.
Some examples: "0" is less than "1", and "1" is less than
"4294967298". "4294967298", "04294967298" and "4294967298b" are all
equal. "04294967298" is less than "". "", "x" and "y" are equal.
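The description above can be sketched in a few lines. This is an illustration only; the function names are hypothetical, and positive infinity is represented by None as an implementation choice of the sketch. Note that Python limits the length of decimal strings converted by int() by default, so truly arbitrary sized numbers would need that limit raised.

```python
import re

# Leading run of US-ASCII digits (0x30 to 0x39); may be empty.
_LEADING_DIGITS = re.compile(rb"[0-9]*")


def ascii_numeric_value(s):
    """Return the integer the octet string represents, or None for
    positive infinity (strings that do not start with a digit)."""
    digits = _LEADING_DIGITS.match(s).group()
    return int(digits) if digits else None


def ascii_numeric_order(a, b):
    """Ordering operation: returns "less", "equal" or "greater"."""
    va, vb = ascii_numeric_value(a), ascii_numeric_value(b)
    if va is None and vb is None:
        return "equal"      # both represent positive infinity
    if va is None:
        return "greater"    # infinity sorts after every number
    if vb is None:
        return "less"
    if va < vb:
        return "less"
    if va > vb:
        return "greater"
    return "equal"


def ascii_numeric_equal(a, b):
    """Equality operation: returns "match" or "no-match"."""
    return "match" if ascii_numeric_order(a, b) == "equal" else "no-match"
```

With these definitions, ascii_numeric_order(b"04294967298", b"") yields "less" and ascii_numeric_equal(b"4294967298", b"4294967298b") yields "match", in line with the examples above.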
11. Security Considerations 9.1.2. ASCII Numeric Collation Registration
<?xml version='1.0'?>
<!DOCTYPE collation SYSTEM 'collationreg.dtd'>
<collation rfc="XXXX" scope="other" intendedUse="limited">
<identifier>i;ascii-numeric</identifier>
<title>ASCII Numeric</title>
<operations>equality order</operations>
<specification>RFC XXXX</specification>
<owner>IETF</owner>
<submitter>chris.newman@sun.com</submitter>
</collation>
9.2. ASCII Casemap Collation
9.2.1. ASCII Casemap Collation Description
The "i;ascii-casemap" collation is a simple collation which operates
on octet strings and treats US-ASCII letters case-insensitively. It
provides equality, substring and ordering operations. All input is
valid.
Its equality, ordering and substring operations are as for i;octet,
except that first, the lower-case letters (octet values 97-122) in
each input string are changed to upper case (octet values 65-90).
Care should be taken when using OS-supplied functions to implement
this collation as it is not locale sensitive. Functions such as
strcasecmp and toupper are sometimes locale sensitive and may
inappropriately map lower-case letters other than a-z to upper case.
The i;ascii-casemap collation is well suited to use with many
internet protocols and computer languages. Use with natural language
is often inappropriate: even though the collation apparently supports
languages such as Italian and English, in real-world use it tends to
stumble over words such as "naive", names such as "Llwyd", people and
place names containing non-ASCII, euro and pound sterling symbols,
quotation marks, dashes/hyphens, etc.
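A locale-independent way to implement this collation is an explicit 256-entry translation table. The following is a sketch, not a reference implementation; the function names are hypothetical. It relies on Python comparing bytes objects octet by octet as unsigned values, which is assumed here to match the i;octet ordering.

```python
# Map octets 97-122 (a-z) to 65-90 (A-Z); every other octet is unchanged.
_CASEMAP = bytes(o - 32 if 97 <= o <= 122 else o for o in range(256))


def ascii_casemap_key(s):
    """Upper-case only the US-ASCII letters of an octet string."""
    return s.translate(_CASEMAP)


def ascii_casemap_order(a, b):
    """Ordering operation: returns "less", "equal" or "greater"."""
    ka, kb = ascii_casemap_key(a), ascii_casemap_key(b)
    if ka < kb:
        return "less"
    if ka > kb:
        return "greater"
    return "equal"


def ascii_casemap_equal(a, b):
    """Equality operation: returns "match" or "no-match"."""
    return "match" if ascii_casemap_key(a) == ascii_casemap_key(b) else "no-match"


def ascii_casemap_substring(needle, haystack):
    """Substring operation: is the first string found in the second?"""
    if ascii_casemap_key(needle) in ascii_casemap_key(haystack):
        return "match"
    return "no-match"
```

Because the table touches only octets 97-122, octets such as 0xE9 and 0xC9 remain distinct regardless of the process locale.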
9.2.2. ASCII Casemap Collation Registration
<?xml version='1.0'?>
<!DOCTYPE collation SYSTEM 'collationreg.dtd'>
<collation rfc="XXXX" scope="local" intendedUse="common">
<identifier>i;ascii-casemap</identifier>
<title>ASCII Casemap</title>
<operations>equality order substring</operations>
<specification>RFC XXXX</specification>
<owner>IETF</owner>
<submitter>chris.newman@sun.com</submitter>
</collation>
9.3. Nameprep Collation
9.3.1. Nameprep Collation Description
The "i;nameprep;v=1;uv=3.2" collation is an implementation of the
nameprep [7] specification based on normalization tables from Unicode
version 3.2. This collation applies the nameprep canonicalization
function to both input strings and then returns the result of the
i;octet collation on the canonicalized strings. While this collation
offers all three operations, the ordering operation it provides is
inadequate for use by the majority of the world.
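As a rough sketch of the structure described above (canonicalize both inputs, then compare octets), the following uses the nameprep function from Python's encodings.idna module. Several points are assumptions of the sketch and not statements of this document: that the library function is an adequate stand-in for the tables listed here, that the canonicalized strings are compared as UTF-8 octets, and that input rejected by nameprep yields "undefined". The function names are hypothetical.

```python
from encodings.idna import nameprep


def nameprep_key(s):
    """Canonicalize with nameprep and return the UTF-8 octets.
    Raises UnicodeError for input that nameprep prohibits."""
    return nameprep(s).encode("utf-8")


def nameprep_order(a, b):
    """Ordering operation: "less", "equal", "greater" or "undefined"."""
    try:
        ka, kb = nameprep_key(a), nameprep_key(b)
    except UnicodeError:
        return "undefined"   # assumption: invalid input is undefined
    if ka < kb:
        return "less"
    if ka > kb:
        return "greater"
    return "equal"


def nameprep_equal(a, b):
    """Equality operation: "match", "no-match" or "undefined"."""
    result = nameprep_order(a, b)
    if result == "undefined":
        return result
    return "match" if result == "equal" else "no-match"


def nameprep_substring(needle, haystack):
    """Substring operation on the canonicalized octet strings."""
    try:
        kn, kh = nameprep_key(needle), nameprep_key(haystack)
    except UnicodeError:
        return "undefined"
    return "match" if kn in kh else "no-match"
```

The octet comparison of the canonicalized strings illustrates why the ordering is of limited use for natural language: it reflects code point order, not any language's sorting conventions.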
Version number 1 is applied to nameprep as specified in RFC 3491. If
the nameprep specification is revised without any changes that would
produce different results when given the same pair of input octet
strings, then the version number need not be changed.
The table numbers for tables used by nameprep are as follows:
+--------------+-----------------------+
| Table Number | Table Name |
+--------------+-----------------------+
| 1 | UnicodeData-3.2.0.txt |
| 2 | Table B.1 |
| 3 | Table B.2 |
| 4 | Table C.1.2 |
| 5 | Table C.2.2 |
| 6 | Table C.3 |
| 7 | Table C.4 |
| 8 | Table C.5 |
| 9 | Table C.6 |
| 10 | Table C.7 |
| 11 | Table C.8 |
| 12 | Table C.9 |
+--------------+-----------------------+
9.3.2. Nameprep Collation Registration
<?xml version='1.0'?>
<!DOCTYPE collation SYSTEM 'collationreg.dtd'>
<collation rfc="XXXX" scope="i18n" intendedUse="common">
<identifier>i;nameprep;v=1;uv=3.2</identifier>
<title>Nameprep</title>
<operations>equality order substring</operations>
<specification>RFC XXXX</specification>
<owner>IETF</owner>
<submitter>chris.newman@sun.com</submitter>
<version>1</version>
</collation>
9.4. Octet Collation
9.4.1. Octet Collation Description
The "i;octet" collation is a simple and fast collation intended for
use on binary octet strings rather than on character data. Protocols
that want to make this collation available have to do so by
explicitly allowing it. If not explicitly allowed, it MUST NOT be
used. It never returns an "undefined" result. It provides equality,
substring and ordering operations.
The ordering algorithm is as follows:
1. If both strings are the empty string, return the result "equal".
2. If the first string is empty and the second is not, return the
result "less".
3. If the second string is empty and the first is not, return the
result "greater".
4. If both strings begin with the same octet value, remove the first
octet from both strings and repeat this algorithm from step 1.
5. If the unsigned value (0 to 255) of the first octet of the first
string is less than the unsigned value of the first octet of the
second string, then return "less".
6. If this step is reached, return "greater".
This algorithm is roughly equivalent to the C library function memcmp
with appropriate length checks added.
The equality operation returns "match" if the ordering algorithm would
return "equal". Otherwise the equality operation returns "no-match".
The substring operation returns "match" if the first string is the
empty string, or if there exists a substring of the second string of
length equal to the length of the first string which would result in
a "match" result from the equality operation. Otherwise the substring
operation returns "no-match".
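The three operations can be sketched directly from the steps above. The ordering function follows the numbered steps with an index in place of removing octets; the function names are hypothetical and the code is an illustration, not a normative implementation.

```python
def octet_order(a, b):
    """Ordering operation, following the numbered steps above."""
    i = 0
    while True:
        a_empty = i >= len(a)
        b_empty = i >= len(b)
        if a_empty and b_empty:
            return "equal"        # step 1
        if a_empty:
            return "less"         # step 2
        if b_empty:
            return "greater"      # step 3
        if a[i] == b[i]:
            i += 1                # step 4: drop the first octet of both
            continue
        # Steps 5 and 6: compare unsigned octet values (0 to 255).
        return "less" if a[i] < b[i] else "greater"


def octet_equal(a, b):
    """Equality operation: "match" when the ordering is "equal"."""
    return "match" if octet_order(a, b) == "equal" else "no-match"


def octet_substring(needle, haystack):
    """Substring operation: is the first string found in the second?"""
    n = len(needle)
    if n == 0:
        return "match"
    # Try every substring of the second string with the same length.
    for start in range(len(haystack) - n + 1):
        if octet_equal(needle, haystack[start:start + n]) == "match":
            return "match"
    return "no-match"
```

Indexing a Python bytes object yields integers from 0 to 255, so the comparison in the last step is unsigned, as the algorithm requires.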
9.4.2. Octet Collation Registration
This collation is defined with intendedUse="limited" because it can
only be used by protocols that explicitly allow it.
<?xml version='1.0'?>
<!DOCTYPE collation SYSTEM 'collationreg.dtd'>
<collation rfc="XXXX" scope="i18n" intendedUse="limited">
<identifier>i;octet</identifier>
<title>Octet</title>
<operations>equality order substring</operations>
<specification>RFC XXXX</specification>
<owner>IETF</owner>
<submitter>chris.newman@sun.com</submitter>
</collation>
10. IANA Considerations
Section 7 defines how to register collations with IANA. Section 9
defines a list of predefined collations, which should be registered
when this document is approved and published as an RFC.
11. Security Considerations
Collations will normally be used with UTF-8 strings. Thus the Collations will normally be used with UTF-8 strings. Thus the
security considerations for UTF-8 [3] and stringprep [6] also apply security considerations for UTF-8 [3], stringprep [6] and Unicode
and are normative to this specification. TR-36 [9] also apply and are normative to this specification.
12. Open Issues 12. Acknowledgements
1. Is any Nameprep processing appropriate for the basic collation? The authors want to thank all who have contributed to this document,
Because a result of "0" from an ordering algorithm is including at least John Cowan, Dave Cridland, Mark Davis, Lisa
undesirable, much of the nameprep processing is inappropriate. Dusseault, Frank Ellermann, Philip Guenther, Tony Hansen, Kjetil
Furthermore, a result of "error" which is important for nameprep Torgrim Homme, Michael Kay, Alexey Melnikov, Jim Melton and Abhijit
is generally inappropriate as an internal result in an ordering Menon-Sen.
algorithm since it makes the results less intuitive. The sort
key table also eliminates most problematic characters from 13. Open Issues
consideration if the appropriate collation modifier is used.
Finally, exact compatibility with the Unicode Collation Algorithm When converting this to an RFC, several things must be done: Martin
is deemed desirable by the author, as even the smallest variation Duerst's name request, checking for unfortunate page breaks, adding a
may require implementation of largely duplicate code. However, note to the RFC editor to possibly replace the 3066 reference,
this decision is outside my expertise, so I welcome alternate checking the SP SP "1" SP SP string for correctness.
viewpoints.
Why no comments from anyone in the second half of the alphabet?
2. The ICU implementation of the UCA algorithm includes additional
algorithmic customizations such as the ability to be 14. Change Log
case-sensitive while at the same time being insensitive to
accents. Should these customizations be added to this 14.1. Changes From -12
specification? 1. Remove i;basic, to publish it as a separate RFC. Many documents
are held up by this document, and this document is only held up
3. Should a format for customization data for the basic collation be by i;basic.
defined so that disconnected clients might have the option of 2. Get rid of all the typos I could find.
downloading that information? 3. Specifically note that the "same" substring match need not always
be returned in each of its guises.
4. Need to deal with the concept of "maybe" or "indeterminate"
results from matching or ordering. See what LDAP does as an 14.2. Changes From -11
example. 1. Remove the DTD. Permit well-considered extension of the XML.
Enable the designated expert to block registrations due to
inappropriate or overly aggressive extension.
2. Rename collation names to collation identifiers. Having both
names and titles wasn't good.
3. Removed some open issues after trying to edit, and deciding that
the existing text was good.
4. Note that in Sieve, invalid strings sort after valid ones.
5. Make i;ascii-numeric as in RFC2244. The task of this document is
to establish the registry, not change existing collations.
14.3. Changes From -10
1. Updated contact details for Martin Duerst.
2. Various textual improvements.
3. The registration's file name now has a mandatory .xml extension.
4. Removed binding MUST for Sieve; it's more appropriate to put that
in 3028bis.
5. Syntax fix in registration example.
6. When there are multiple specifications, they now act in concert,
so it's possible to have e.g. a main specification and multiple
locale-specific supplements. It is not possible to name multiple
locations for the same specification any more. That'll return as
a comment feature.
7. Hopefully clearer exposition of i;ascii-casemap.
8. The ban on registering octet-based collations is lifted. One
hopes that the collation mailing list will present a suitable
threshold - not too high, not too low.
9. The DTD is published where IE can see it while looking at the
registrations.
14.4. Changes From -09
1. Rename "error" to "undefined", as suggested by Mark Davis. The
new name makes for nicer prose IMO.
2. 7b=7 according to i;ascii-numeric. ACAP/Sieve need it.
3. Clarified that even though the collation specification returns a
list of substrings, the protocol/server need not use all of that
information. (As indeed IMAP SEARCH does not.)
4. Registrations go directly to the collation list _and_ to the
IANA, not to the IANA and from there forwarded to designated
expert.
5. Added an acknowledgements list and populated it with a quick grep
from my mailbox and memory. Surely incomplete.
6. Noted that in sieve, "no-match" and "undefined" must be treated
in the same way by the engine.
7. Finish the rename from canonical to sort key.
8. Don't fall back to i;octet from any other collation. Return
undefined instead. Note that protocols may fall back to i;octet
to provide total ordering, if necessary.
9. Call the things operations everywhere, not operators/operations.
14.5. Changes From -08
1. i;ascii-casemap instead of en;ascii-casemap.
2. UCA v 14. Changing to "latest version of UCA" was suggested,
but rejected since IETF standards reference stable
specifications, and "latest" is a moving target.
3. Removed all text on multi-valued attributes. Can be added once
there is a concrete need for it, either in an update to this
document or in the protocol that needs it.
4. "Collations MUST specify the canonicalization". Well, the UCA
doesn't, so I changed that to a MAY.
5. Add some text explaining why one might want to download tables.
6. Changed the remaining instances of "canonicalization" to talk
about sort keys. Added a note that a collation's sort key need
not be valid input to the same collation.
7. Reserve the word "default" and use it to name a protocol's
default collation, provided that protocol has a default
collation. In earlier versions of the draft, "*" was used to
name the default collation, but "*" also was implicitly defined
as the most general collation available.
8. Reinstate the different-length example of substring match.
Explain what an overlapping match is, by the canonical example.
9. Avoid the word "contain" when talking about substring matches.
Fewer terms is better.
10. Until -07, both a collation and equality/substring/sort were
called functions. In -07, the trio was renamed as operations.
Now, the DTD is updated to match.
11. Appeals go to the Apps AD before the general AD, as suggested by
Spencer Dawkins.
14.6. Changes From -06
1. Clarified equality and identity: equality is as defined by a
collation, identity is stronger.
2. Added reference to
http://www.unicode.org/reports/tr10/#Searching.
3. Don't describe sort keys as a canonical representation of the
string.
4. Permit disconnected clients to use wildcards. (A disconnected
client has to resolve the wildcard itself, in the same way that a
server would.)
5. Change collation-wild to have the same length limit as collation.
6. Change to use "less" instead of "-1", etc., and specify that it's
just phrasing, not specification.
7. Don't describe the equality, substring and ordering operations as
functions. The definition of collation uses the word function
about the collation itself. A function that has three functions?
Something has to give.
8. Strike a requirement that selecting '*' is the same as not
selecting any collation. It restricted the protocol's default
too much. Existing code wasn't listening.
9. Left out the canonicalization/sort keys.
14.7. Changes From -05
1. Added definitions of client, server and protocol, and prose to
specify that while the IANA registrations of collations are
written in terms of octet strings, implementations may do it
differently.
2. Changed the wording for ascii-numeric to treat the numbers as
numbers, etc.
3. Added explicit property requirements for the three functions,
e.g. that equality be symmetric. Added requirements that the
three functions be consistent, and that if any operations are
present, equality must be (needed for consistency).
4. Random editing, e.g. changing 'numbers' for ascii-numeric to
'integer numbers'.
5. Gave IMAP/SORT/COMPARATOR the same grandfather treatment as ACAP
and SIEVE.
14.8. Changes From -04
Grammar and clarity changes only. One (weak) example added. No
substantive changes.
14.9. Changes From -03
(This does not include all changes made.)
1. Checked and resolved most issues marked 'check whether this is
true' or similar.
2. Resolved nameprep issue: No.
3. Removed NULL for compatibility with existing collations (IMAP
SORT, Sieve).
4. There can be multiple owners and submitters. Say how.
5. Added a requirement that common collations must now be
interoperable. Insufficiently detailed specs cannot be "common".
6. Added a guideline that the operations provided by new collations
should be reminiscent of similar operations on existing
collations.
14.10. Changes From -02
1. Changed from data being octet sequences (in UTF-8) to data being
character sequences (with octet collation as an exception).
2. Made XML format description much more structured.
3. Changed <submittor> to <submitter>, because this spelling is much
more common.
4. Defined 'protocol' to include query languages.
5. Reorganized document, in particular IANA considerations section
(which newly is just a list of pointers).
6. Added subsections, and a 'Structure of this Document' section.
7. Updated references.
8. Created a 'Change Log' chapter, with sections for each draft.
9. Reduced 'Open issues' section, open issues are now maintained at
http://www.w3.org/2004/08/ietf-collation.
13. Changes From -00 14.11. Changes From -01
Add IANA comment to open issues. Otherwise this is just a re-publish
to keep the document alive.
14.12. Changes From -00
1. Replaced the term comparator with collation. While comparator is 1. Replaced the term comparator with collation. While comparator is
somewhat more precise because these abstract functions are used somewhat more precise because these abstract functions are used
for matching as well as ordering, collation is the term used by for matching as well as ordering, collation is the term used by
other parts of the industry. Thus I have changed the name to other parts of the industry. Thus I have changed the name to
collation for consistency. collation for consistency.
2. Remove all modifiers to the basic collation except for the 2. Remove all modifiers to the basic collation except for the
customization and the match rules. The other behavior customization and the match rules. The other behavior
modifications can be specified in a customization of the modifications can be specified in a customization of the
collation. collation.
3. Use ";" instead of "-" as delimiter between parameters to make 3. Use ";" instead of "-" as delimiter between parameters to make
names more URL-ish. names more URL-ish.
4. Add URL form for comparator reference. 4. Add URL form for comparator reference.
5. Switched registration template to use XML document. 5. Switched registration template to use XML document.
6. Added a number of useful registration template elements related 6. Added a number of useful registration template elements related
to the Unicode Collation Algorithm. to the Unicode Collation Algorithm.
7. Switched language from "custom" to "tailor" to match UCA language 7. Switched language from "custom" to "tailor" to match UCA language
for tailoring of the collation algorithm. for tailoring of the collation algorithm.
Normative References 15. References
15.1. Normative References
[1] Bradner, S., "Key words for use in RFCs to Indicate Requirement [1] Bradner, S., "Key words for use in RFCs to Indicate Requirement
Levels", BCP 14, RFC 2119, March 1997. Levels", BCP 14, RFC 2119, March 1997.
[2] Crocker, D. and P. Overell, "Augmented BNF for Syntax [2] Crocker, D. and P. Overell, "Augmented BNF for Syntax
Specifications: ABNF", RFC 2234, November 1997. Specifications: ABNF", RFC 4234, October 2005.
[3] Yergeau, F., "UTF-8, a transformation format of ISO 10646", RFC [3] Yergeau, F., "UTF-8, a transformation format of ISO 10646",
2279, January 1998. STD 63, RFC 3629, November 2003.
[4] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform Resource [4] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
Identifiers (URI): Generic Syntax", RFC 2396, August 1998. Resource Identifier (URI): Generic Syntax", RFC 3986,
January 2005.
[5] Alvestrand, H., "Tags for the Identification of Languages", BCP [5] Alvestrand, H., "Tags for the Identification of Languages",
47, RFC 3066, January 2001. BCP 47, RFC 3066, January 2001.
[6] Hoffman, P. and M. Blanchet, "Preparation of Internationalized [6] Hoffman, P. and M. Blanchet, "Preparation of Internationalized
Strings ("stringprep")", RFC 3454, December 2002. Strings ("stringprep")", RFC 3454, December 2002.
[7] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep Profile for [7] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep Profile for
Internationalized Domain Names (IDN)", RFC 3491, March 2003. Internationalized Domain Names (IDN)", RFC 3491, March 2003.
[8] Davis, M. and K. Whistler, "Unicode Collation Algorithm version [8] Davis, M. and K. Whistler, "Unicode Collation Algorithm version
9", July 2002, <http://www.unicode.org/reports/tr10/ 14", May 2005,
tr10-9.html>. <http://www.unicode.org/reports/tr10/tr10-14.html>.
[9] Davis, M. and M. Suignard, "Unicode Security Considerations",
February 2006, <http://www.unicode.org/reports/tr36/>.
Informative References 15.2. Informative References
[9] Freed, N. and N. Borenstein, "Multipurpose Internet Mail [10] Freed, N. and N. Borenstein, "Multipurpose Internet Mail
Extensions (MIME) Part One: Format of Internet Message Bodies", Extensions (MIME) Part One: Format of Internet Message Bodies",
RFC 2045, November 1996. RFC 2045, November 1996.
[10] Myers, J., "Simple Authentication and Security Layer (SASL)", [11] Myers, J., "Simple Authentication and Security Layer (SASL)",
RFC 2222, October 1997. RFC 2222, October 1997.
[11] Newman, C. and J. Myers, "ACAP -- Application Configuration [12] Newman, C. and J. Myers, "ACAP -- Application Configuration
Access Protocol", RFC 2244, November 1997. Access Protocol", RFC 2244, November 1997.
[12] Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA
Considerations Section in RFCs", BCP 26, RFC 2434, October
1998.
[13] Resnick, P., "Internet Message Format", RFC 2822, April 2001. [13] Resnick, P., "Internet Message Format", RFC 2822, April 2001.
[14] Freed, N. and J. Postel, "IANA Charset Registration [14] Freed, N. and J. Postel, "IANA Charset Registration
Procedures", BCP 19, RFC 2978, October 2000. Procedures", BCP 19, RFC 2978, October 2000.
[15] Showalter, T., "Sieve: A Mail Filtering Language", RFC 3028, [15] Showalter, T., "Sieve: A Mail Filtering Language", RFC 3028,
January 2001. January 2001.
URIs [16] Crispin, M., "Internet Message Access Protocol - Version
4rev1", RFC 3501, March 2003.
[16] <http://www.unicode.org/Public/3.2-Update/ [17] Crispin, M. and K. Murchison, "Internet Message Access Protocol
UnicodeData-3.2.0.txt> - Sort and Thread Extensions", draft-ietf-imapext-sort-17.txt
(work in progress), May 2004.
[18] Newman, C. and A. Gulbrandsen, "Internet Message Access
Protocol Internationalization", draft-ietf-imapext-i18n-06.txt
(work in progress), January 2006.
[17] <http://www.unicode.org/Public/3.2-Update/PropList-3.2.0.txt> Authors' Addresses
[18] <http://www.unicode.org/reports/tr10/allkeys-3.1.1.txt>
Author's Address
Chris Newman Chris Newman
Sun Microsystems Sun Microsystems
1050 Lakes Drive 1050 Lakes Drive
West Covina, CA 91790 West Covina, CA 91790
US US
EMail: chris.newman@sun.com Email: chris.newman@sun.com
Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever
possible, for example as "D&#252;rst" in XML and HTML.)
Aoyama Gakuin University
5-10-1 Fuchinobe
Sagamihara, Kanagawa 229-8558
Japan
Phone: +81 42 759 6329
Fax: +81 42 759 6495
Email: mailto:duerst@it.aoyama.ac.jp
URI: http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/
Arnt Gulbrandsen
Oryx Mail Systems GmbH
Schweppermannstr. 8
Munich 81671
Germany
Phone: +49 89 4502 9757
Fax: +49 89 4502 9758
Email: mailto:arnt@oryx.com
URI: http://www.oryx.com/arnt/
Intellectual Property Statement Intellectual Property Statement
The IETF takes no position regarding the validity or scope of any The IETF takes no position regarding the validity or scope of any
intellectual property or other rights that might be claimed to Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights this document or the extent to which any license under such rights
might or might not be available; neither does it represent that it might or might not be available; nor does it represent that it has
has made any effort to identify any such rights. Information on the made any independent effort to identify any such rights. Information
IETF's procedures with respect to rights in standards-track and on the procedures with respect to rights in RFC documents can be
standards-related documentation can be found in BCP-11. Copies of found in BCP 78 and BCP 79.
claims of rights made available for publication and any assurances of
licenses to be made available, or the result of an attempt made to Copies of IPR disclosures made to the IETF Secretariat and any
obtain a general license or permission for the use of such assurances of licenses to be made available, or the result of an
proprietary rights by implementors or users of this specification can attempt made to obtain a general license or permission for the use of
be obtained from the IETF Secretariat. such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository at
http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary copyrights, patents or patent applications, or other proprietary
rights which may cover technology that may be required to practice rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF Executive this standard. Please address the information to the IETF at
Director. ietf-ipr@ietf.org.
Full Copyright Statement Disclaimer of Validity
Copyright (C) The Internet Society (2003). All Rights Reserved. This document and the information contained herein are provided on an
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
This document and translations of it may be copied and furnished to OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
others, and derivative works that comment on or otherwise explain it ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
or assist in its implementation may be prepared, copied, published INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
and distributed, in whole or in part, without restriction of any INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
kind, provided that the above copyright notice and this paragraph are WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing Copyright Statement
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of Copyright (C) The Internet Society (2006). This document is subject
developing Internet standards in which case the procedures for to the rights, licenses and restrictions contained in BCP 78, and
copyrights defined in the Internet Standards process must be except as set forth therein, the authors retain all their rights.
followed, or as required to translate it into languages other than
English.
The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assignees.
This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Acknowledgment Acknowledgment
Funding for the RFC Editor function is currently provided by the Funding for the RFC Editor function is currently provided by the
Internet Society. Internet Society.
 End of changes. 130 change blocks. 
819 lines changed or deleted 1082 lines changed or added
