Xesam End User Search Language

Introduction

This is a proposal for an end-user search language for desktop applications, not a full-fledged query language. Since the language targets end users, it should be kept as simple as possible.

Nested queries such as hello and (world or internet) are deliberately not allowed. I claim that only a very limited user base would ever want them, and usability studies support this: users simply start a new query instead of adding logical operations. That said, nothing in the spec prevents a search engine from supporting nesting.

The language is designed as an extended synthesis of Apple's Spotlight and Google's search languages.

Goals

Specification

A query written in the End User Search Language is a valid UTF-8 string conforming to the structure defined by the following diagram (svg source). Note that the diagram provides a rigorous definition of the language; details follow.

http://grillbar.org/xesam/end-user-sl.png

Word, Select, and Phrase are commonly referred to as Terms.

If multiple terms are provided the default operation is to AND them together.

Old versions of language structure drawings: png and svg

Word

A string without white spaces (i.e. a word).

Select

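A select term is a tuple <keyword><relation>, followed by the value to compare against (as in type:music or creator="Jimi Hendrix" in the examples at the end of this page). A keyword is a mapping from a word to a set of metadata fields to search. The relation can be surrounded by any amount of whitespace on each side. For possible relations see below.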

The keyword mapping is constructed via the xesam ontology (or other installed ontologies) plus some additional aliases. By default a keyword will match the corresponding entry in the xesam ontology without the xesam namespace. Ie

If there is no such field in the ontology, other ontologies may be searched at the discretion of the search engine. To supplement this mapping and make the language more user friendly, a set of aliases is provided. They include the following:

Alias     Searched fields
ext       xesam:fileExtension
format    xesam:mimeType
mime      xesam:mimeType
tag       xesam:userKeyword
type      Special, see below. Match content or source type

The keyword-to-field-name mapping should be case insensitive; thus the keyword usercomment should match the field name xesam:userComment.
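The keyword lookup described above can be sketched roughly as follows. This is only an illustration: KNOWN_FIELDS is a hypothetical, hand-picked subset of field names mentioned on this page (a real engine would consult the installed ontologies), and checking the ontology before the aliases is an assumption, since the text does not fix an order.

```python
# Aliases listed in the table above. "type" is special and is handled by
# the type selector, so it is left out here.
ALIASES = {
    "ext": ["xesam:fileExtension"],
    "format": ["xesam:mimeType"],
    "mime": ["xesam:mimeType"],
    "tag": ["xesam:userKeyword"],
}

# Hypothetical subset of ontology fields, for illustration only.
KNOWN_FIELDS = [
    "xesam:userComment",
    "xesam:creator",
    "xesam:fileExtension",
    "xesam:mimeType",
    "xesam:userKeyword",
]


def resolve_keyword(keyword):
    """Return the list of fields a keyword searches, or [] if it is unknown."""
    key = keyword.lower()
    # Compare against the field name without its namespace, ignoring case.
    fields = [f for f in KNOWN_FIELDS if f.split(":", 1)[1].lower() == key]
    if fields:
        return fields
    # Fall back to the alias table (assumed order: ontology first, then aliases).
    return list(ALIASES.get(key, []))
```

With this sketch, resolve_keyword("usercomment") and resolve_keyword("UserComment") both yield xesam:userComment, while resolve_keyword("ext") falls through to the alias table.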

The relation is a comparison operator. The following are allowed:

Relation  Description
=         Equality. Case insensitive on strings.
:         Containment. The value is contained in the field(s) the keyword maps to.
<=        Only well defined for dates and integers/floats. Undefined (but allowed) on strings.
>=        Same restrictions as <=.
<         Same restrictions as <=.
>         Same restrictions as <=.

(as noted above, relations may be surrounded by any number of white spaces on each side)
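As a rough illustration of how a single select term could be split into its parts, the sketch below uses one regular expression. It assumes the term has already been isolated from the rest of the query, that keywords consist of word characters only, and that the value is whatever follows the relation; none of this is taken from the diagram. The function name parse_select and the keyword size in the usage note are hypothetical.

```python
import re

# The two-character relations come first in the alternation so that "<="
# is not read as "<" followed by a value starting with "=".
SELECT_RE = re.compile(r'^\s*(\w+)\s*(<=|>=|=|:|<|>)\s*(.+?)\s*$')


def parse_select(term):
    """Split a select term into (keyword, relation, value), or return None."""
    match = SELECT_RE.match(term)
    if match is None:
        return None
    keyword, relation, value = match.groups()
    # Keywords are matched case insensitively, so normalise them here.
    return keyword.lower(), relation, value
```

For instance, parse_select("type : music") gives ("type", ":", "music"), and parse_select("size<=100") gives ("size", "<=", "100"). A quoted value is returned verbatim, including its quotes.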

Phrase

Any string enclosed in quotes. You can append modifiers immediately after the final quote. A modifier is a single letter, and you can list any number of modifiers.
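A minimal sketch of separating a phrase from its trailing modifiers is given below. It assumes double quotes, no escaped quotes inside the phrase, and modifiers drawn from ASCII letters; the function name parse_phrase is hypothetical.

```python
import re

# A quoted string followed directly by zero or more single-letter modifiers.
PHRASE_RE = re.compile(r'^"([^"]*)"([A-Za-z]*)$')


def parse_phrase(term):
    """Return (text, set_of_modifiers) for a phrase term, or None."""
    match = PHRASE_RE.match(term)
    if match is None:
        return None
    text, modifiers = match.groups()
    # Each modifier is one letter; the case of the letter is significant.
    return text, set(modifiers)
```

Unknown letters are kept in the returned set rather than rejected, which leaves it to the caller to ignore modifiers it does not support.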

Modifiers do not have to be respected, but must not cause parse errors. They are an optional extension. If a modifier is unsupported, it is up to the service implementation to ignore it or handle it on a best-effort basis. The following query should match any object with the words hello, world, and printf, case sensitively, within ten words of each other:

"hello world printf"cp

Some search engines take a parameter for things like fuzziness, but such parameters can't be tweaked from the xesam search language; the search engine should use sane default values where needed.

With some modifiers the phrase is not treated as a phrase as such, merely as a sequence of words (as in the example above). This is indicated in the Input column.

Modifier  Input    Description
b         phrase   Boost. Any match on the phrase should boost the score of the hit significantly.
c         phrase   Case sensitive.
C         phrase   Case insensitive.
d         phrase   Diacritic sensitive.
D         phrase   Diacritic insensitive.
e         phrase   Exact match. Short for cdl.
f         phrase   Fuzzy search.
l         phrase   Don't do stemming.
L         phrase   Do stemming.
o         words    Ordered words. The words in the string should appear in order, but not necessarily next to each other.
p         words    Proximity search. The words in the string should appear close to each other (suggested default: 10).
r         special  The phrase is a regular expression.
s         words    Sloppy search. Not all words need to match (suggested default slack: floor(sqrt(num_words))).
w         words    Word based matching. Match words inside other strings if there is some meaningful word separation. For example, "case"w matches CamelCase.
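Two details of the modifier list lend themselves to a short sketch: expanding the shorthand e into c, d and l, and computing the suggested default slack for sloppy search. Both function names are hypothetical, and reading the slack as "the number of words that may fail to match" is an interpretation rather than something the text states outright.

```python
import math


def expand_modifiers(modifiers):
    """Replace the shorthand 'e' with 'c', 'd' and 'l'."""
    expanded = set(modifiers)
    if "e" in expanded:
        expanded.discard("e")
        expanded.update("cdl")
    return expanded


def default_slack(phrase):
    """Suggested default slack for a sloppy ('s') search.

    math.isqrt(n) equals floor(sqrt(n)) for non-negative integers.
    """
    num_words = len(phrase.split())
    return math.isqrt(num_words)
```

Under that reading, a three-word phrase such as hello world printf gets a slack of 1, so a sloppy search would accept hits containing any two of the three words. Conflicting modifiers such as c together with C are not resolved here.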

Collectors

Collector    Representations
Logical AND  AND, and, &&
Logical OR   OR, or, ||

+ and -

Since terms are ANDed together by default, + is ignored. - means "AND NOT".
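The effect of + and - on plain word terms can be sketched as follows. The sketch assumes the sign is written directly in front of the term and that documents are available as lists of words; both are simplifications for illustration, and the function names are hypothetical.

```python
def split_signed_words(terms):
    """Split word terms into (required, excluded) lists."""
    required, excluded = [], []
    for term in terms:
        if term.startswith("-") and len(term) > 1:
            excluded.append(term[1:])   # "-" means AND NOT
        elif term.startswith("+") and len(term) > 1:
            required.append(term[1:])   # "+" adds nothing under default AND
        else:
            required.append(term)
    return required, excluded


def matches(terms, document_words):
    """True if all required words occur and no excluded word occurs."""
    required, excluded = split_signed_words(terms)
    words = {w.lower() for w in document_words}
    return (all(t.lower() in words for t in required)
            and not any(t.lower() in words for t in excluded))
```

For example, the terms hello and -internet match a document containing Hello and World, but not one that also contains internet.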

The Type Selector

The value of the type selector indicates what types of items the search should include. The value is matched against a xesam category, i.e. both source and content categories are allowed. As with other keywords, the namespace is omitted and the match should be case insensitive. The default namespace is xesam. For example, to search only in xesam:Audio content, use:

type:audio hendrix

You can also search within a specific source type, here xesam:File:

type:file algorithm

To help users, there is a convenience set of aliases for category values, similar to the aliases for fields:

Category Alias  Real Category
music           xesam:Audio
picture         xesam:Image
attachment      xesam:EmailAttachment

Note: Since the type selector allows you to query either source or content, it has no clean mapping to the XesamQueryLanguage95. You can, however, easily create a clean mapping if the engine supports the category extension. Since this is a very isolated corner case, it is not considered a big problem.
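Resolving the value of a type selector could look roughly like this. KNOWN_CATEGORIES is a hypothetical subset containing only the categories named on this page, and trying the aliases before the category names is an assumption.

```python
# Category aliases from the table in this section.
CATEGORY_ALIASES = {
    "music": "xesam:Audio",
    "picture": "xesam:Image",
    "attachment": "xesam:EmailAttachment",
}

# Hypothetical subset of xesam categories (content and source alike).
KNOWN_CATEGORIES = [
    "xesam:Audio",
    "xesam:Image",
    "xesam:EmailAttachment",
    "xesam:File",
]


def resolve_category(value):
    """Map a type selector value to a category name, or None if unknown."""
    key = value.lower()
    if key in CATEGORY_ALIASES:
        return CATEGORY_ALIASES[key]
    for category in KNOWN_CATEGORIES:
        # Compare without the namespace prefix, ignoring case.
        if category.split(":", 1)[1].lower() == key:
            return category
    return None
```

Both type:music and type:audio would then resolve to xesam:Audio, and type:file to xesam:File.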

Examples

Match any document containing the words "hello" and "world" disregarding letter casing:

hello world

Match any document containing "hello world" as one string, disregarding letter casing:

"hello world"

Match any document that contains the words "hello" and "world" close to each other, in that order, and taking letter casing into account:

"hello world"cpo

Match any document of type "music" (which maps to xesam:Audio via the category aliases) where the contents of xesam:creator, or any child field thereof, match the string "Jimi Hendrix" disregarding letter casing:

type:music creator="Jimi Hendrix"

Or, alternatively, with spacing around the relations of the select terms:

type : music creator ="Jimi Hendrix"

Find all images that have "flower" somewhere in their keywords (for example "flower-red") and that match a full-text search on "africa":

type:image tag:flower africa
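To make the examples above concrete, here is a rough tokenizer sketch that splits a query into select, phrase, collector and word terms. It approximates the language from the prose and examples on this page and is not derived from the diagram, so it should not be read as a reference parser. It assumes double-quoted phrases without escaped quotes and keywords made of word characters; the names TERM_RE, COLLECTORS and tokenize are hypothetical.

```python
import re

# Alternatives are tried in order: select terms first (they may contain
# spaces around the relation), then phrases, then anything else as a word.
TERM_RE = re.compile(r'''
    (?P<select>\w+\s*(?:<=|>=|[=:<>])\s*(?:"[^"]*"[A-Za-z]*|\S+))
  | (?P<phrase>"[^"]*"[A-Za-z]*)
  | (?P<word>\S+)
''', re.VERBOSE)

# Collector spellings from the Collectors section.
COLLECTORS = {"AND", "and", "&&", "OR", "or", "||"}


def tokenize(query):
    """Return a list of (kind, text) pairs for the terms in a query."""
    tokens = []
    for match in TERM_RE.finditer(query):
        kind, text = match.lastgroup, match.group()
        if kind == "word" and text in COLLECTORS:
            kind = "collector"
        tokens.append((kind, text))
    return tokens
```

Applied to type : music creator ="Jimi Hendrix", this yields two select terms, and type:image tag:flower africa yields two select terms followed by one word. Terms not separated by a collector would then be ANDed together by the caller.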

XesamUserSearchLanguage95 (last edited 2008-03-27 23:30:09 by MikkelKamstrupErlandsen)