Item search behavior using Author API

Introduction

Items you create are saved into your Item bank. To use them in Activities, or to find them again later to modify or review them, you may need to search for the Item. Items can contain a lot of data, but only some of it is unique enough to warrant searching. On this page, we'll discuss the searchable data and the search behavior.

Understanding how reference and title-based searches work is especially useful for searching efficiently.

Search fields

As mentioned above, an Item holds a lot of data, but only some of it is unique enough to use in a search. The searchable fields are:

  • Reference 
  • Title
  • Tags
  • Content
  • Widget (Question/Feature) types
  • Workflow State
  • Status

Reference

References are unique identifiers of Items. Due to this uniqueness, they are considered random and therefore not “human-readable”. Our default Item references are UUIDs - 36-character strings of (almost) guaranteed uniqueness.

Tokens

To support finding Items by reference, we split each reference into many substrings called "tokens" (using an n-gram analyzer).

The length of these tokens is between 4 and 12 characters, inclusive. If you are interested to know why these token lengths were chosen, see the last section on this page.

That means we split any given reference into every possible contiguous substring of length 4 characters, 5 characters, etc., up to 12 characters.

E.g. a LRN_REF_1 reference would generate the following tokens:

['LRN_', 'RN_R', 'N_RE', '_REF', 'REF_', 'EF_1', 'LRN_R', 'RN_RE', 'N_REF', '_REF_', 'REF_1', 'LRN_RE', 'RN_REF', 'N_REF_', '_REF_1', 'LRN_REF', 'RN_REF_', 'N_REF_1', 'LRN_REF_', 'RN_REF_1', 'LRN_REF_1']

So a 9 character reference generates 21 tokens.

A default UUID, e.g. b040fea1-2627-42a7-ad42-2762169eccf1, generates 261 tokens.

We support references up to 150 characters, and such references will generate 1287 tokens.
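The token counts above can be reproduced with a short sketch. The function name reference_tokens is hypothetical and this is not the production analyzer; it only illustrates the splitting rule described here. It counts one token per position, so repeated substrings are not de-duplicated.

```python
def reference_tokens(reference: str, min_len: int = 4, max_len: int = 12) -> list[str]:
    """Return every contiguous substring whose length is min_len..max_len."""
    tokens = []
    for size in range(min_len, max_len + 1):
        # A string of length n has n - size + 1 substrings of a given size.
        for start in range(len(reference) - size + 1):
            tokens.append(reference[start:start + size])
    return tokens


print(reference_tokens("LRN_REF_1")[:6])
# ['LRN_', 'RN_R', 'N_RE', '_REF', 'REF_', 'EF_1']
print(len(reference_tokens("LRN_REF_1")))                             # 21
print(len(reference_tokens("b040fea1-2627-42a7-ad42-2762169eccf1")))  # 261
print(len(reference_tokens("x" * 150)))                               # 1287
```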

Token search

When searching for a reference, the exact search term (i.e. we do not tokenize it) is compared against all of these tokens for each Item. E.g. if the search string is REF_1, this value is looked up in the generated tokens. If it matches one of an Item's tokens, that Item is returned along with any other Items that match.

In this case, the search string can be anywhere in the reference.

Beginning character search (Aka right wildcard search)

At the same time as the token search, we also search for Items whose reference "begins with" the provided search term. This is also known as a right wildcard search. So for LRN, it will find all references that start with this string.

There are 2 reasons we also do this search alongside token search. These are to support searching by:

  1. The entire reference. References are typically 36 characters or longer, i.e. outside our token length range, so a "begins with" search is needed to find them.
  2. Search terms whose length falls outside our token lengths, e.g. a 3-character or a 15-character search string. Such a term must match the beginning of the reference, as we indicate in the UI.
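As a rough illustration of how the two reference searches combine, here is a minimal sketch. The function name reference_matches is hypothetical, and it assumes that an Item is returned when either search matches and that comparisons are case-sensitive; neither detail is confirmed on this page.

```python
def reference_matches(reference: str, term: str,
                      min_len: int = 4, max_len: int = 12) -> bool:
    """Return True if the term would find this reference."""
    # Token search: the untokenized term equals one of the reference's tokens
    # exactly when its length is in range and it occurs somewhere in the reference.
    token_hit = min_len <= len(term) <= max_len and term in reference
    # Right wildcard search: the reference begins with the term (any length).
    prefix_hit = bool(term) and reference.startswith(term)
    return token_hit or prefix_hit


uuid_ref = "b040fea1-2627-42a7-ad42-2762169eccf1"
print(reference_matches("LRN_REF_1", "REF_1"))         # True: matches a token
print(reference_matches("LRN_REF_1", "LRN"))           # True: 3 characters, but a prefix
print(reference_matches("LRN_REF_1", "REF"))           # False: too short and not a prefix
print(reference_matches(uuid_ref, uuid_ref))           # True: full reference via prefix
print(reference_matches(uuid_ref, "ad42-2762169eccf1"))  # False: 17 characters, not a prefix
```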


Title

For more information about Titles, please see the related Help article here.

Titles are intended to be a "human-readable" identifier for Items. They are non-unique and generally do not (and should not) contain special characters.

Tokens - titles

Titles are also split into "tokens", but instead of an n-gram analyzer, they use a language analyzer (as above, titles are meant to be human-readable).

The analyzer splits the title according to English grammar, e.g. math level 3 semester 1 would be split into math, level, 3, semester and 1.

Note 1: Hyphens (-) are treated as spaces too, so a hyphenated title such as math-level-3-semester-1 would generate the same tokens.

Note 2: Periods (.) are slightly more complicated. A title is split on a period only if there is something other than a letter on either side of it. E.g. a.b is not split, whereas 1.a and a.1 are split. Using periods is not recommended though, given that titles are meant to be readable.

Tokens - search term

When the search is performed, the search term is also split up the same way. E.g. if the search term is math semester, this will be split into math and semester.

Token search

The search then performs a logical AND over these tokens: a title must contain both math AND semester. The tokenized Item titles are searched, and as the title in the example above contains both search tokens, its Item is returned.

Beginning character search (Aka right wildcard search)

As we do for references, we also do a right wildcard search for the provided (exact) search term.

This is for a similar reason: tokens may not exist for the provided term. For example, the title may contain periods (.) that were not split on, or the search term may be the exact title.
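The title search described in this section can be approximated as follows. This is only a sketch under stated assumptions: the names tokenize and title_matches are hypothetical, tokens are lowercased and compared case-insensitively (not confirmed on this page), only ASCII letters count as letters, and other language-analyzer behavior is ignored. The period rule follows Note 2 literally.

```python
import re


def tokenize(text: str) -> list[str]:
    """Approximate the title tokenization described above."""
    # Spaces and hyphens both act as separators (Note 1).
    words = re.split(r"[\s\-]+", text.strip().lower())
    tokens = []
    for word in words:
        # Split on a period unless it sits between two letters (Note 2).
        parts = re.split(r"(?<![a-z])\.|\.(?![a-z])", word)
        tokens.extend(part for part in parts if part)
    return tokens


def title_matches(title: str, term: str) -> bool:
    """Return True if the term would find this title."""
    if not term.strip():
        return False
    title_tokens = set(tokenize(title))
    term_tokens = tokenize(term)
    # Token search: every search token must be present (logical AND).
    token_hit = bool(term_tokens) and all(t in title_tokens for t in term_tokens)
    # Right wildcard search: the title begins with the exact search term.
    prefix_hit = title.lower().startswith(term.lower())
    return token_hit or prefix_hit


print(tokenize("math level 3 semester 1"))  # ['math', 'level', '3', 'semester', '1']
print(tokenize("a.b"), tokenize("1.a"))     # ['a.b'] ['1', 'a']
print(title_matches("math level 3 semester 1", "math semester"))  # True
print(title_matches("math level 3 semester 1", "math biology"))   # False
print(title_matches("a.b.c notes", "a.b"))  # True, via the right wildcard search only
```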

Tags

When creating content, Tags can be added to improve searching. See more information about tagging here.

Typing at least 3 characters in a tag parameter field will return Tag suggestions in a dropdown. The suggestions will match the typed characters in any word occurring in the Tag type or Tag name.

When you select a Tag and search, Items containing the Tag (only exact matches!) will be returned. The search behavior differs depending on whether you choose Match all or Match at least one. If Match all is chosen, only Items that contain all the selected Tags will be returned (logical AND). If Match at least one is chosen, Items only have to contain one of the selected Tags (logical OR).

Note: If Tags are provided in the init options for filtering, then both the Tags from the search and the Tags from the init options are used to search. If ALL or EITHER Tags are provided, we also support None via init options - i.e. Items must not contain the provided Tag. See more information here.
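The three Tag matching modes can be pictured with simple set logic. This is a sketch only: the function name item_matches_tags, its parameter names, and the representation of a Tag as a (type, name) pair are all hypothetical, and applying the three conditions together is an assumption rather than documented behavior.

```python
def item_matches_tags(item_tags, match_all=(), match_any=(), match_none=()):
    """Return True if an Item's Tags satisfy all three conditions."""
    tags = set(item_tags)
    # Match all (logical AND): every selected Tag must be on the Item.
    if not set(match_all) <= tags:
        return False
    # Match at least one (logical OR): one selected Tag is enough.
    if match_any and tags.isdisjoint(match_any):
        return False
    # None: the Item must not carry any of these Tags.
    if not tags.isdisjoint(match_none):
        return False
    return True


item = [("Subject", "Math"), ("Grade", "3")]
print(item_matches_tags(item, match_all=[("Subject", "Math"), ("Grade", "3")]))  # True
print(item_matches_tags(item, match_all=[("Subject", "Math"), ("Grade", "4")]))  # False
print(item_matches_tags(item, match_any=[("Grade", "4"), ("Grade", "3")]))       # True
print(item_matches_tags(item, match_any=[("Subject", "Maths")]))  # False: exact matches only
print(item_matches_tags(item, match_none=[("Grade", "3")]))       # False
```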

Content

Content is the most diverse field and contains a lot of unique data. Therefore, when an Item is created, we generate searchable content from the following information only:

Item level data:

  • Acknowledgments
  • Description 
  • Note
  • Source

Widget level data:

  • Stimulus
  • Shared passage header and content
  • Template content (The template field in certain question types, such as token highlight)

Question / Feature Type

Items can be searched for by the Question or Feature type widgets that they contain. I.e. if a user selects a Question/Feature Type as a search parameter, such as MCQ, only Items containing at least one MCQ question will be returned.

Workflow State

For organizations that have the Item Review Workflow enabled, Workflow states are available as search criteria. Selecting a Workflow State will return only those Items which are in the selected state.

Status

Selecting a Status allows a user to filter Items by their status. The statuses are "Published", "Unpublished" and "Archived". 

Appendix

Token lengths

These minimum and maximum token lengths were chosen based on an analysis of references in our database, balancing maximum searchability against storage space.

UUIDs have sections/substrings (split by hyphens) of at least 4 characters, so the majority of references don't contain 3-character sections. If we had supported 3-character tokens, a UUID would generate 295 tokens - an additional 34 tokens per reference - and thus cost more.
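The cost comparison above is plain arithmetic: a reference of length n contains n - k + 1 substrings of length k, summed over every supported token length. A minimal sketch, with the hypothetical helper name token_count:

```python
def token_count(ref_len: int, min_len: int, max_len: int) -> int:
    """Count the tokens generated for a reference of the given length."""
    return sum(max(ref_len - size + 1, 0) for size in range(min_len, max_len + 1))


print(token_count(36, 4, 12))  # 261 tokens with the current 4 to 12 range
print(token_count(36, 3, 12))  # 295 tokens if 3-character tokens were supported
print(token_count(36, 3, 12) - token_count(36, 4, 12))  # 34 extra tokens per UUID
```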

Likewise, for the upper limit, very few references had sections longer than 12 characters, so there is little benefit to generating tokens longer than that.
