Monday, April 14, 2014

Week 11 - Thesauri, Controlled Vocabularies and Metadata

Websites and intranets, as the names suggest, involve webs and networks of interconnected systems, data and information that interact with one another. Making sense of these systems and their information in isolation can be very tricky, sometimes impossible, even with the use of reductionism. Controlled vocabularies and metadata let the information architect (IA) navigate the network of relationships between these systems. They provide a way to organize knowledge for subsequent retrieval, and they are used in subject indexing schemes, subject headings, thesauri, taxonomies and other knowledge organization systems.

A controlled vocabulary is any defined subset of natural language. It can take the form of a list of equivalent terms (a synonym ring) or a list of preferred terms (an authority file). Controlled vocabulary schemes mandate the use of predefined, authorized terms that have been preselected by the designer of the vocabulary, in contrast to natural language vocabularies, where there is no restriction on the terms used.

Synonym rings connect a set of words that are defined as equivalent for the purposes of data retrieval. When a user enters a search term that belongs to a synonym ring, the search is run against every word in that ring, so the results include matches for all of the equivalent terms as well. These rings can therefore dramatically improve search results by increasing the recall of the search.
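To make the idea concrete, here is a minimal sketch of query expansion with synonym rings. The rings, the documents and the function names are all made up for illustration; a real search engine would do this inside its own query pipeline.

```python
# Hypothetical synonym rings: each set holds terms treated as equivalent.
SYNONYM_RINGS = [
    {"laptop", "notebook", "portable computer"},
    {"couch", "sofa", "settee"},
]

# Hypothetical document titles to search.
DOCUMENTS = [
    "Sofa care and cleaning guide",
    "Choosing a couch for a small room",
    "Notebook battery tips",
]


def expand_query(term):
    """Return the term plus every equivalent term from its ring, if it has one."""
    term = term.lower()
    for ring in SYNONYM_RINGS:
        if term in ring:
            return set(ring)
    return {term}


def search(term, documents):
    """Return the documents that mention any term in the expanded query."""
    terms = expand_query(term)
    return [doc for doc in documents if any(t in doc.lower() for t in terms)]


print(search("sofa", DOCUMENTS))
```

A search for "sofa" also returns the document that only mentions "couch", which is the gain in recall described above. The trade-off is that precision can drop if the terms in a ring are not truly equivalent.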

Authority files are lists of preferred terms or accepted values. They help keep systems accurate and consistent by restricting the terms allowed for a given domain. An authority file can include a synonym ring in which one of the words has been selected as the preferred term. These files are useful for indexes because they ensure that information that could belong under several similar terms is categorized under only one, rather than spread over several. They can also guide people toward the preferred term, for example when a variant term in an index is linked to its preferred term.
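A rough sketch of how an authority file might be applied at indexing time is shown below. The variant-to-preferred mapping and the helper names are assumptions for illustration, and skipping unauthorized tags is just one possible policy.

```python
# Hypothetical authority file: variant term (lowercase) -> preferred term.
AUTHORITY_FILE = {
    "usa": "United States",
    "u.s.": "United States",
    "united states of america": "United States",
    "united states": "United States",
    "uk": "United Kingdom",
    "u.k.": "United Kingdom",
    "united kingdom": "United Kingdom",
}


def preferred_term(term):
    """Return the preferred term for a variant, or None if it is not authorized."""
    return AUTHORITY_FILE.get(term.strip().lower())


def build_index(tagged_docs):
    """Group document titles under preferred terms only."""
    index = {}
    for title, tag in tagged_docs:
        term = preferred_term(tag)
        if term is None:
            continue  # unauthorized tags are skipped in this sketch
        index.setdefault(term, []).append(title)
    return index


docs = [("Doc A", "USA"), ("Doc B", "U.S."), ("Doc C", "uk"), ("Doc D", "Mars")]
print(build_index(docs))
```

Documents tagged "USA" and "U.S." end up under the single heading "United States" instead of being spread over two index entries.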

Classification schemes are hierarchical arrangements of preferred terms, also known as taxonomies (hierarchies). These schemes can be used on the front end (such as the category listing shown with the results of a search on Yahoo or Google) or on the back end (such as organizing and indexing the tags used by IAs, authors and architects). There are many schemes that can be used to classify the same information. The choice of scheme depends on its intended application.
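As a small illustration, a taxonomy can be modelled as a nested structure and walked to find where a term sits in the hierarchy. The categories below are invented, and the nested-dictionary layout is only one of several reasonable representations.

```python
# Hypothetical taxonomy: each key is a category, each value its subcategories.
TAXONOMY = {
    "Beverages": {
        "Wine": {"Red": {}, "White": {}, "Sparkling": {}},
        "Coffee": {},
    },
    "Food": {"Cheese": {}},
}


def find_path(term, tree=TAXONOMY, trail=()):
    """Return the path from the root to the term as a list, or None if absent."""
    for name, children in tree.items():
        path = trail + (name,)
        if name == term:
            return list(path)
        found = find_path(term, children, path)
        if found:
            return found
    return None


print(" > ".join(find_path("Red")))
```

The same path could drive a front-end breadcrumb or a back-end rule that files an item under every ancestor category.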

Metadata is data about other data. It can be attached to any sort of media to describe its contents and supply additional information. It is definitional data that provides information about, or documentation of, other data managed within an application, environment or system. Metadata is usually stored behind the scenes. Metadata tags are used to describe documents, pages, images, software, video and audio files, and other content objects for the purposes of improved navigation and retrieval. One example of metadata in use is the meta tags of a web page, which can be used freely to add information describing the page's content. This data can help improve navigation and information retrieval on the page. A controlled vocabulary, as a defined subset of a language, reduces the variability of the expressions used to characterize an item; it can take the form of an authority file or a list of equivalent terms.
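For web pages specifically, the sketch below pulls name/content pairs out of HTML meta elements using Python's standard html.parser module. The sample page and the class and function names are assumptions made for the example.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        # Ignore meta tags without a name, such as <meta charset="utf-8">.
        if name and content is not None:
            self.metadata[name] = content


# Hypothetical page used only for this example.
PAGE = """<html><head>
<title>Wine storage basics</title>
<meta name="description" content="How to store wine at home">
<meta name="keywords" content="wine, storage, cellar">
<meta charset="utf-8">
</head><body>...</body></html>"""


def extract_metadata(html):
    """Return a dict of the named meta tags found in the HTML."""
    parser = MetaTagParser()
    parser.feed(html)
    return parser.metadata


print(extract_metadata(PAGE))
```

None of this text is shown to the reader of the page, which is what "stored behind the scenes" means in practice, yet a search tool can use it for retrieval.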

Thesauri are collections of categorized concepts, denoted by words or phrases, that are related to each other through narrower-term, broader-term and related-term relations. A thesaurus is like a book of synonyms that also includes related and contrasting words and antonyms. Thesauri support synonym management by identifying the preferred term among many variants. They use three semantic relationships: equivalence (like terms), hierarchical (subcategories) and associative (related terms). They come in three forms:
·        Classic: fully functional, used for both indexing and searching.
·        Indexing: used at the point of indexing, so that indexes are built from preferred terms.
·        Searching: used at the point of searching rather than indexing, to manipulate the search being performed. Users may be able to refine their search terms by going narrower or broader.

The IA will need to decide which of the above three forms to use on the site or intranet if a thesaurus is adopted. This decision should be based on how the thesaurus is intended to be used, and it will have major implications for design.

The thesaurus sets itself apart from the simpler controlled vocabularies in its rich array of semantic relationships. These relationships are of three types: equivalence, hierarchical and associative. When a number of terms represent the same concept, the equivalence relationship clarifies which indexing term should be used. The hierarchical relationship indicates the superordination and subordination of each preferred term. It divides the information space into categories and subcategories, relating broader and narrower concepts through the familiar parent-child relationship. The associative relationship links two concepts that do not belong to the same hierarchical structure but have semantic or contextual similarities. It must be made explicit because it suggests to the indexer other terms with connected or similar meanings that could be used for indexing or searching. This relationship is often the trickiest, and by necessity it is usually developed after the IA has made a good start on the other two relationship types. Associative relationships are usually strongly implied semantic connections that aren't captured by the equivalence or hierarchical relationships.
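The three relationship types can be sketched as a small data structure. Everything here (the terms, the VARIANTS and THESAURUS names, the lookup helper) is hypothetical and only meant to show how equivalence, hierarchical and associative links differ.

```python
# Equivalence: hypothetical variant terms (lowercase) mapped to preferred terms.
VARIANTS = {"vino": "Wine", "reds": "Red wine"}

# Hierarchical (broader/narrower) and associative (related) links,
# keyed by preferred term. All entries are invented for illustration.
THESAURUS = {
    "Wine": {
        "broader": ["Beverages"],
        "narrower": ["Red wine", "White wine"],
        "related": ["Vineyards"],
    },
    "Red wine": {
        "broader": ["Wine"],
        "narrower": ["Merlot", "Malbec"],
        "related": ["Wine glasses"],
    },
}


def resolve(term):
    """Equivalence: map a variant to its preferred term, else return it unchanged."""
    return VARIANTS.get(term.lower(), term)


def lookup(term, relation):
    """Return the terms linked to a term by 'broader', 'narrower' or 'related'."""
    entry = THESAURUS.get(resolve(term), {})
    return entry.get(relation, [])


print(resolve("vino"))
print(lookup("vino", "narrower"))
print(lookup("Red wine", "related"))
```

A searching thesaurus would call something like lookup with "narrower" or "broader" to let the user tighten or widen a query, while an indexing thesaurus would mostly rely on resolve.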

Faceted classification is an analytic-synthetic classification scheme. It classifies objects using multiple taxonomies that express their different attributes, or facets, rather than using a single taxonomy. A faceted classification system allows an object to be assigned to multiple taxonomies (sets of attributes), so the classification can be ordered in multiple ways rather than in a single, predetermined taxonomic order. A facet comprises "clearly defined, mutually exclusive, and collectively exhaustive aspects, properties or characteristics of a class or specific subject". For example, a collection of books might be classified using an author facet, a subject facet, a date facet, etc. Faceted classification is used in faceted search systems that let a user navigate information along multiple paths corresponding to different orderings of the facets. This contrasts with traditional taxonomies, in which the hierarchy of categories is fixed and unchanging. In other words, once information is categorized using multiple facets, it can also be retrieved using multiple facets. A user is therefore not restricted to one identifying search term in order to retrieve an item. He or she can use a single term or link together multiple terms, which increases the chances of retrieving the exact information being sought.

A real-life implementation can be seen at http://wine.com, where the wine facets are type (red: merlot, pinot noir, malbec; white: chardonnay, muscadet; sparkling; etc.), region of origin (South African, Argentinian, Californian, Spanish, French, etc.), winery/manufacturer (Clos du Bois, Blackstone, etc.), year (1968, 1996, 2002, 2014, etc.) and price ($5.99, $9.99, $39.99, $156, etc.). This type of classification provides power and flexibility. The interface can be tested and refined over time, while the faceted classification provides an enduring foundation.
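Retrieval by multiple facets can be sketched in a few lines. The wine records below are invented placeholders, not data from wine.com, and filter_by_facets is a hypothetical helper name.

```python
# Hypothetical items, each described by several independent facets.
WINES = [
    {"name": "Wine A", "type": "red", "region": "France", "year": 2002, "price": 39.99},
    {"name": "Wine B", "type": "white", "region": "Argentina", "year": 2014, "price": 9.99},
    {"name": "Wine C", "type": "red", "region": "Argentina", "year": 1996, "price": 156.00},
    {"name": "Wine D", "type": "sparkling", "region": "Spain", "year": 2014, "price": 5.99},
]


def filter_by_facets(items, **selected):
    """Return the items whose facet values match every selected facet."""
    return [
        item
        for item in items
        if all(item.get(facet) == value for facet, value in selected.items())
    ]


# One facet, or several facets combined in any order.
print([w["name"] for w in filter_by_facets(WINES, year=2014)])
print([w["name"] for w in filter_by_facets(WINES, type="red", region="Argentina")])
```

Because no facet is privileged, the same item can be reached by starting from type, region or year, which is the contrast with a single fixed hierarchy.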

The guided navigation model encourages users to refine or narrow their own searches based on metadata fields and values built atop faceted classifications. Guided navigation has become the de facto standard for e-commerce and product-related websites, from big box stores to product review sites. But e-commerce sites aren't the only ones joining the facets club. Other content-heavy sites such as media publishers (e.g. The Financial Times), libraries (such as NCSU Libraries) and even non-profits (Urban Land Institute) are tapping into faceted search to make their often broad range of content more findable. Essentially, guided navigation, or faceted search, has become so ubiquitous that users are not only getting used to it, they are coming to expect it.
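Guided navigation typically shows, next to each facet value, how many results remain if the user picks it. A minimal sketch of that counting step follows; the catalog, brand names and function names are hypothetical.

```python
from collections import Counter

# Hypothetical product catalog with two facets: brand and kind.
CATALOG = [
    {"name": "Camera A", "brand": "Acme", "kind": "compact"},
    {"name": "Camera B", "brand": "Acme", "kind": "dslr"},
    {"name": "Camera C", "brand": "Globex", "kind": "dslr"},
    {"name": "Camera D", "brand": "Globex", "kind": "dslr"},
]


def facet_counts(items, facets):
    """Count how many items carry each value of each facet."""
    return {facet: dict(Counter(item[facet] for item in items)) for facet in facets}


def refine(items, facet, value):
    """Narrow the result set to the items with the chosen facet value."""
    return [item for item in items if item[facet] == value]


# Counts offered to the user before and after choosing kind = dslr.
print(facet_counts(CATALOG, ["brand", "kind"]))
narrowed = refine(CATALOG, "kind", "dslr")
print(facet_counts(narrowed, ["brand"]))
```

Recomputing the counts after every choice is what keeps users from refining their way into an empty result set.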


