Thesaurus alignment task

Lorna Balkan

We have just finished our initial task of aligning HASSET and ELSST, so that we can now move forward with updating ELSST.

Aims

The original aim of the alignment task was to bring ELSST (or, more precisely, its English source-language version) up to date with HASSET. ELSST is derived from HASSET, but over the years the two have grown further apart because HASSET has developed at a faster rate than ELSST.

We initially thought that we could merge ELSST and HASSET. ELSST would in effect be a subset of HASSET: all ELSST concepts would be in HASSET, and all shared concepts would be identical. As the alignment work progressed, however, we came to the conclusion that this goal was too restrictive for both thesauri, and that we must allow for divergences between them.

Methodology

First, we drew up a table of inconsistencies, arranged by concept, that showed where concepts and their associated metadata differed between the two thesauri. Our main focus was on terms and relationships that were in ELSST but not in HASSET. These were analysed by the thesaurus team at the UK Data Archive, and changes were made to HASSET where possible, since these could be implemented immediately. Over seventy changes were made to HASSET as a result.
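A minimal sketch of how such a table of inconsistencies could be produced is given below. It is an illustration only, not the tooling used for the alignment: the in-memory data model, the function name find_inconsistencies, the field codes and the placeholder labels (CONCEPT A, TERM X and so on) are all assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical data model: concept label -> field code -> set of term labels.
Thesaurus = Dict[str, Dict[str, Set[str]]]

# Assumed field codes: UF = non-preferred terms, BT = broader terms,
# NT = narrower terms, RT = related terms.
FIELDS = ("UF", "BT", "NT", "RT")

Row = Tuple[str, str, List[str], List[str]]


def find_inconsistencies(elsst: Thesaurus, hasset: Thesaurus) -> List[Row]:
    """Return rows of (concept, field, only in ELSST, only in HASSET)."""
    rows: List[Row] = []
    for concept in sorted(set(elsst) | set(hasset)):
        # A concept present in only one thesaurus gets a single row.
        if concept not in hasset:
            rows.append((concept, "concept", [concept], []))
            continue
        if concept not in elsst:
            rows.append((concept, "concept", [], [concept]))
            continue
        # For shared concepts, compare each metadata field in turn.
        for field in FIELDS:
            in_elsst = elsst[concept].get(field, set())
            in_hasset = hasset[concept].get(field, set())
            if in_elsst != in_hasset:
                rows.append((
                    concept,
                    field,
                    sorted(in_elsst - in_hasset),
                    sorted(in_hasset - in_elsst),
                ))
    return rows


# Placeholder data: TERM X is non-preferred in ELSST but a narrower
# preferred term in HASSET.
elsst = {
    "CONCEPT A": {"UF": {"TERM X"}, "BT": {"CONCEPT B"}},
    "CONCEPT B": {},
}
hasset = {
    "CONCEPT A": {"NT": {"TERM X"}, "BT": {"CONCEPT B"}},
    "CONCEPT B": {},
    "TERM X": {"BT": {"CONCEPT A"}},
}

rows = find_inconsistencies(elsst, hasset)
for row in rows:
    print(row)
```

Arranging the rows by concept, as here, keeps all the differences for one concept together, which suits a concept-by-concept review by a human editor.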

Other changes to ELSST were proposed and sent to our ELSST partners for approval. Many of the suggestions involved deleting non-preferred terms in ELSST that exist as narrower preferred terms in HASSET, since our original architecture would not have permitted a term to have a different status in each thesaurus. However, from the feedback we received, it became clear that these non-preferred terms are often useful in ELSST. Concept identity also proved challenging for concepts relating to systems, such as legal or social welfare systems, which differ across countries.

For these reasons, and in order to preserve the integrity of each thesaurus, we needed to revise our goal of absolute concept identity and, instead, embrace the idea that shared concepts in the two thesauri may be allowed to diverge in certain respects. The goal shifted from seeking identity of concepts to identifying cases where the two thesauri may legitimately differ, and mapping these differences. This is ongoing work.

Future planning

Our goal remains, however, to keep the two thesauri identical as far as possible. For example, their higher-level hierarchical structures will be identical. Going forward, it will be important to keep track of divergences between HASSET and ELSST, and bringing the two thesauri together into the same database will make this easier. While it may be possible to automate some rules that prohibit certain types of divergence (e.g. a term may not be a broader term (BT) of a term in one thesaurus while at the same time being a related term (RT) of the same term in the other), most decisions about what constitutes a legitimate divergence are likely to be a matter of human judgement.
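The BT/RT rule mentioned above is the kind of check that could be automated. The following is a rough sketch under stated assumptions, not a description of any existing system: the Relations structure, the function name bt_rt_conflicts and the placeholder concept labels are all hypothetical.

```python
from typing import FrozenSet, List, NamedTuple, Set, Tuple


class Relations(NamedTuple):
    """Hypothetical container for the relationships of one thesaurus."""
    hierarchical: Set[Tuple[str, str]]  # (broader term, narrower term)
    related: Set[FrozenSet[str]]        # unordered RT pairs


def bt_rt_conflicts(hasset: Relations, elsst: Relations) -> List[Tuple[str, str, str]]:
    """Find pairs that are BT/NT in one thesaurus but RT in the other.

    Each result is (broader, narrower, name of the thesaurus holding
    the hierarchical relationship).
    """
    conflicts: List[Tuple[str, str, str]] = []
    # Check both directions: HASSET hierarchy against ELSST related terms,
    # then ELSST hierarchy against HASSET related terms.
    for name, source, other in (("HASSET", hasset, elsst), ("ELSST", elsst, hasset)):
        for broader, narrower in sorted(source.hierarchical):
            if frozenset((broader, narrower)) in other.related:
                conflicts.append((broader, narrower, name))
    return conflicts


# Placeholder data: CONCEPT A is a BT of CONCEPT B in HASSET,
# but the same two concepts are only RTs in ELSST.
hasset_rel = Relations(
    hierarchical={("CONCEPT A", "CONCEPT B"), ("CONCEPT A", "CONCEPT C")},
    related=set(),
)
elsst_rel = Relations(
    hierarchical={("CONCEPT A", "CONCEPT C")},
    related={frozenset(("CONCEPT A", "CONCEPT B"))},
)

conflicts = bt_rt_conflicts(hasset_rel, elsst_rel)
print(conflicts)
```

A check like this only flags candidate conflicts. Whether a flagged pair is an error or a legitimate divergence would still be decided by an editor.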
