Release of ELSST Thesaurus Management System Version 2

Jeannine Beeken

Summary

We are very pleased to announce the release of the Thesaurus Management System Version 2 on 18 December 2018.

After three successful new releases of the ELSST content in September 2016, 2017 and 2018, the next step was to evaluate, revise and enhance all of the functionalities of the thesaurus management system (TMS), the online application and user interface.  Requirements were gathered from translators and administrators via a survey and all users were invited to give their feedback.

Major improvements and enhancements have been made to the ‘search/browse’ functions, the ‘history’ section and the ‘suggestions’ area.  User-oriented improvements include the introduction of 13 language-specific stemmers, search with or without diacritics, additional filter and retrieval options, more ways to sort results, and various export/download options.

Stemmers reduce words to their stem or root form, for example: ‘politics, political, politicians’ are reduced to their stem ‘politic’. Diacritics affect the pronunciation of a word and appear as a mark on letters such as È  Ï  ļ  Ň  Ö  Ž.
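The two ideas above can be illustrated with a minimal Python sketch. It is not the stemmer used by the TMS: the suffix list is a toy chosen only to reproduce the ‘politic’ example, and the function names are hypothetical. Diacritic folding relies on the standard `unicodedata` module.

```python
import unicodedata

# Toy suffix list (an assumption for illustration); a real language-specific
# stemmer applies many more rules than this.
TOY_SUFFIXES = ("ians", "al", "s")


def fold_diacritics(text: str) -> str:
    """Drop combining marks so that 'È' matches 'E', 'Ö' matches 'O', etc."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    lowered = fold_diacritics(word).lower()
    for suffix in TOY_SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered


if __name__ == "__main__":
    print([toy_stem(w) for w in ("politics", "political", "politicians")])
    print(fold_diacritics("È Ï ļ Ň Ö Ž"))
```

Folding diacritics before comparing strings is one simple way to make a search diacritic-insensitive.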

Some of the ELSST functionalities, for example ‘History’ and ‘Suggestions’, are only available to logged-in users.  If you do not have an account, you can create one at https://elsst.ukdataservice.ac.uk/ by choosing Login to ELSST and following the instructions.

More information about the structure of ELSST is available at https://elsst.ukdataservice.ac.uk/elsst-guide/elsst-structure.aspx

More details on the major revisions and enhancements are given below.

Your feedback is very welcome: just send an email to thesaurus@ukdataservice.ac.uk.

 

  1. ELSST Home page

The search function has been improved by adding the Autosuggest function to the Search box.  This function uses language-specific stemmers to find the first 10 Preferred Terms (PTs) followed by the first 10 Use For terms (UFs); the numbers depend on the availability of terms. Selecting one of the terms proposed by Autosuggest leads you directly to the ‘View concept page’ which displays the tree structure together with the language equivalents, scope note, UFs etc. Searches are diacritic-insensitive.

However, you do not have to select one of the suggested terms; you can also type a term and press the GO button.  This leads you to the ‘Thesaurus search’ page, where all terms selected by the language-specific stemmer are listed.
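The ordering of Autosuggest results described above (up to 10 PTs, then up to 10 UFs) can be sketched as follows. This is only an illustration: the function name and parameters are hypothetical, and simple prefix matching stands in for the language-specific stemmers used by the real system.

```python
def autosuggest(query, preferred_terms, use_for_terms, limit=10):
    """Return up to `limit` matching PTs followed by up to `limit` matching UFs."""
    needle = query.strip().lower()
    if not needle:
        return []
    # Assumption: prefix matching replaces the stemmer-based matching.
    pts = [t for t in preferred_terms if t.lower().startswith(needle)][:limit]
    ufs = [t for t in use_for_terms if t.lower().startswith(needle)][:limit]
    # PTs always come first; fewer than `limit` are returned if fewer exist.
    return pts + ufs
```

If fewer than 10 terms of either kind match, the list is simply shorter, which mirrors the remark that the numbers depend on the availability of terms.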

If you have translator rights, you can get access to the Translators area by clicking on the appropriate link. There you will find the Translation guidelines and training material.

 

  2. ELSST search

 ‘Thesaurus search’ page

Previously, you could filter your search by unticking the ‘Preferred Term’ box or the ‘Use for’ box. This function is still valid, and a new function has been added to it.  When you are a logged-in user, you can now broaden your search to the different notes (scope notes, use notes etc.).  For example, searching for the term POLITICS lists about 50 results, which are based on the English stemmer. Broadening your search by ticking ‘Scope note’ lists about 65 results. The additions are listed at the end of the list and include, for example, CONSTITUTIONS, NATIONAL IDENTITY and LOBBYING. Another example is EDUCATION. The following terms were not found by the search for PTs and UFs only: BUSINESS AND ADMINISTRATION STUDIES, SCHOOL-LEAVING, ACADEMIC ABILITY, SECONDARY SCHOOLS, GAP YEAR, SPECIAL NEED STUDENTS etc., the reason being that the terms EDUCATION, EDUCATIONAL etc. are not part of the string.  However, these terms are mentioned in their scope and use notes.
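A minimal sketch of this behaviour, in which hits found only in a scope note are appended after the PT/UF hits, might look like the following. The `Concept` class and `search` function are hypothetical, and substring matching stands in for the stemmer.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    preferred_term: str
    use_for: list = field(default_factory=list)
    scope_note: str = ""


def search(concepts, query, include_scope_note=False):
    """List PT/UF matches first, then matches found only in the scope note."""
    needle = query.lower()
    label_hits, note_hits = [], []
    for concept in concepts:
        labels = [concept.preferred_term, *concept.use_for]
        if any(needle in label.lower() for label in labels):
            label_hits.append(concept.preferred_term)
        elif include_scope_note and needle in concept.scope_note.lower():
            note_hits.append(concept.preferred_term)
    # Additions from the notes are listed at the end of the results.
    return label_hits + note_hits


if __name__ == "__main__":
    # The scope note text below is a placeholder, not the real ELSST note.
    concepts = [
        Concept("GAP YEAR", scope_note="Placeholder note mentioning education."),
        Concept("EDUCATION"),
    ]
    print(search(concepts, "education"))
    print(search(concepts, "education", include_scope_note=True))
```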

Secondly, translators are now able to mark their translations as ‘Untranslatable’ if, as happens rarely, a certain term/concept is not part of their language, although it can be used by most or all of the other 12 ELSST languages. Examples of difficult or impossible translations are RIGHT OF WAY and HOUSE HUSBANDS.

To aid search and browse, users can now list the results according to initial letter, for example E, È, Δ, Ω.

Finally, the languages for which the translation of a specific preferred term has been completed are now displayed in Next version as well as Current version. Hovering over the abbreviations displays the translated term next to the English term.

  ‘View concept’ page

There are also some new features and functionalities on the ‘View concept’ page.

Firstly, the visual graph view of a concept is now available in Next version as well as Current version (accessible to logged-in users only).

Secondly, the landing page contains expanded and collapsed boxes.  Already opened are ‘Preferred term’, ‘Language equivalents’, ‘Use for’ and ‘Scope note’. These fields hold the basic information about a concept.  All of the other fields can be expanded too (click on ‘+’), for example ‘Broader terms’, ‘History note’ and ‘Links’. The links lead to ‘Version History’, i.e. the history of changes made to a term (available to logged-in users only), and to ‘SKOS Concept’ (in Current version only).

 

  3. ELSST history

The history pages contain a list of all the changes made to each of the 13 languages, identifying the concept that changes were made to, the type of changes, by whom and when.  ‘History’ has been revised and improved hugely in order to enable our users and translators to get an overview of all the changes made, for example a list of all new Preferred Terms or a list of all the changes made during a specific period of time. The improved ‘History’ will be especially useful to translators as it will help them to keep their language versions aligned and up-to-date with the English version. However, as before, only logged-in users can consult the History page. The major changes are:

  • A ‘Language’ filter has been introduced enabling the user to select the changes made to a specific language. For example, selecting ‘Finnish’ displays the changes made to Finnish only.
  • Changes made during a specific time period, for example, changes made from 1 February 2018 to 3 March 2018, can be selected by using the ‘Date range’ filter.
  • A new filter ‘Action type’ has been introduced. For example, ticking ‘PT added’ and ‘RT removed’ displays a list of all PTs that have been added and RTs that have been removed, in chronological order.
  • The results page has been revised completely and contains information about ‘PT edited’, ‘Action’, ‘Change’, ‘Language’, ‘Date’, ‘Who’. Users can sort their results according to each column.
    The column ‘PT edited’ identifies the preferred term that changes have been made to, which is empty in the case of a new PT.  ‘Action’ records the changes made (e.g. removing an RT relationship, changing the scope note, adding a UF).  The ‘Change’ column displays the result of the action.
  • Results can be downloaded as an Excel or PDF document; they can also be printed.
  • Hovering over the tooltips displays help information.
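The combination of the ‘Language’, ‘Date range’ and ‘Action type’ filters listed above can be sketched as below. The `Change` record and `filter_history` function are hypothetical names chosen for illustration; the field names loosely follow the columns of the results page.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Change:
    pt_edited: str
    action: str      # e.g. "PT added", "RT removed"
    language: str
    when: date
    who: str


def filter_history(changes, language=None, date_range=None, action_types=None):
    """Apply whichever filters are set and return the hits in chronological order."""
    selected = []
    for change in changes:
        if language is not None and change.language != language:
            continue
        if date_range is not None and not (date_range[0] <= change.when <= date_range[1]):
            continue
        if action_types is not None and change.action not in action_types:
            continue
        selected.append(change)
    return sorted(selected, key=lambda change: change.when)
```

For example, `filter_history(changes, language="Finnish", date_range=(date(2018, 2, 1), date(2018, 3, 3)), action_types={"PT added", "RT removed"})` would correspond to ticking all three filters at once.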

 

  4. ELSST suggestions

‘Suggestions’ page

The ‘Suggestions’ page has been thoroughly revised making it much easier for translators and other users to suggest a change or add a comment to a suggested change. Examples of the types of suggestions are ‘add PT’, ‘remove BT’, ‘edit scope note’. Suggestions can be made for each of the 13 available languages. However, as before, only logged-in users can consult and contribute to the Suggestion page. The major changes are:

  • New filters have been added, including a ‘Language’ filter that enables the user to select only those suggestions made for a specific language. For example, selecting ‘Czech’ displays the suggestions for Czech only.
  • A new ‘Date range’ filter allows the user to select suggestions made during a specific time period, for example, from 1 February 2018 to 3 March 2018.
  • The new ‘Status’ filter indicates which suggestions are ‘Under discussion’ (default), or have been ‘Accepted’ or ‘Rejected’.
  • ‘Action type’ contains a list of controlled terms that indicate the type of suggestion. For example, ticking ‘add PT’ and ‘remove RT’ displays a list of all PTs for which it has been suggested to ‘Add PT’ or ‘Remove RT’, in chronological order. Users can also indicate whether they would like their suggestion to be considered for ELSST only or for both the ELSST and HASSET thesauri (English thesaurus).
  • The results page has been revised completely and contains information about ‘Concept suggestion’, ‘Action’, ‘Language’, ‘Status’, ‘Core’, ‘Date’, ‘Who’. Users can sort their results according to each column.
  • Concept Suggestion identifies an existing or a new preferred term or concept the suggestion is about.
  • Action stores information about the type of action suggested. For example, removing an RT relationship, changing a scope note, adding a UF.
  • Language displays the suggestions made for a specific language, for example ‘ES’ (Spanish).
  • Status displays the status of the suggestion, for example ‘Rejected’ or ‘Accepted’.
  • Core: No means that the suggestion is for either ELSST or HASSET only.
  • Date and Who specify when and by whom the suggestion was made.
  • Results can be downloaded as an Excel or PDF document or they can be printed.

‘View suggestion’ page

To read more about a suggestion, including any comments on it, click on the suggestion on the Suggestion page. Users can also add a comment of their own.

While the author of the suggestion is allowed to re-edit his/her suggestion, others can only add one or more ‘Comments’. It goes without saying that only Administrators can access the Administration area, which contains for example a suggestion’s status (Under discussion, Accepted, Rejected).

‘Suggest a change’ page

The ‘Suggest a change’ page has been revised fully and has been improved in terms of user-friendliness.

Users should first select a ‘Language’ to clarify which language their suggestion is meant for. This also changes the tree view of the thesaurus on this page to that language.

Each suggestion needs to be identified by a ‘Concept suggestion’ and an ‘Action type’ (both mandatory).  The ‘Discussion’ field enables users to clarify why they would like to see their suggestion accepted. The data that support their suggestion can be inserted in the ‘Supporting data’ field. Users can also indicate whether their suggestion is ELSST only (non-core) or is meant to be considered for HASSET also (core).
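As a rough illustration of the form described above, a suggestion could be modelled as a small record with a check for the two mandatory fields. The class, field and function names are assumptions made for this sketch and do not reflect the actual TMS implementation.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    language: str
    concept_suggestion: str   # mandatory
    action_type: str          # mandatory, e.g. "add PT", "remove BT"
    discussion: str = ""
    supporting_data: str = ""
    core: bool = False        # True: to be considered for HASSET as well as ELSST


def missing_mandatory_fields(suggestion):
    """Return the names of mandatory fields that were left empty."""
    required = {
        "concept_suggestion": suggestion.concept_suggestion,
        "action_type": suggestion.action_type,
    }
    return [name for name, value in required.items() if not value.strip()]
```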

 

  5. Edit pages

‘Edit concept’ and ‘Edit translation’ pages

In order to implement the changes to the thesaurus described above, the edit pages for the source language (‘Edit concept’) and the edit pages for the translations (‘Edit translation’) have been revised and improved considerably. On the ‘Edit translation’ page, translators are given the ability to mark their translation as Untranslatable, or to change the parallel language to a language other than English. They can also add language-specific editorial notes. Finally, each note box offers extra edit facilities, such as bold, colour, insert links, preview, print etc.

 

  6. Documentation

New guides and training materials support the new Version of ELSST described above and will be made available behind the ‘Translators’ page.

Posted in Technical | Leave a comment

CESSDA ELSST New Release 19 September 2018

Lorna Balkan

We are pleased to announce the latest release of ELSST on 19 September 2018.

Changes that have been made to the source language of ELSST since the last release in September 2017 include the following:

  • 49 new concepts
  • 54 new Use For terms
  • 39 relabelled Preferred Terms (for currency or ambiguity)
  • 164 changes to the BT/NT relationships
  • 22 new or changed Scope Notes
  • 55 deleted concepts

The translation of Preferred Terms for the following languages is 98-100% complete: Czech, Danish, Finnish, French, German, Greek, Lithuanian, Norwegian, Romanian, Slovenian, Spanish and Swedish.

Since the previous release in September 2017 we have continued our restructuring work, focusing on the following hierarchies:

  • WELL-BEING (SOCIETY)
  • LAW AND JUSTICE
  • ENERGY
  • SPORT
  • SAFETY AND SECURITY

In each case, we tried to reduce the number of top terms (i.e. terms with no Broader Term) and orphan terms (i.e. terms with no Broader or Narrower Term), in order to make the thesaurus easier to browse. The removal of orphan terms will also make the thesaurus more SKOS-compliant.
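The two definitions above can be made concrete with a short sketch. It assumes the hierarchy is available as a mapping from each term to the set of its Broader Terms; the function name and the placeholder terms in the usage note are hypothetical.

```python
def top_and_orphan_terms(broader):
    """`broader` maps every term to the set of its Broader Terms (possibly empty).

    Top terms have no Broader Term; orphan terms have neither a Broader
    nor a Narrower Term.
    """
    # Any term that appears as someone's Broader Term has a Narrower Term.
    has_narrower = {bt for bts in broader.values() for bt in bts}
    top_terms = sorted(term for term, bts in broader.items() if not bts)
    orphans = sorted(term for term in top_terms if term not in has_narrower)
    return top_terms, orphans
```

With `{"SPORT": set(), "TERM A": {"SPORT"}, "TERM B": set()}`, SPORT and TERM B are top terms, and TERM B is also an orphan because nothing sits below it.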

We also reduced the number of polyhierarchies wherever possible (this work is ongoing), and applied the Related Term (RT) constraint rule that forbids an RT to be a term’s BT/NT or appear anywhere in its hierarchical structure. Many terms that were previously RTs can now be found by expanding a term’s Tree view.
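One way to check the RT constraint automatically is sketched below. It interprets ‘anywhere in its hierarchical structure’ as ‘among the term’s ancestors or descendants’, which is an assumption, and the function names are hypothetical.

```python
def ancestors(term, broader):
    """Collect every term reachable by following Broader Term links upwards."""
    seen, stack = set(), list(broader.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(broader.get(current, ()))
    return seen


def rt_violations(related, broader):
    """Return (term, related term) pairs where the RT is also an ancestor or descendant."""
    violations = []
    for term, rts in related.items():
        for rt in rts:
            if rt in ancestors(term, broader) or term in ancestors(rt, broader):
                violations.append((term, rt))
    return sorted(violations)
```

Because `broader` maps a term to a set, the check also works for polyhierarchies, where a term has more than one Broader Term.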

Work on redistributing the information that was previously contained in scope notes into the new note fields (i.e. Scope note, Scope note source, Use note and History note) was completed for the source language. At the same time, ELSST and HASSET were brought closer into alignment by making their scope notes identical. Thus ELSST concepts that are shared with HASSET (identified as ‘core’ in the thesaurus), now share their Preferred Term label, Broader Terms and Scope Note with HASSET.

The redistribution of scope notes into the new note fields in the target languages is still ongoing.

Further information on the new changes can be found at Changes to ELSST.

Posted in Uncategorized | Leave a comment

SERISS thesaurus evaluation: final results

Lorna Balkan

Project aims

In recent years, ELSST has benefited from two strands of funding. Development work on the thesaurus content and software has been funded by the ESRC-funded CESSDA-ELSST project, and continues under the EU-funded CESSDA Vocabulary Services Multilingual Content Management (CESSDA VOICE) project. Complementary work on the assessment of the translation quality of terms has been carried out as part of the Synergies for Europe’s Research Infrastructures in the Social Sciences (SERISS) project, which is also funded by the EU. The SERISS work has now been completed.

The SERISS project investigated two methods for assessing the translation quality of ELSST terms. The first method was back-translation, which was performed on a subset of ELSST terms in two of the target languages: French and German. The work was very labour-intensive, but helped to identify cases where translations differed semantically and stylistically from the source terms. This was reported previously (see the First results of SERISS project).

The second quality assessment method, discussed here, involved comparing the set of index terms assigned to the same resource in different languages. Two types of resource were chosen: whole studies and individual questions.

Whole-study indexing

For the whole-study indexing, a comparison was made of the ELSST terms used to index the same set of cross-national surveys in different languages. The indexing had been carried out previously by members of the Consortium of European Social Science Data Archives (CESSDA ERIC) or CESSDA-related archives, according to their own indexing procedures. These procedures differed widely, with some archives assigning index terms at more granular levels than others. Consequently the results, discussed in First results of SERISS project, were difficult to compare. Moreover, whole-study indexing produces an alphabetic list of terms where it is difficult to see which term relates to which part of a study.

Question indexing

Indexing individual questions is not currently practised by any of the CESSDA and CESSDA-related archives, but a small sample was selected and indexed as part of the SERISS project. The questions were taken from three surveys (the European Social Survey (ESS), the European Values Study (EVS) and the Survey of Health, Ageing and Retirement in Europe (SHARE)), and indexed with ELSST terms in German, Greek and Romanian. As before, the results were analysed, not only to compare how consistent the indexing was between indexers (and thereby uncover any potential problems with the terms or their translations), but also to see how well they covered the semantic content expressed in the questions.
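To give an idea of how consistency between two indexers could be quantified, the sketch below computes the Jaccard overlap of two sets of index terms. This measure is not taken from the SERISS report; it is simply one common choice. The sketch also assumes that terms in different languages have first been mapped to shared, language-independent concept identifiers.

```python
def jaccard(terms_a, terms_b):
    """Share of concepts chosen by both indexers among all concepts chosen by either."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0  # two empty sets are treated as fully consistent
    return len(a & b) / len(a | b)


def disagreements(terms_a, terms_b):
    """Concepts chosen by only one of the two indexers, worth a closer look."""
    return sorted(set(terms_a) ^ set(terms_b))
```

A low score for a question, together with the list of disagreements, would point to terms that may be ambiguous or redundant.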

This time, it was easier to see how the terms related to the object indexed (i.e. the question text) and to identify problems such as ambiguity (where a term’s meaning was not clear in the source and/or target language) and redundancy (where there was too great an overlap of meaning between two or more terms in the same language). However, other factors besides the properties of the terms themselves influenced the indexers’ choice of terms. In some cases, differences in how questions were worded in the different languages had an impact on the indexing terms chosen, and despite the fact that all indexers were using the same indexing instructions, each indexer interpreted them slightly differently. The experiment also revealed cases where the semantic content of the questions could not be adequately covered by ELSST terms, resulting in some new term suggestions. More details can be found in the Report on application of indexing terms in the data lifecycle.

Project impact

Overall, the SERISS work proved valuable in highlighting issues with ELSST terms and their translations. These issues cover semantic as well as more formal/stylistic aspects of terms. The results have been used to produce Guidelines for the management of ELSST content, which in turn have been used to update ELSST translation guidelines and training, and to inform ongoing thesaurus development and translation work.

Besides improving the translation quality of ELSST terms, the SERISS work will also be of interest to those investigating how to index data, i.e. what to index (whole studies, questions and/or variables) and where in the data lifecycle indexing should be carried out.

Results from the SERISS thesaurus evaluation

Posted in Uncategorized | Leave a comment

New guidelines on ELSST content

Lorna Balkan

Overview

As part of the SERISS project, we have recently produced new guidelines for the management of ELSST content. These incorporate findings from the first results of the project which used back-translation to evaluate the translation quality of ELSST terms.

The new guidelines are aimed primarily at ELSST translators and content developers, but will also be of interest to end-users. They are divided into six parts, which cover the following topics:

  1. overview and background information on ELSST
  2. the main linguistic elements of ELSST
  3. management structure
  4. construction and maintenance of the thesaurus
  5. the translation process
  6. quality control issues

Part 1 covers the purpose, history, scope and languages of ELSST, as well as versioning, access and copyright issues. Part 2 defines the various elements of ELSST, i.e. concepts, terms, relationships and notes. Part 3 describes the overall management structure, including the various roles and responsibilities, as well as the communication and decision-making processes. Parts 4 and 5 explain how each element of the thesaurus described in Part 2 is constructed and translated. A list of vocabulary resources for the translation process is provided in an appendix. Part 6 discusses quality control issues, and includes a checklist aimed at both source language editors and translators.

The guidelines take account of ISO 25964-1, the latest international standard on thesaurus construction, published by the International Organization for Standardization, as well as the work of other knowledge organisation experts and thesaurus developers. Each part of the guidelines presents best practice for thesauri in general, followed by specific guidelines for ELSST.

Current and future uses

The new guidelines have already been used for training new ELSST translators and for briefing those who are using ELSST to index survey questions as part of ongoing work in the SERISS project. (The SERISS work consists of indexing questions from three cross-national surveys (the European Social Survey (ESS), the Survey of Health, Ageing and Retirement in Europe (SHARE) and the European Values Study (EVS)) in three languages (German, Greek and Romanian) using ELSST terms from the relevant languages. The aim is to establish how well the ELSST terms match the content of the questions, and, at the same time, uncover any translation issues with the survey questions and/or the ELSST terms.)

The guidelines will also be used in the CESSDA-ELSST follow-up project, the Vocabulary Services Multilingual Content Management (VOICE) project, reported in the CESSDA-ELSST project update blog post. They will inform the new policy and procedural documents that will be required, as well as provide input for new training modules. Additional languages are planned for ELSST during the VOICE project (more details to follow). Different modalities for providing translator training will be investigated in order to make the training more effective.

Posted in Uncategorized | Leave a comment

CESSDA ELSST and the Taxonomy Boot Camp, 17-18 October 2017, Olympia London

The second edition of the Taxonomy Boot Camp took place in London on 17 & 18 October (http://www.taxonomybootcamp.com/London/2017/default.aspx). More than 200 participants attended the conference; 21 presentations were offered, most of them as part of parallel sessions. Below are some notes on the sessions I attended.

The opening keynote “Kick the beehive: new approaches to building taxonomies for the real world” by Madi Weland Solomon guided us into the real world of taxonomies and ontologies, text mining, auto-classification, graph databases and advanced semantic technologies. Questions about those important issues and developments were introduced by different musical notes and led to presents and gifts: a dynamic and entertaining session indeed.

Dave Clarke, CEO of Synaptica, presented a best practice guide on how to jump-start your taxonomy project. Indispensable key concepts are the formal data models ISO 25964-1 (2011) and SKOS. Governance and management (create, edit, publish) together with adopt-adapt-create are important cornerstones, as are principles such as semantic integrity, interoperability and collaboration.

The session “Making sense of unstructured and large datasets” focused on search, navigation and discovery, based on the notion of the “saviour machine” with regard to dealing/not dealing with text analytics and machine learning (Ahren Lehnert, Synaptica).  Martin Kaltenböck (Semantic Web) informed us in detail about the UN Climate Technology Centre and Network (CTCN), more specifically about knowledge graph implementation. It is useful to know that Semantic Web are the organisers of the SEMANTiCS conference (https://2017.semantics.cc/).

Interdisciplinarity, disambiguation, concept clarification, vocabulary alignment and the essential communication between taxonomists, librarians and field specialists formed the interesting issues discussed by Solveig Sørbø and Heidi Konestabo of the Norwegian Science Library at the University of Oslo. The important message taken from Roger Press (Academic Rights Press) was to pass data through several algorithms to derive the ranking order.

I took part in the session on “Working with large multi-faceted and multi-lingual taxonomies” with a presentation on “An in-depth view behind the scenes: the grammar, semantics and management of thesauri”. (An email to Jeannine.beeken@essex.ac.uk should suffice to receive a copy of the PPT.)

The other slot was given to Beate Früh, Annette Weilandt and Silvia Giacomotti who talked about “Translation of taxonomies: challenges, methods and synergies”, in other words, about the three Ts: Taxonomy, Terminology and Translation.  Their work for Suva (Swiss National Accident Insurance Fund) involves four languages: German, French, Italian and English.

A lecture by Tom Reamy (KAPS Group) on the ever so important problem of fake news and how to try and unmask it by using text analytics ended the first day.

The second day was opened by Joseph Busch (Taxonomies Strategies & Semantic Staffing) with a very interesting keynote on “AI vs. automation: the newest technologies for automatic tagging”.  Busch addressed the issues of complete and consistent metadata (a computer performs better than a human indexer: 80% – 70%). Cloud computing, in other words “buying space”, should be encouraged as should automated tagging.  The latter deals with entity extraction (NER), keyword extraction, categorizers trained by examples, summarization (identifying key sentences). The open source infrastructure/software GATE/ANNIE (services.gate.ac.uk/annie) was recommended to process human language.  AI on the other hand is characterized by trained/statistical categorizers, such as IBM Watson NLP, Intellexer, Lexalytics. Certainly worth a much closer look and investigation.

In the next session, “Collaborative working for website navigation project”, Emma Maxim (Government Digital Service, GDS, UK) talked about revisiting the common UI, which offered an amalgamation of 4 or 5 different taxonomies at the same time (search, navigation, filters, types etc.), and replacing it with one taxonomy based on themes only at the first level, with the information on the next levels presented in the form of concentric circles. She also guided the audience through a best practice for user research: discovery (information gathering with large groups of users), validation (agile research sprints), beta-testing (set up success criteria before testing!) and dissemination. She recommended using or developing machine learning algorithms to get a testable taxonomy. (More information on GDS developments can be found at https://gds.blog.gov.uk.)

“Semantic models in action” saw a presentation by Julia Barrott and Sukaina Bharwani (Stockholm Environment Institute) on “Visualising a harmonised language for climate services and disaster risk reduction”. Their main focus was on the visualisation of knowledge and knowledge discovery. Each term/concept/variable is identified by a tag/identity card/profile containing definitions, glossaries etc., which is clickable in tree structures.

Sabrina Wilske (Kantar TNS) discussed taxonomies for tweets, addressing the implementation of machine assisted algorithms and the linguistic problem of finding identifiers in tweets for abstract concepts (e.g. dehydration).

Veronique Malaisé (Elsevier), in “From vocabulary requirements to a SKOS-XL model at Elsevier”, focussed on RDF triples, comparing OWL, SKOS and SKOS-XL. While OWL defines specific properties for predicates, for example transitivity, in terms of “isa” relationships, SKOS expands its predicate options with, for example, meronymy. Using SHACL for automatic checks is a plus, but there are no possibilities for inference or for adding a predicate to the labels themselves (for example acronyms).  In other words, all information is attached to the concepts. SKOS-XL, however, also allows information to be added to the labels and inter-label relationships to be identified.

Cathy Dolbear (OUP) presented on “Automating the categorisation of academic content to a subject taxonomy” at the session “Taxonomy evaluation and maintenance”.  Her taxonomy contains 1500 nodes (SKOS-XML) in 6 layers and is based on/used for manual indexing, allowing polyhierarchies. The next step is to classify at chapter and article level, improving the discoverability of the topic/about field. The corpus contains 396 journals, 30,624 books and 7.5 million items at chapter/article level.  Concerning retrieval, Dolbear’s advice was to place precision above recall. Tools such as PoolParty, Protégé and Scikit-learn are used.

The plenary session on “Language is rarely neutral: why the ethics of taxonomies matter”, led by Stella Dextre Clarke, brought some interesting issues to the surface with respect to fake news and gender neutrality/expansion, in other words, the importance of how we classify facts, things and people.

Posted in Conferences/workshops | Leave a comment

CESSDA-ELSST project update

The Thesaurus Team

The CESSDA-ELSST project, which has funded the development of the ELSST and HASSET thesauri over the last five years, officially ended on 30 September 2017.

The project, funded by the UK’s Economic and Social Research Council (ESRC), had three main aims:

  • merging and improving the (internal) management interface of both thesauri
  • updating and improving the (external) user interface of both thesauri
  • reviewing and updating the thesauri’s content

All three goals have been achieved.

New thesaurus management system

The new thesaurus management system, which brought the two thesauri onto the same development platform, was launched in January 2015. Since then, work has focused on streamlining the workflow for concept management and improving reporting functions.

Two new separate, but visually similar, user interfaces were also launched in January 2015. They enable the user to access HASSET and ELSST at https://hasset.ukdataservice.ac.uk/ and https://elsst.ukdataservice.ac.uk/ respectively. An innovative feature of both is the interactive visual graph view which provides an alternative way of navigating thesaurus terms and relationships rather than the standard form view.

Both the thesaurus management system and the user interfaces have undergone two stages of development, in response to user feedback, and the latest versions are due for release later this year.

Updated content

Work on developing the content of HASSET and ELSST has progressed in full collaboration with ELSST translators. Communication is conducted via a dedicated email list and quarterly online meetings. The input from all who have contributed has greatly enhanced the status of ELSST as a multilingual resource and the level of commitment shown by everyone is testament to the value placed on ELSST by both the individuals concerned and their institutions.

We are pleased to report that during the lifetime of the project, four new languages have been added to ELSST: Czech, Lithuanian and Romanian (in 2015) and Slovenian (in 2017). In addition, two new archives, the Luxembourg Institute of Socio-Economic Research (LISER), and the Austrian Social Science Data Archive (AuSSDA) have joined us to contribute to the translation of French and German respectively.

The project also achieved its goal of an annual release of ELSST through two releases (in September 2016 and September 2017).

Future funding

We are happy to announce that further funding has been secured for ELSST from January 2018 to December 2018 via the CESSDA Vocabulary Services Multilingual Content Management (VOICE) work plan. This work will involve the continuation of content development and extension to more languages (details to be confirmed). The UK Data Service will continue to lead the work, with FSD and GESIS as lead partners. The UK Data Service and many ELSST partners are also involved in other related CESSDA projects, including the Euro Question Bank (EQB), the CESSDA Metadata Management (CMM) project, and the Controlled Vocabularies (CV) Manager project. Further information about all these projects can be found at https://www.cessda.eu/Projects/Work-Plans

In the meantime, we would like to thank all those who helped make CESSDA-ELSST a success, and we look forward to working with you all in the next phase of ELSST’s development.

Posted in Uncategorized | Leave a comment

Translating ELSST into Slovenian

Sonja Bezjak and Irena Bolko

The challenge

The Slovenian Social Science Data Archives (ADP) are keen supporters of CESSDA and the cross-national harmonization of archives. We firmly believe that translating the ELSST thesaurus is an important step towards achieving this goal. However, as a small team, we lack the necessary resources to fully engage in the translation project.

In the past few years ADP has been liaising with the Slovenian Common Language Resources and Technology Infrastructure (CLARIN), sharing our knowledge and experience. For ADP, using digital language technologies offered a promising way to reduce the time and effort of the translation process.

Automatic translation

Translating ELSST into Slovenian was carried out as a joint project consisting of two steps: automatic translation undertaken by a team of language technology experts, followed by manual editing of the translation by ADP with the support of terminology experts from the relevant subject domains.

In the first phase, the expert team selected and prepared several translation sources. The linguistic expert chose more general translation resources while ADP proposed subject dictionaries. Before translating, all terms and translations from the various translation sources were converted to upper case (as required by ELSST) and all plural-form ELSST source language terms were changed to singular to match the form in the translation sources.

Next, each whole English term was looked up in every translation source, and the results collated. Often the same translation was found in multiple sources. If no translations of the whole term were found, translations were constructed. English terms were subdivided and each subpart was translated independently. The translations of the subparts were then combined to produce a final Slovenian translation of the source term.
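The lookup procedure described above can be sketched roughly as follows. This is not the algorithm used by the expert team: the function names are hypothetical, the singularisation rule is deliberately crude, subparts are assumed to be single words, and the combined translation simply keeps the English word order. Each translation source is assumed to be a dictionary mapping upper-case singular English terms to lists of translations.

```python
def singularise(term):
    """Very rough plural-to-singular rule, applied to the last word only."""
    words = term.split()
    if not words:
        return term
    last = words[-1]
    if last.endswith("IES") and len(last) > 4:
        last = last[:-3] + "Y"
    elif last.endswith("S") and not last.endswith("SS"):
        last = last[:-1]
    return " ".join(words[:-1] + [last])


def normalise(term):
    """Upper-case the term and reduce it to the singular."""
    return singularise(term.strip().upper())


def translate(term, sources):
    """Return candidate translations for `term`, or an empty list if none are found."""
    key = normalise(term)
    # Step 1: look up the whole term in every source and collate distinct results.
    candidates = []
    for source in sources:
        for translation in source.get(key, []):
            if translation not in candidates:
                candidates.append(translation)
    if candidates:
        return candidates
    # Step 2: otherwise translate each word independently and combine the parts.
    parts = []
    for word in key.split():
        found = next((src[word][0] for src in sources if src.get(word)), None)
        if found is None:
            return []  # a subpart has no translation; leave for manual editing
        parts.append(found)
    return [" ".join(parts)]
```

Terms that come back with several candidates, or with none, are the ones that need the most attention during manual editing.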

Manual editing

In the second phase, the ADP team performed a manual check of the automatic translations, verifying and editing them where needed. This phase was subdivided into five tasks:

  • Choosing the best option among the translations produced from the various sources (as a result of the automatic translation)
  • Checking and highlighting the terms with potentially problematic translations (e.g. no appropriate translation, multiple options)
  • Checking the translations where issues were detected and seeking advice from subject experts
  • Consulting the linguistic expert on Slovenian grammar rules
  • Confirming the final list of translations

This was the first time that ELSST translation has been undertaken using semi-automatic translation. We believe that this allowed us not only to produce appropriate Slovenian translations but also to reduce our workload.

Further information

A more thorough explanation of the process described above and the algorithms used is beyond the scope of this blog post. However, should you be interested in reading more we are happy to hear from you and provide you with additional information. You can contact us either by replying to this blog post or by sending an email to
arhiv.podatkov@fdv.uni-lj.si.

See also CESSDA ELSST New Release 21 September 2017

 

Posted in Uncategorized | Leave a comment