New guidelines on ELSST content

Lorna Balkan


As part of the SERISS project, we have recently produced new guidelines for the management of ELSST content. They incorporate findings from the first results of the project, which used back-translation to evaluate the translation quality of ELSST terms.

The new guidelines are aimed primarily at ELSST translators and content developers, but will also be of interest to end-users. They are divided into six parts, which cover the following topics:

  1. overview and background information on ELSST
  2. the main linguistic elements of ELSST
  3. management structure
  4. construction and maintenance of the thesaurus
  5. the translation process
  6. quality control issues

Part 1 covers the purpose, history, scope and languages of ELSST, as well as versioning, access and copyright issues. Part 2 defines the various elements of ELSST, i.e. concepts, terms, relationships and notes. Part 3 describes the overall management structure, including the various roles and responsibilities, as well as the communication and decision-making processes. Parts 4 and 5 explain how each element of the thesaurus described in Part 2 is constructed and translated. A list of vocabulary resources for the translation process is provided in an appendix. Part 6 discusses quality control issues, and includes a checklist aimed at both source language editors and translators.

The guidelines take account of ISO 25964-1, the latest international standard on thesaurus construction, published by the International Organization for Standardization, as well as the work of other knowledge organisation experts and thesaurus developers. Each part of the guidelines presents best practice for thesauri in general, followed by specific guidelines for ELSST.

Current and future uses

The new guidelines have already been used for training new ELSST translators and for briefing those who are using ELSST to index survey questions as part of ongoing work in the SERISS project. (The SERISS work consists of indexing questions from three cross-national surveys (the European Social Survey (ESS), the Survey of Health, Ageing and Retirement in Europe (SHARE) and the European Values Study (EVS)) in three languages (German, Greek and Romanian), using ELSST terms from the relevant languages. The aim is to establish how well the ELSST terms match the content of the questions and, at the same time, to uncover any translation issues with the survey questions, the ELSST terms, or both.)

The guidelines will also be used in the CESSDA-ELSST follow-up project, the Vocabulary Services Multilingual Content Management (VOICE) project, reported in the CESSDA-ELSST project update blog post. They will inform the new policy and procedural documents that will be required, as well as provide input for new training modules. Additional languages are planned for ELSST during the VOICE project (more details to follow). To increase its effectiveness, different modalities for delivering translator training will be investigated.

Posted in Uncategorized

CESSDA ELSST and the Taxonomy Boot Camp, 17-18 October 2017, Olympia London

The second edition of the Taxonomy Boot Camp took place in London on 17 and 18 October. More than 200 participants attended the conference; 21 presentations were offered, most of them as part of parallel sessions. Here are some notes on the sessions I attended.

The opening keynote, “Kick the beehive: new approaches to building taxonomies for the real world” by Madi Weland Solomon, guided us into the real world of taxonomies and ontologies, text mining, auto-classification, graph databases and advanced semantic technologies. Questions about these important issues and developments were introduced by different musical notes and led to presents and gifts: a dynamic and entertaining session indeed.

Dave Clarke, CEO of Synaptica, presented a best-practice guide on how to jump-start a taxonomy project. Indispensable key concepts are the formal data models ISO 25964-1 (2011) and SKOS. Governance and management (create, edit, publish), together with adopt-adapt-create, are important cornerstones, as are principles such as semantic integrity, interoperability and collaboration.

The session “Making sense of unstructured and large datasets” focused on search, navigation and discovery, framed around the notion of a “saviour machine” and the question of whether or not to engage with text analytics and machine learning (Ahren Lehnert, Synaptica). Martin Kaltenböck (Semantic Web) informed us in detail about the UN Climate Technology Centre and Network (CTCN), more specifically about its knowledge graph implementation. Useful to know that Semantic Web are the organisers of the SEMANTiCS conference.

Interdisciplinarity, disambiguation, concept clarification, vocabulary alignment and the essential communication between taxonomists, librarians and field specialists were the issues discussed by Solveig Sørbø and Heidi Konestabo of the Norwegian Science Library at the University of Oslo. The important message from Roger Press (Academic Rights Press) was to pass data through several algorithms to derive the ranking order.

I took part in the session on “Working with large multi-faceted and multi-lingual taxonomies” with a presentation on “An in-depth view behind the scenes: the grammar, semantics and management of thesauri” (An email to should suffice to receive a copy of the PPT.)

The other slot was given to Beate Früh, Annette Weilandt and Silvia Giacomotti who talked about “Translation of taxonomies: challenges, methods and synergies”, in other words, about the three Ts: Taxonomy, Terminology and Translation.  Their work for Suva (Swiss National Accident Insurance Fund) involves four languages: German, French, Italian and English.

A lecture by Tom Reamy (KAPS Group) on the ever so important problem of fake news and how to try and unmask it by using text analytics ended the first day.

The second day was opened by Joseph Busch (Taxonomies Strategies & Semantic Staffing) with a very interesting keynote on “AI vs. automation: the newest technologies for automatic tagging”. Busch addressed the issues of complete and consistent metadata (a computer performs better than a human indexer: 80% vs 70%). Cloud computing, in other words “buying space”, should be encouraged, as should automated tagging. The latter deals with entity extraction (NER), keyword extraction, categorizers trained by examples, and summarization (identifying key sentences). The open-source software GATE/ANNIE was recommended for processing human language. AI, on the other hand, is characterized by trained/statistical categorizers, such as IBM Watson NLP, Intellexer and Lexalytics. Certainly worth a much closer look.

In the next session, “Collaborative working for website navigation project”, Emma Maxim (Government Digital Service, GDS, UK) described replacing the common UI, an amalgamation of 4 or 5 different taxonomies used at the same time (search, navigation, filters, types etc.), with a single taxonomy: themes only at the first level, with the next levels presented as concentric circles. She also guided the audience through best practice for user research: discovery (information gathering with large groups of users), validation (agile research sprints), beta-testing (set up success criteria before testing!) and dissemination. She recommended using or developing machine-learning algorithms to obtain a testable taxonomy. (More information on GDS developments can be found at

“Semantic models in action” saw a presentation by Julia Barrott and Sukaina Bharwani (Stockholm Environment Institute) on “Visualising a harmonised language for climate services and disaster risk reduction”. Their main focus was on the visualisation of knowledge and knowledge discovery. Each term/concept/variable is identified by a tag/identity card/profile containing definitions, glossaries etc., which is clickable in tree structures.

Sabrina Wilske (Kantar TNS) discussed taxonomies for tweets, addressing the implementation of machine-assisted algorithms and the linguistic problem of finding identifiers in tweets for abstract concepts (e.g. dehydration).

Veronique Malaisé (Elsevier), in “From vocabulary requirements to a SKOS-XL model at Elsevier”, focused on RDF triples, comparing OWL, SKOS and SKOS-XL. While OWL defines specific properties for predicates, for example transitivity, in terms of “is-a” relationships, SKOS expands its predicate options with, for example, meronymy. Using SHACL for automatic checks is a plus, but there is no possibility of making inferences or of adding a predicate to the labels themselves (for example acronyms); in other words, all information is attached to the concepts. SKOS-XL, however, also allows information to be added to the labels and inter-label relationships to be identified.
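The SKOS vs SKOS-XL distinction can be illustrated with a few schematic triples. This is a sketch in plain Python tuples rather than real RDF; the `ex:` URIs and the `ex:acronym` property are invented for illustration (only `skosxl:prefLabel`, `skosxl:literalForm` and `skosxl:labelRelation` are real SKOS-XL properties):

```python
# Schematic illustration of the SKOS vs SKOS-XL difference, using plain
# (subject, predicate, object) tuples rather than real RDF. The ex: URIs
# and the ex:acronym property are invented for this example.

# Plain SKOS: the label is a literal attached directly to the concept,
# so nothing further can be said about the label itself.
skos_triples = [
    ("ex:Concept1", "skos:prefLabel", '"European Social Survey"@en'),
]

# SKOS-XL: the label is reified as a resource, so extra statements
# (e.g. an acronym, or relations to other labels) can be attached to it.
skosxl_triples = [
    ("ex:Concept1", "skosxl:prefLabel", "ex:Label1"),
    ("ex:Label1", "skosxl:literalForm", '"European Social Survey"@en'),
    ("ex:Label1", "ex:acronym", '"ESS"'),                # info on the label itself
    ("ex:Label1", "skosxl:labelRelation", "ex:Label2"),  # inter-label link
]

# In plain SKOS the literal label never appears as a subject,
# so no further statements can be made about it:
skos_subjects = {s for s, _, _ in skos_triples}
print('"European Social Survey"@en' in skos_subjects)  # → False

# In SKOS-XL the label resource does appear as a subject:
xl_subjects = {s for s, _, _ in skosxl_triples}
print("ex:Label1" in xl_subjects)  # → True
```

The point of the reification is exactly what the talk described: in SKOS all information hangs off the concept, whereas SKOS-XL lets statements hang off the labels too.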

Cathy Dolbear (OUP) presented on “Automating the categorisation of academic content to a subject taxonomy” in the session “Taxonomy evaluation and maintenance”. Her taxonomy contains 1,500 nodes (in SKOS XML) in 6 layers, allows polyhierarchies, and is based on, and used for, manual indexing. The next step is to classify at chapter and article level, improving the discoverability of the topic/about field. The corpus contains 396 journals, 30,624 books and 7.5 million items at chapter/article level. Concerning retrieval, Dolbear’s advice was to place precision above recall. Tools such as PoolParty, Protégé and Scikit-learn are used.

The plenary session on “Language is rarely neutral: why the ethics of taxonomies matter”, led by Stella Dextre Clarke, brought some interesting issues to the surface with respect to fake news and gender neutrality/expansion, in other words, the importance of how we classify facts, things and people.

Posted in Conferences/workshops

CESSDA-ELSST project update

The Thesaurus Team

The CESSDA-ELSST project, which has funded the development of the ELSST and HASSET thesauri over the last five years, officially ended on 30 September 2017.

The project, funded by the UK’s Economic and Social Research Council (ESRC), had three main aims:

  • merging and improving the (internal) management interface of both thesauri
  • updating and improving the (external) user interface of both thesauri
  • reviewing and updating the thesauri’s content

All three goals have been achieved.

New thesaurus management system

The new thesaurus management system, which brought the two thesauri onto the same development platform, was launched in January 2015. Since then, work has focused on streamlining the workflow for concept management and improving reporting functions.

Two new separate, but visually similar, user interfaces were also launched in January 2015. They enable the user to access HASSET and ELSST at and respectively. An innovative feature of both is the interactive visual graph view, which provides an alternative to the standard form view for navigating thesaurus terms and relationships.

Both the thesaurus management system and the user interfaces have undergone two stages of development, in response to user feedback, and the latest versions are due for release later this year.

Updated content

Work on developing the content of HASSET and ELSST has progressed in full collaboration with ELSST translators. Communication is conducted via a dedicated email list and quarterly online meetings. The input from all who have contributed has greatly enhanced the status of ELSST as a multilingual resource and the level of commitment shown by everyone is testament to the value placed on ELSST by both the individuals concerned and their institutions.

We are pleased to report that during the lifetime of the project, four new languages have been added to ELSST: Czech, Lithuanian and Romanian (in 2015) and Slovenian (in 2017). In addition, two new archives, the Luxembourg Institute of Socio-Economic Research (LISER), and the Austrian Social Science Data Archive (AuSSDA) have joined us to contribute to the translation of French and German respectively.

The project also achieved its goal of an annual release of ELSST through two releases (in September 2016 and September 2017).

Future funding

We are happy to announce that further funding has been secured for ELSST from January to December 2018 via the CESSDA Vocabulary Services Multilingual Content Management (VOICE) work plan. This work will involve the continuation of content development and extension to more languages (details to be confirmed). The UK Data Service will continue to lead the work, with FSD and GESIS as lead partners. The UK Data Service and many ELSST partners are also involved in other related CESSDA projects, including the Euro Question Bank (EQB), the CESSDA Metadata Management (CMM) project, and the Controlled Vocabularies (CV) Manager project. Further information about all these projects can be found at

In the meantime, we would like to thank all those who helped make CESSDA-ELSST a success, and we look forward to working with you all in the next phase of ELSST’s development.

Posted in Uncategorized

Translating ELSST into Slovenian

Sonja Bezjak and Irena Bolko

The challenge

The Slovenian Social Science Data Archives (ADP) are keen supporters of CESSDA and the cross-national harmonization of archives. We firmly believe that translating the ELSST thesaurus is an important step towards achieving this goal. However, as a small team, we lack the necessary resources to fully engage in the translation project.

In the past few years ADP has been liaising with the Slovenian Common Language Resources and Technology Infrastructure (CLARIN), sharing our knowledge and experience. For ADP, using digital language technologies offered a promising way to reduce the time and effort involved in the translation process.

Automatic translation

Translating ELSST into Slovenian was carried out as a joint project consisting of two steps: automatic translation undertaken by a team of language technology experts, followed by manual editing of the translation by ADP with the support of terminology experts from the relevant subject domains.

In the first phase, the expert team selected and prepared several translation sources. The linguistic expert chose more general translation resources while ADP proposed subject dictionaries. Before translating, all terms and translations from the various translation sources were converted to upper case (as required by ELSST) and all plural-form ELSST source language terms were changed to singular to match the form in the translation sources.

Next, each whole English term was looked up in every translation source, and the results collated. Often the same translation was found in multiple sources. If no translations of the whole term were found, translations were constructed. English terms were subdivided and each subpart was translated independently. The translations of the subparts were then combined to produce a final Slovenian translation of the source term.
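The lookup-and-compose step described above can be sketched as follows. This is a simplified illustration only: the function names and the dictionary entries (Slovenian glosses) are invented, and it is not ADP's actual pipeline or data.

```python
# Sketch of the lookup-and-compose translation step (illustrative only;
# the helper names and dictionary entries are invented, not ADP's data).

def normalise(term: str) -> str:
    """Upper-case a term, as required by ELSST."""
    return term.strip().upper()

def translate_term(term: str, sources: list[dict]) -> list[str]:
    """Look up a whole English term in each translation source and collate
    the results; if no whole-term translation is found, subdivide the term,
    translate each subpart independently, and recombine."""
    term = normalise(term)
    # 1. Whole-term lookup, collating hits across all sources.
    hits = [src[term] for src in sources if term in src]
    if hits:
        return sorted(set(hits))  # the same translation may occur in several sources
    # 2. Fallback: translate subparts and combine them.
    translated_parts = []
    for part in term.split():
        part_hits = [src[part] for src in sources if part in src]
        if not part_hits:
            return []  # no translation found: flag for manual editing
        translated_parts.append(part_hits[0])
    return [" ".join(translated_parts)]

# Invented example sources (a subject dictionary and a general resource):
subject = {"HOUSING POLICY": "STANOVANJSKA POLITIKA"}
general = {"HOUSING": "STANOVANJE", "POLICY": "POLITIKA"}

print(translate_term("Housing policy", [subject, general]))  # whole-term hit
print(translate_term("Housing", [subject, general]))         # single-word hit
```

Terms for which the function returns several candidates, a composed translation, or nothing at all would correspond to the cases passed on to the manual-editing phase described below.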

Manual editing

In the second phase, the ADP team performed a manual check of the automatic translations, verifying and editing them where needed. This phase was subdivided into five tasks:

  • Choosing the best option among the translations produced from the various sources (as a result of the automatic translation)
  • Checking and highlighting the terms with potentially problematic translations (e.g. no appropriate translation, multiple options)
  • Checking the translations where issues were detected and seeking advice from subject experts
  • Consulting the linguistic expert on Slovenian grammar rules
  • Confirming the final list of translations

This was the first time that ELSST translation has been undertaken using semi-automatic translation. We believe that this allowed us not only to produce appropriate Slovenian translations but also to reduce our workload.

Further information

A more thorough explanation of the process described above and the algorithms used is beyond the scope of this blog post. However, should you be interested in reading more we are happy to hear from you and provide you with additional information. You can contact us either by replying to this blog post or by sending an email to

See also CESSDA ELSST New Release 21 September 2017


Posted in Uncategorized

CESSDA ELSST New Release 21 September 2017

Jeannine Beeken

We are pleased to announce a new release of the European Language Social Science Thesaurus (ELSST). ELSST is now available in 13 languages: Czech, Danish, English, Finnish, French, German, Greek, Lithuanian, Norwegian, Romanian, Slovenian, Spanish and Swedish. The thesaurus offers information about almost 40,000 preferred terms and contains more than 33,000 non-preferred terms.

Since the previous release in 2016, a considerable number of changes and improvements have been made across all languages. Details on the types of changes can be found at Changes to ELSST. The average percentage of translated preferred terms is 99%, with the majority of languages having 100% coverage.

We are very proud to announce that Slovenian has been added as our 13th language. The Slovenian translations of all preferred terms have been provided by our colleagues at the Slovenian Social Science Data Archives (ADP). See also

Finally, we are pleased to announce that our colleagues at the Luxembourg Institute of Socio-Economic Research (LISER) have joined us to contribute to the French translations; our colleagues at the Austrian Social Science Data Archive (AuSSDA) have joined us to contribute to the German translation.

The current version of ELSST was released on 21 September 2017. The previous version dates from 6 September 2016.

Posted in Uncategorized

Notes from symposium on language technology and translation

Lorna Balkan

Translation tools and technologies have gained increasing importance in the translation sector over the years, but until now have been little applied to the specific field of survey translation. To rectify this, the SERISS project held a symposium on “synergies between survey translation and developments in language and translation sciences” at University Pompeu Fabra (UPF) in Barcelona from 1-2 June 2017. The meeting was attended by delegates from the three main surveys involved in the SERISS project (the European Social Survey (ESS), the Survey of Health, Ageing and Retirement in Europe (SHARE ERIC), and the European Values Study (EVS)), as well as by representatives of translation technology companies (cApStAn, Kantar Public and CentERdata) and by a leading academic in automated translation, Professor Toni Badia of UPF.

I was invited to report on my recent work on evaluating ELSST translation quality (see First results of SERISS project) and to consider prospects for automating the translation and evaluation of thesaurus terms.

Two general recommendations emerged from the symposium that are relevant to the translation of ELSST, as well as to survey translation.

First, it is important to analyse the whole life cycle of a product (not just the translation process) and to understand all the steps involved. Action taken prior to the translation phase has an impact on the translation process. We know this in ELSST, which is why we try, when creating new concepts, to choose source language labels and scope notes that will not present problems to the translators.

Second, it is critical to identify which steps a machine can perform better than humans. In the case of survey translation, this includes recognising questions that have not changed since the last wave of a survey, and which thus do not need to be retranslated. It also includes consistency checking in the quality assessment phase. Consistency checking would also be useful in ELSST, to make sure that source language terms that appear within other terms are given the same translation in the corresponding target language terms, where appropriate.
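The kind of consistency check described here can be sketched as follows. This is a simplified illustration, not an actual ELSST tool; the English–German term pairs are invented for the example:

```python
# Sketch of the consistency check described above: where one source-language
# term occurs inside another, verify that its established translation also
# occurs inside the longer term's translation. Term pairs are invented.

def inconsistent_translations(pairs: dict[str, str]) -> list[tuple[str, str]]:
    """Return (short_term, long_term) pairs where short_term occurs in
    long_term but its translation does not occur in long_term's translation."""
    problems = []
    for short_src, short_tgt in pairs.items():
        for long_src, long_tgt in pairs.items():
            if short_src != long_src and short_src in long_src:
                if short_tgt not in long_tgt:
                    problems.append((short_src, long_src))
    return problems

# Invented English-to-German examples:
pairs = {
    "EMPLOYMENT": "BESCHAEFTIGUNG",
    "YOUTH EMPLOYMENT": "JUGENDBESCHAEFTIGUNG",  # consistent compound
    "EMPLOYMENT POLICY": "ARBEITSMARKTPOLITIK",  # flagged for review
}
print(inconsistent_translations(pairs))
# → [('EMPLOYMENT', 'EMPLOYMENT POLICY')]
```

A flagged pair is not necessarily an error (the caveat “where appropriate” applies: some target languages legitimately translate a compound non-compositionally), so the output would be a review list for translators rather than an automatic correction.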

Delegates agreed that translation memories, which store previously translated text for reuse, would be helpful for survey translation. The SERISS project is currently using CentERdata’s Translation Management Tool (TMT) for managing the translation of its questionnaires. It is not particularly relevant to ELSST right now, but we shall see how it develops in the course of the project. The plan is to integrate it with a translation memory in the near future.

Toni Badia suggested that machine translation is mature enough to offer a first draft of survey questions when human resources are not available. He mentioned that phrase-based statistical machine translation systems, such as Moses, are a good starting point, but noted that the paradigm is shifting towards neural machine translation, which promises better quality. This is certainly something we could also consider for ELSST.

Another recommendation of the symposium was for translators of the different surveys to collaborate with each other. Each has a list of well-known translation problems. These problems would interest ELSST developers also, so we shall ask to be included in any future collaboration.

Posted in Uncategorized

Observatory for Knowledge Organisation Systems workshop, Malta, 2017

Suzanne Barbalet

The Observatory for Knowledge Organisation Systems workshop, an event of KNOWeSCAPE – Analyzing the Dynamics of Information and Knowledge Landscapes, met in Malta from 1-3 February this year. KNOWeSCAPE is funded under the European Cooperation in Science and Technology (COST) framework.

Based on a European intergovernmental framework for cooperation in science and technology, COST supports trans-national cooperation across all fields in science and technology, including social sciences and humanities, through pan-European networking of nationally funded research activities. Contributors to this event were indeed representative of such a cross-section of Knowledge Organisation System (KOS) users and developers.

Philipp Mayr from the GESIS department of Knowledge Technologies for the Social Sciences (WTS) reported on case studies using two KOS applications for query expansion on two platforms that his team manages: Sowiport and the Social Science Open Access Repository (SSOAR). These platforms cover bibliographic information and full text in the social sciences. When the two KOS applications were mapped to one another by subject specialists, it was found that the use of more than one KOS tool improved the user experience.

A presentation by Kalpana Shankar reported on a study of two data archives, UK Data Archive, University of Essex and Inter-university Consortium for Political and Social Research (ICPSR), University of Michigan, with the aim of raising larger questions about data sustainability. Still a work in progress, it promises to be an important piece of research on the history of social science data archives and their impact on the social sciences in the latter part of the twentieth century.

An interesting talk by Paul Groth, Elsevier, Netherlands considered how the process of constructing a KOS might be changing with the incorporation of software agents and non-professional contributors and suggested a role for a KOS observatory to engage these issues.

On the second day the workshop concluded with an interesting question and answer session on the collaborative process used in the development of Wikipedia content. A strong thread in the debate that ensued was the role a national Wikipedia may play in the preservation of cultural heritage.

In the first presentation of the cataloguing session, Jan Kozlowski challenged us to consider the importance of providing metadata for modern European manuscripts. Two further presentations concerned the Universal Decimal Classification (UDC) schema. In her presentation, “Universal Knowledge Classifications: From Linking Information to Linked Data”, Aida Slavic stressed the importance of classification in ensuring full retrieval in those instances where we cannot afford not to find everything. This argument was similar to that made by Patrick Lambe at the ISKO UK conference 2015 and reported on in ELSST Development and News.

A presentation by Peter Hook from Wayne State University, entitled “Visualizing Knowledge Organization Systems”, provided some interesting insights relevant to a project I have been piloting that uses UDC to manage the content of subject categories or topics.

I am particularly grateful to Andrea Scharnhorst of Data Archiving and Networked Services (DANS), Netherlands and Aida Slavic, editor-in-chief of the UDC Consortium for the invitation to attend.

Posted in Conferences/workshops