Observatory for Knowledge Organisation Systems workshop, Malta, 2017

Suzanne Barbalet

The Observatory for Knowledge Organisation Systems workshop, an event of KNOWeSCAPE – Analyzing the Dynamics of Information and Knowledge Landscapes, met in Malta from 1-3 February this year. KNOWeSCAPE is funded under the European Cooperation in Science and Technology (COST) framework.

Based on a European intergovernmental framework for cooperation in science and technology, COST supports trans-national cooperation across all fields in science and technology, including social sciences and humanities, through pan-European networking of nationally funded research activities. Contributors to this event were indeed representative of such a cross-section of Knowledge Organisation System (KOS) users and developers.

Philipp Mayr from the GESIS department of Knowledge Technologies for the Social Sciences (WTS) reported on case studies using two KOS applications for query expansion on two platforms that his team manages: Sowiport and the Social Science Open Access Repository (SSOAR). These platforms cover bibliographic information and full text in the social sciences. When the two KOS applications were mapped to one another by subject specialists, it was found that the use of more than one KOS tool improved the user experience.

A presentation by Kalpana Shankar reported on a study of two data archives, UK Data Archive, University of Essex and Inter-university Consortium for Political and Social Research (ICPSR), University of Michigan, with the aim of raising larger questions about data sustainability. Still a work in progress, it promises to be an important piece of research on the history of social science data archives and their impact on the social sciences in the latter part of the twentieth century.

An interesting talk by Paul Groth of Elsevier, Netherlands, considered how the process of constructing a KOS might be changing with the incorporation of software agents and non-professional contributors, and suggested a role for a KOS observatory in engaging with these issues.

On the second day the workshop concluded with an interesting question and answer session on the collaborative process used in the development of Wikipedia content. A strong thread in the debate that ensued was the role a national Wikipedia may play in the preservation of cultural heritage.

In the first presentation of the cataloguing session, Jan Kozlowski challenged us to consider the importance of providing metadata for modern European manuscripts. Two further presentations were on the Universal Decimal Classification (UDC) schema. In her presentation, Universal Knowledge Classifications: From Linking Information to Linked Data, Aida Slavic stressed the importance of classification to ensure full retrieval in those instances where we cannot afford not to find everything. This argument was similar to that made by Patrick Lambe at the ISKO UK conference 2015 and reported on in ELSST Development and News.

A presentation by Peter Hook from Wayne State University entitled Visualizing Knowledge Organization Systems provided some interesting insights that have application for a project I have been piloting using UDC to manage the content of subject categories or topics.

I am particularly grateful to Andrea Scharnhorst of Data Archiving and Networked Services (DANS), Netherlands, and Aida Slavic, editor-in-chief of the UDC Consortium, for the invitation to attend.

First results of SERISS project

Lorna Balkan

Synergies for Europe’s Research Infrastructures in the Social Sciences (SERISS) is a four-year project (2015-2019) funded by the European Commission as part of its Horizon 2020 programme. It aims to foster collaboration and develop shared standards between the three leading European research infrastructures in the social sciences – the European Social Survey (ESS ERIC), the Survey of Health, Ageing and Retirement in Europe (SHARE ERIC), and the Consortium of European Social Science Data Archives (CESSDA AS) – and organisations representing the Generations and Gender Programme (GGP), European Values Study (EVS) and the WageIndicator Survey. Work focuses on three key areas: addressing key challenges for cross-national data collection, breaking down barriers between social science infrastructures, and embracing the future of the social sciences.

The first results of the project are now available online. These include D3.9: Report on findings from re-translation of ELSST terms and their use in the CESSDA Portal reporting the work done by the UK Data Service. The deliverable describes two methods that were used to assess the translation quality of ELSST terms, and is in two parts.


The first part describes the evaluation of a subset of ELSST French and German terms (1000 from each language) using the re-translation (or more precisely, the back-translation) method. The French and German terms were back-translated into the source language (English) and differences between the back-translations and the original source language terms were then analysed. This resulted in a classification of error types, and a number of recommendations. The deliverable shows that, while the back-translation method was useful in highlighting some issues with the thesaurus that affect both its ‘semantic adequacy’ (i.e. how adequate terms are from a semantics point of view) and its ‘formal adequacy’ (i.e. the extent to which terms conform to ELSST Translation Guidelines), it has nothing to say about ‘pragmatic adequacy’ (i.e. how acceptable terms are to users), or how the terms would function in an operational setting. Back-translation should, therefore, be seen as one of several complementary evaluation methods.
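To make the back-translation comparison concrete, here is a minimal sketch in Python. It is not the procedure used for the deliverable: the record layout, the normalisation rule and the example rows are all assumptions made for illustration, and the sample terms are invented rather than taken from ELSST. It only flags the back-translations that differ from the original source term, which is the point at which a human reviewer would start classifying error types.

```python
from dataclasses import dataclass


@dataclass
class BackTranslationCheck:
    """One term to evaluate (hypothetical record layout)."""
    source_term: str        # original source-language (English) term
    target_term: str        # translated term, e.g. French or German
    back_translation: str   # target term translated back into English


def normalise(term: str) -> str:
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(term.lower().split())


def find_mismatches(checks):
    """Return the checks whose back-translation differs from the source term."""
    return [
        c for c in checks
        if normalise(c.back_translation) != normalise(c.source_term)
    ]


# Invented example rows, not real ELSST data.
checks = [
    BackTranslationCheck("UNEMPLOYMENT", "CHÔMAGE", "unemployment"),
    BackTranslationCheck("HOUSING", "LOGEMENT", "accommodation"),
]

# Each mismatch is only a candidate for review; a person still has to decide
# whether it reflects a real translation problem.
for c in find_mismatches(checks):
    print(f"{c.source_term} -> {c.target_term} -> {c.back_translation}")
```

A mismatch found this way says nothing about how acceptable a term is to users, which is why the method is treated as one of several complementary checks.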

ELSST in use

One such complementary evaluation method is to compare the sets of terms that have been used to index the same resources, to see if differences are due to differences in how the terms have been interpreted, indicating unintended ambiguity in either the source or target terms. This approach is explored in the second part of the deliverable which compares the sets of ELSST terms that have been used to index specific cross-national surveys. The original plan was to use the CESSDA portal to find such studies, but this had to be revised since the portal has not been operational for some time. Instead, CESSDA-ELSST partners were asked via a questionnaire how they index a set of cross-national surveys. Many thanks to all who responded. Differences in the sets of terms assigned to each survey were then analysed. Results showed that, due to the paucity of the data and the differences in indexing practices across archives, it was not possible to draw any firm conclusions on the quality of the translation. However, the work highlighted ways in which ELSST could be better exploited within the archives.
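The comparison of index term sets can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the analysis carried out for the deliverable: the archive names and terms are invented, and the Jaccard overlap is a measure chosen here for convenience rather than one named in the report.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Share of terms common to both sets, out of all terms used by either."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def compare_indexing(terms_by_archive: dict) -> list:
    """Compare every pair of archives that indexed the same survey."""
    results = []
    for (name_a, terms_a), (name_b, terms_b) in combinations(terms_by_archive.items(), 2):
        results.append({
            "pair": (name_a, name_b),
            "overlap": jaccard(terms_a, terms_b),
            "only_first": terms_a - terms_b,
            "only_second": terms_b - terms_a,
        })
    return results


# Invented example: terms two hypothetical archives assigned to one survey.
survey_terms = {
    "Archive A": {"EMPLOYMENT", "HEALTH", "POLITICAL ATTITUDES"},
    "Archive B": {"EMPLOYMENT", "HEALTH", "SOCIAL ATTITUDES"},
}

for row in compare_indexing(survey_terms):
    print(row["pair"], round(row["overlap"], 2), row["only_first"], row["only_second"])
```

The terms that appear on one side only are the ones worth examining for ambiguity. As the results above suggest, differences in indexing practice can produce the same pattern, so a low overlap is not in itself evidence of a translation problem.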

Next steps

The results of the evaluation work described above will feed into ongoing work on HASSET and ELSST within the CESSDA-ELSST project, and into the next goals of the SERISS project. SERISS goals in the next phase (to June 2017) include updating the ELSST translation guidelines and producing the next deliverable: D3.10 ‘Best practice document on translation and use of thesaurus terms’. This work will be produced in consultation with CESSDA-ELSST partners.

Related SERISS work

Complementary work within SERISS is looking at how to improve the translation quality of the questionnaires in cross-national surveys. Different approaches to questionnaire translation are being investigated, including how computational linguistic methods could be exploited. A workshop on this last topic is planned in the near future.


If you have any comments on this blog, please either add them below, or send them to the UK Data Service Thesaurus Team at thesaurus@ukdataservice.ac.uk

Showcasing taxonomies and their uses

Lorna Balkan

The first Taxonomy Boot Camp London was held at the Olympia Conference Centre, London, on 18-19 November. It was a chance for experts and novices alike to find out about the latest developments in taxonomies and their uses. It was also a chance to present first findings on the evaluation work I have been doing on ELSST within the context of the SERISS project – see my presentation on ‘Using back-translation for quality control in a multilingual thesaurus’.

The boot camp attracted an international audience from the academic, charity, and corporate sectors. ‘Taxonomy’ was understood in a general sense to cover thesauri, ontologies, and other less formal knowledge organization systems (KOSs). Presentations covered the design and construction of taxonomies, application areas such as search and corporate information/knowledge management, and software tools.

Keynote speakers were Mike Atherton, Content Strategist at Facebook, UK and Patrick Lambe, a well-known taxonomist from Straits Knowledge. Atherton argued that underpinning a website with a domain model makes it easier to maintain and update. Lambe showed how taxonomies can be used to organize corporate knowledge (including processes and procedures) as well as corporate information (data).

Taxonomies in theory

Heather Hedden, the author of The Accidental Taxonomist, gave an interesting presentation on non-preferred terms. She discussed how, in the term-based thesaurus model represented by ISO 25964-1 (and found in ELSST), non-preferred terms stand in an equivalence relationship to their preferred terms, while in the concept-based Simple Knowledge Organization System (SKOS) model, they are attributes of concepts. Stella Dextre Clarke, co-author of ISO 25964-1, pointed out that both standards were developed in close collaboration and are largely compatible. While ISO 25964-1 is built for thesauri, she noted, SKOS is designed to accommodate other KOSs also.
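The distinction Hedden drew can be illustrated with a small Python sketch. The class and function names below are hypothetical and are not taken from ISO 25964-1 or from any SKOS library. The only borrowed idea is that SKOS records a preferred label and alternative labels on a concept (its prefLabel and altLabel properties), and the example terms are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Term:
    """Term-based view: every term, preferred or not, is an object in its own right."""
    label: str
    # USE: set on a non-preferred term, pointing to its preferred term.
    use: Optional["Term"] = field(default=None, repr=False, compare=False)
    # USED FOR: set on a preferred term, listing its non-preferred terms.
    used_for: List["Term"] = field(default_factory=list, repr=False, compare=False)


@dataclass
class Concept:
    """Concept-based view: labels are attributes of one concept."""
    pref_label: str
    alt_labels: List[str] = field(default_factory=list)


def link_equivalence(preferred: Term, non_preferred: Term) -> None:
    """Record the equivalence relationship in both directions."""
    non_preferred.use = preferred
    preferred.used_for.append(non_preferred)


def to_concept(preferred: Term) -> Concept:
    """Fold a preferred term and its non-preferred terms into a single concept."""
    return Concept(
        pref_label=preferred.label,
        alt_labels=[t.label for t in preferred.used_for],
    )


# Invented example terms.
unemployment = Term("UNEMPLOYMENT")
joblessness = Term("JOBLESSNESS")
link_equivalence(unemployment, joblessness)

concept = to_concept(unemployment)
print(concept.pref_label, concept.alt_labels)
```

In the first view the non-preferred term can carry relationships of its own; in the second it survives only as a label on the concept. That the conversion is this short is at least consistent with Dextre Clarke's point that the two standards are largely compatible.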

Taxonomies in practice

A practical session entitled ‘working with multidisciplinary teams – taxonomy tales from the trenches’ was led by experienced taxonomists and offered practical tips for how to negotiate with other stakeholders (project managers, IT specialists, etc.) in the construction of a taxonomy. Of particular relevance to ELSST, in the context of the evaluation work being undertaken in the SERISS project, was Lambe’s warning not to present your taxonomy to an ‘expert’ and ask them what they think of it. Experts should, he argued, be consulted about specific questions only.

Tools and technology

Many different software tools for constructing and managing taxonomies were presented and discussed. Common themes included using automatic indexing and/or crowd-sourcing to populate or improve taxonomies. Many speakers also stressed the importance of semantic technologies (SKOS, RDF, and Linked Open Data (LOD)) which allow mappings to other vocabularies and LOD. For example, Roger Press of Academic Rights Press Ltd described musicweb, a music portal that exploits LOD to ‘discover’ information about artists that is not otherwise available.

Andreas Blumauer of PoolParty described the advantages of graph databases over traditional databases, including their ability to map to data and documents stored in other systems and databases, and to support more complex queries.

In short, the need for taxonomies, and for people who know how to construct them, looks stronger than ever.

CESSDA-ELSST New Release 6 September 2016

The European Language Social Science Thesaurus (ELSST) is available in 12 languages: Czech, Danish, English, Finnish, French, German, Greek, Lithuanian, Norwegian, Romanian, Spanish and Swedish.

We are proud to announce that, as of September 2016, an average of 98% of Preferred Terms (PTs), including all Broader Terms, Narrower Terms and Related Terms, have been translated, and most languages are fully up to date with the source language, with 100% of PTs translated. The number of updates and improvements made across the 12 languages ranges from 150 to over 700 per language, with an average of 350.

We are also delighted to announce that our Swiss colleagues at FORS have completely reviewed and consistently applied all diacritics to their French translations.

Finally, each preferred term (PT) and its translations have a link to their equivalent SKOS Concept.

Details on the types of changes made since the previous release in January 2015 can be found on Changes to ELSST.

Embracing the ‘Data Revolution’: IASSIST Conference 2016

Suzanne Barbalet

Summer began early for many of us who had travelled to Bergen to attend IASSIST 2016, held from 31 May to 3 June, and we were grateful for the wonderful hospitality provided by the Norwegian Centre for Research Data (NSD).

CESSDA colleagues were well-represented at this year’s conference. The opportunity to meet Taina Jääskeläinen, Gry Henriksen and Irena Vipavc Brvar was particularly welcome.

Source: John Shepherdson. Bjorn Henrichsen, Director of NSD, with Heidi Tvedt and Gry Henriksen

The IASSIST Blog reports on the conference and provides an excellent summary of a selection of presentations in the parallel sessions. Some CESSDA work not mentioned in this blog includes a poster on a CESSDA Work Plan Task presented by Anne Etheridge and her co-authors, Wolfgang Zenk-Möltgen and Mari Kleemola, and John Shepherdson’s outline of the forthcoming CESSDA Research Infrastructure (CRI).

In the parallel sessions entitled ‘Big Data, Big Science’ Aidan Condron and I presented complementary papers. Aidan introduced the UK Data Service’s ‘big data’ architecture. The Service is in the process of designing an open data platform for social science which is implemented through a data lake. When complete it will enable social scientists to analyse resources ranging from large and complex datasets to combinations of data sources. Taking up a thread in Matthew Woollard’s IASSIST plenary address, in which he said that it is most useful now to talk about ‘new and novel’ forms of data, Aidan advocated taking an expansive view of how, when, and where data can be collected, stored, linked and analysed. Using a case study from the project Smarter Household Energy Data he demonstrated how exploratory data analysis methods could be employed to prepare the data for use within the social science research community.

My presentation, co-authored by Nathan Cunningham, focused on the requirements for a vocabulary service to augment this open data platform. Such a vocabulary service, it was proposed, could benefit from the use of a classification scheme to organise subject access for the purpose of exploratory data analysis. An application of the Universal Decimal Classification Scheme (UDC) had been trialled within the Archive as a tool to manage subject categories. Aida Slavic, the editor of UDC, has argued that ‘free text’ searching abated the interest in classification throughout the 1980s and 1990s (Slavic, 2008), but notes that the advent of subject gateways somewhat reversed this trend by using classification schemes to support mapping between different indexing systems (Slavic, 2006). These models inspired a trial application of UDC by the UK Data Archive.

The trial demonstrated that legacy classification is not a difficult task. For the same reasons that ‘free text’ searches are successful in the retrieval of important social science research concepts, a generic description of the research topic, via title and abstract, enables the subject content to be quickly classified. We reported on the trial and outlined an application for the use of UDC to support a vocabulary service to augment this open data platform.

Data librarians were well represented at this year’s conference and references were made to the forthcoming publication: Databrarianship: The Academic Data Librarian in Theory and Practice, which promises to be of interest to all colleagues.

SERISS project

Lorna Balkan

The UK Data Service has recently received funding to participate in an EU-funded project that complements the work being carried out within the CESSDA-ELSST project. The Synergies for Europe’s Research Infrastructures in the Social Sciences (SERISS) project is funded as part of the European Union’s Research and Innovation programme, Horizon 2020, and runs from 1 July 2015 to 30 June 2019. The kick-off meeting was held in London on 21-22 September 2015.

SERISS aims to address some of the key challenges for cross-national data collection, including facilitating greater harmonisation of data collection, analysis and curation across social science infrastructures. Work package 3: ‘Maximising equivalence through translation’, will investigate different approaches to the translation of questionnaires, and conduct a comparative empirical assessment of thesaurus keywords. This latter task, which the UK Data Service Thesaurus Team will be undertaking, will evaluate ELSST and compare the translation quality of ELSST concepts with that used in questions in surveys conducted by the European Social Survey (ESS) or Survey of Health, Ageing and Retirement in Europe (SHARE).

Part of the investigation of survey translation will look at the feasibility of applying computational linguistic methods to survey translation. It will be interesting to see if any of these tools can be applied to the translation of thesauri.

Given time restrictions, the evaluation of ELSST will focus on the French and German translations only. The chosen evaluation method is back-translation – the target language translations will be back-translated, the results will be compared with the original Source Language (i.e. English) terms from which the translations were made, and any differences noted. The CESSDA portal will also be interrogated to identify multilingual surveys indexed with the translated terms, and the terms applied to these surveys will be compared with the Source Language terms and any differences identified. Ambiguous terms will be reviewed and corrected to make their meaning clear, as is happening already within the CESSDA-ELSST project.

Based on SERISS findings, ELSST translation guidelines will be reviewed and a set of best practice guidance established.

A further quality assurance process will be carried out later in the project when the preferred term labels and meanings of ELSST concepts will be compared with the translated question content of surveys to identify exact and partial matches in terms of meaning.

Other CESSDA partners are also involved in other parts of the project – ADP (Slovenia), CSDA (Czech Republic), FORS (Switzerland), GESIS (Germany), and NSD (Norway). We look forward to fruitful collaboration.

Feedback welcome
If anyone has any comments on the thesaurus evaluation task, we would be glad to hear them. Please either add your comments to this blog, or send them to the UK Data Service Thesaurus Team at thesaurus@ukdataservice.ac.uk

Classification and Authority Control: UDC Seminar 2015

Suzanne Barbalet

Subject authority control, a once taken-for-granted principle of resource discovery, is a somewhat neglected topic that the recent Biennial International UDC Seminar in Lisbon raised to prominence at a two-day meeting on the 29th-30th October 2015.

One hundred delegates from twenty-six countries, many of them familiar faces at ISKO events, attended the Seminar. The programme comprised twenty presentations and six posters. Familiar names such as Marcia Zeng, Dagobert Soergel and Douglas Tudhope contributed to the presentations.

Barbara Tillett’s address reflected a lifetime’s work at the Library of Congress. Mapping issues were particularly topical in her discussion. She referred to the MACS (Multilingual Access to Subjects) project and observed that, although one-to-one mapping could never be systematically achieved, the range of terms assigned from the multiple language systems provided useful suggestions for users exploring concepts.

The editor of the Dewey Decimal Classification, Rebecca Green, and the editor of the Universal Decimal Classification, Aida Slavic, spoke on the many applications of powerful notational classification schemas that have been employed worldwide for over a century, and on their potential for linked data applications. Notation, it was argued, transcends language barriers and simplifies some of the information challenges posed by linked data; however, it requires leverage to integrate it better into the current information environment.

Of particular interest to thesauri users is the work of Andreas Ledl from the University of Basel, Switzerland and his ambitious register of thesauri, ontologies and classifications.

The work of Ulf Schoneberg and Wolfram Sperber from FIZ Karlsruhe – Leibniz-Institut für Informationsinfrastruktur, which adapted machine learning methods to meet the specific requirements of mathematical information, appeared to have exciting implications for the automatic indexing of variables.

The potential of automatic classification software was explored by Attila Piros, and a method of evaluating automatic subject indexing and classification was presented by Koraljka Golub, Joacim Hansson, Dagobert Soergel and Douglas Tudhope.

We presented results of our research into an application of the Universal Decimal Classification scheme for addressing the relationship between the DDI fields of <subject> and <keyword>. In a recent pilot study at the UK Data Archive the UDC schema proved economical to use for the classification of studies. The potential of classification schemes to work in a linked data environment underpins this work.

A unifying theme of the meeting was that at a time when subject cataloguers and indexers are under continuous pressure to justify the value of their work, success stories in subject retrieval can be linked to both the use of classification and the availability of subject control tools. The quality of these tools, as we are currently experiencing in our thesauri revision work, relies heavily on the expert knowledge of subject cataloguers and indexers.

