Notes from symposium on language technology and translation

Lorna Balkan

Translation tools and technologies have gained increasing importance in the translation sector over the years, but until now have been little applied to the specific field of survey translation. To rectify this, the SERISS project held a symposium on “synergies between survey translation and developments in language and translation sciences” at University Pompeu Fabra (UPF) in Barcelona on 1-2 June 2017. The meeting was attended by delegates from the three main surveys involved in the SERISS project (the European Social Survey (ESS), the Survey of Health, Ageing and Retirement in Europe (SHARE ERIC), and the European Values Study (EVS)), as well as by representatives of translation technology companies (cApStAn, Kantar Public and CentERdata) and by a leading academic in automated translation, Professor Toni Badia of UPF.

I was invited to report on my recent work on evaluating ELSST translation quality (see First results of SERISS project) and to consider prospects for automating the translation and evaluation of thesaurus terms.

Two general recommendations emerged from the symposium that are relevant to the translation of ELSST, as well as to survey translation.

First, it is important to analyse the whole life cycle of a product (not just the translation process) and to understand all the steps involved. Action taken prior to the translation phase has an impact on the translation process. We know this in ELSST, which is why we try, when creating new concepts, to choose source language labels and scope notes that will not present problems to the translators.

Second, it is critical to identify which steps a machine can perform better than humans. In the case of survey translation, this includes recognising questions that have not changed since the last wave of a survey, and which thus do not need to be retranslated. It also includes consistency checking in the quality assessment phase. Consistency checking would also be useful in ELSST, to make sure that source language terms that appear within other terms are given the same translation in the corresponding target language terms, where appropriate.

Delegates agreed that translation memories would be helpful to survey translation, since they store previously translated text which can be reused. The SERISS project is currently using CentERdata’s Translation Management Tool (TMT) for managing translation of its questionnaires. The TMT is not particularly relevant to ELSST right now, but we shall see how it develops in the course of the project. The plan is to integrate it with a translation memory in the near future.
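To make the idea of a translation memory concrete, here is a minimal sketch in Python. It is not a description of the TMT or of any commercial product: the class name, the similarity threshold of 0.8 and the sample question are all assumptions made for illustration. It stores source and target segments, returns an exact match when a segment has not changed (for example, a question repeated from an earlier wave), and otherwise offers the closest stored segment as a fuzzy match using the standard library's difflib.

```python
from difflib import SequenceMatcher


class TranslationMemory:
    """A toy translation memory: previously translated segments kept for reuse."""

    def __init__(self):
        self._segments = {}  # source text -> target text

    def add(self, source, target):
        self._segments[source] = target

    def lookup(self, source, threshold=0.8):
        """Return (match_type, stored_source, target), or None if nothing is close.

        An unchanged segment gives an "exact" match and needs no retranslation.
        Otherwise the most similar stored segment is returned as a "fuzzy"
        match, provided its similarity ratio reaches `threshold`.
        """
        if source in self._segments:
            return ("exact", source, self._segments[source])
        best_score, best_source = 0.0, None
        for stored in self._segments:
            score = SequenceMatcher(None, source, stored).ratio()
            if score > best_score:
                best_score, best_source = score, stored
        if best_source is not None and best_score >= threshold:
            return ("fuzzy", best_source, self._segments[best_source])
        return None


# Invented example segment, for illustration only.
tm = TranslationMemory()
tm.add("How old are you?", "Quel âge avez-vous ?")
print(tm.lookup("How old are you?"))      # unchanged question
print(tm.lookup("How old are you now?"))  # slightly reworded question
```

A fuzzy match is only a suggestion: the translator still has to adapt the stored translation to the reworded source.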

Toni Badia proposed that machine translation was mature enough to be able to offer a first draft of survey questions, if human resources were not available. He mentioned that a good starting point was phrase-based statistical machine translation systems, such as Moses, but noted that the paradigm is shifting towards neural machine translation which promises better quality. This is certainly something we could also consider for ELSST.

Another recommendation of the symposium was for translators of the different surveys to collaborate with each other. Each has a list of well-known translation problems. These problems would interest ELSST developers also, so we shall ask to be included in any future collaboration.

Observatory for Knowledge Organisation Systems workshop, Malta, 2017

Suzanne Barbalet

The Observatory for Knowledge Organisation Systems workshop, an event of KNOWeSCAPE – Analyzing the Dynamics of Information and Knowledge Landscapes, met in Malta from 1-3 February this year. KNOWeSCAPE is funded under the European Cooperation in Science and Technology (COST) framework.

Based on a European intergovernmental framework for cooperation in science and technology, COST supports trans-national cooperation across all fields in science and technology, including social sciences and humanities, through pan-European networking of nationally funded research activities. Contributors to this event were indeed representative of such a cross-section of Knowledge Organisation System (KOS) users and developers.

Philipp Mayr from the GESIS department of Knowledge Technologies for the Social Sciences (WTS) reported on case studies using two KOS applications for query expansion on the two platforms that his team manages, Sowiport and the Social Science Open Access Repository (SSOAR). These platforms cover bibliographic information and full text in the social sciences. When the two KOS applications were mapped to one another by subject specialists, it was found that the use of more than one KOS tool improved the user experience.

A presentation by Kalpana Shankar reported on a study of two data archives, UK Data Archive, University of Essex and Inter-university Consortium for Political and Social Research (ICPSR), University of Michigan, with the aim of raising larger questions about data sustainability. Still a work in progress, it promises to be an important piece of research on the history of social science data archives and their impact on the social sciences in the latter part of the twentieth century.

An interesting talk by Paul Groth of Elsevier, Netherlands, considered how the process of constructing a KOS might be changing with the incorporation of software agents and non-professional contributors, and suggested a role for a KOS observatory in engaging with these issues.

On the second day the workshop concluded with an interesting question and answer session on the collaborative process used in the development of Wikipedia content. A strong thread in the debate that ensued was the role a national Wikipedia may play in the preservation of cultural heritage.

In the first presentation of the cataloguing session Jan Kozlowski challenged us to consider the importance of providing metadata for modern European manuscripts. Two further presentations were on the Universal Decimal Classification (UDC) scheme. In her presentation, ‘Universal Knowledge Classifications: From Linking Information to Linked Data’, Aida Slavic stressed the importance of classification to ensure full retrieval for those instances where we cannot afford not to find everything. This argument was similar to that made by Patrick Lambe at the ISKO UK conference 2015 and reported on in ELSST Development and News.

A presentation by Peter Hook from Wayne State University entitled Visualizing Knowledge Organization Systems provided some interesting insights that have application for a project I have been piloting using UDC to manage the content of subject categories or topics.

I am particularly grateful to Andrea Scharnhorst of Data Archiving and Networked Services (DANS), Netherlands and Aida Slavic, editor-in-chief of the UDC Consortium for the invitation to attend.

First results of SERISS project

Lorna Balkan

Synergies for Europe’s Research Infrastructures in the Social Sciences (SERISS) is a four-year project (2015-2019) funded by the European Commission as part of its Horizon 2020 programme. It aims to foster collaboration and develop shared standards between the three leading European research infrastructures in the social sciences – the European Social Survey (ESS ERIC), the Survey of Health, Ageing and Retirement in Europe (SHARE ERIC), and the Consortium of European Social Science Data Archives (CESSDA AS) – and organisations representing the Generations and Gender Programme (GGP), European Values Study (EVS) and the WageIndicator Survey. Work focuses on three key areas: addressing key challenges for cross-national data collection, breaking down barriers between social science infrastructures, and embracing the future of the social sciences.

The first results of the project are now available online. These include D3.9 ‘Report on findings from re-translation of ELSST terms and their use in the CESSDA Portal’, which reports on work done by the UK Data Service. The deliverable, which is in two parts, describes two methods that were used to assess the translation quality of ELSST terms.

Back-translation

The first part describes the evaluation of a subset of ELSST French and German terms (1000 from each language) using the re-translation (or more precisely, the back-translation) method. The French and German terms were back-translated into the source language (English) and differences between the back-translations and the original source language terms were then analysed. This resulted in a classification of error types, and a number of recommendations. The deliverable shows that, while the back-translation method was useful in highlighting some issues with the thesaurus that affect both its ‘semantic adequacy’ (i.e. how adequate terms are from a semantics point of view) and its ‘formal adequacy’ (i.e. the extent to which terms conform to ELSST Translation Guidelines), it has nothing to say about ‘pragmatic adequacy’ (i.e. how acceptable terms are to users), or how the terms would function in an operational setting. Back-translation should, therefore, be seen as one of several complementary evaluation methods.
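The comparison step of back-translation lends itself to a simple first pass by machine before any human analysis. The sketch below, in Python, is only an illustration under stated assumptions: the function name, the three status labels and the sample terms are invented here and are not the error classification used in the deliverable. It normalises case and spacing, then sorts each pair into identical, partially overlapping or different, so that a human evaluator can concentrate on the pairs that differ.

```python
def compare_back_translations(pairs):
    """Triage (original source term, back-translated term) pairs.

    Returns a list of (original, back_translation, status) tuples, where
    status is "identical", "partial overlap" (at least one shared word) or
    "different". The labels are a first-pass filter only; deciding whether a
    difference is a real translation problem still needs a human.
    """
    report = []
    for original, back in pairs:
        orig_norm = " ".join(original.lower().split())
        back_norm = " ".join(back.lower().split())
        if orig_norm == back_norm:
            status = "identical"
        elif set(orig_norm.split()) & set(back_norm.split()):
            status = "partial overlap"
        else:
            status = "different"
        report.append((original, back, status))
    return report


# Invented examples, for illustration only.
sample = [
    ("UNEMPLOYMENT", "Unemployment"),
    ("CHILD CARE", "child minding"),
    ("HOUSING", "accommodation"),
]
for original, back, status in compare_back_translations(sample):
    print(f"{original!r} -> {back!r}: {status}")
```

Note that a pair marked "different" may simply be a synonym, which is one reason the method says nothing about how acceptable a term is to users.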

ELSST in use

One such complementary evaluation method is to compare the sets of terms that have been used to index the same resources, to see if differences are due to differences in how the terms have been interpreted, indicating unintended ambiguity in either the source or target terms. This approach is explored in the second part of the deliverable which compares the sets of ELSST terms that have been used to index specific cross-national surveys. The original plan was to use the CESSDA portal to find such studies, but this had to be revised since the portal has not been operational for some time. Instead, CESSDA-ELSST partners were asked via a questionnaire how they index a set of cross-national surveys. Many thanks to all who responded. Differences in the sets of terms assigned to each survey were then analysed. Results showed that, due to the paucity of the data and the differences in indexing practices across archives, it was not possible to draw any firm conclusions on the quality of the translation. However, the work highlighted ways in which ELSST could be better exploited within the archives.
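One simple way to quantify differences between the sets of terms assigned to the same survey is a set-overlap measure. The sketch below, in Python, uses the Jaccard coefficient (size of the intersection divided by size of the union); this measure is our own choice for illustration and is not necessarily what was used in the deliverable. It assumes the terms from each archive have already been mapped to a common language, for example via their shared ELSST concept, and the archive names and terms are invented.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient of two term collections (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def compare_indexing(terms_by_archive):
    """Compare every pair of archives that indexed the same survey.

    `terms_by_archive` maps an archive name to the set of terms it assigned.
    For each pair the result holds the overlap score and the terms used by
    only one of the two archives, which are the ones worth inspecting.
    """
    results = {}
    for name_a, name_b in combinations(sorted(terms_by_archive), 2):
        terms_a = set(terms_by_archive[name_a])
        terms_b = set(terms_by_archive[name_b])
        results[(name_a, name_b)] = {
            "jaccard": jaccard(terms_a, terms_b),
            "only_in_first": sorted(terms_a - terms_b),
            "only_in_second": sorted(terms_b - terms_a),
        }
    return results


# Invented archives and terms, for illustration only.
survey_terms = {
    "archive_a": {"HEALTH", "AGEING"},
    "archive_b": {"HEALTH", "RETIREMENT"},
}
for pair, info in compare_indexing(survey_terms).items():
    print(pair, info)
```

A low score on its own does not indicate a translation problem, since archives may differ in how many terms they assign or how specific they are.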

Next steps

The results of the evaluation work described above will feed into ongoing work on HASSET and ELSST within the CESSDA-ELSST project, and into the next goals of the SERISS project. SERISS goals in the next phase (to June 2017) include updating the ELSST translation guidelines and producing the next deliverable: D3.10 ‘Best practice document on translation and use of thesaurus terms’. This work will be produced in consultation with CESSDA-ELSST partners.

Related SERISS work

Complementary work within SERISS is looking at how to improve the translation quality of the questionnaires in cross-national surveys. Different approaches to questionnaire translation are being investigated, including how computational linguistic methods could be exploited. A workshop on this last topic is planned in the near future.

Feedback

If you have any comments on this blog, please either add them below, or send them to the UK Data Service Thesaurus Team at thesaurus@ukdataservice.ac.uk

Showcasing taxonomies and their uses

Lorna Balkan

The first Taxonomy Boot Camp London was held at the Olympia Conference Centre, London on 18-19 November. It was a chance for experts and novices alike to find out about the latest developments in taxonomies and their uses. It was also a chance to present first findings on the evaluation work I have been doing on ELSST within the context of the SERISS project – see my presentation on ‘Using back-translation for quality control in a multilingual thesaurus’.

The boot camp attracted an international audience from the academic, charity, and corporate sectors. ‘Taxonomy’ was understood in a general sense to cover thesauri, ontologies, and other less formal knowledge organization systems (KOSs). Presentations covered the design and construction of taxonomies, application areas such as search and corporate information/knowledge management, and software tools.

Keynote speakers were Mike Atherton, Content Strategist at Facebook, UK and Patrick Lambe, a well-known taxonomist from Straits Knowledge. Atherton argued that underpinning a website with a domain model makes it easier to maintain and update. Lambe showed how taxonomies can be used to organize corporate knowledge (including processes and procedures) as well as corporate information (data).

Taxonomies in theory

Heather Hedden, the author of The Accidental Taxonomist, gave an interesting presentation on non-preferred terms. She discussed how, in the term-based thesaurus model represented by ISO 25964-1 (and found in ELSST), non-preferred terms stand in an equivalence relationship to their preferred terms, while in the concept-based Simple Knowledge Organization System (SKOS) model, they are attributes of concepts. Stella Dextre Clarke, co-author of ISO 25964-1, pointed out that both standards were developed in close collaboration and are largely compatible. While ISO 25964-1 is built for thesauri, she noted, SKOS is designed to accommodate other KOSs also.
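The difference between the two models can be pictured with a small data-structure sketch in Python. This is a simplification for illustration, not an implementation of either standard: the class and function names, the example URI and the sample terms are all invented. In the term-based view a non-preferred term is an entry in its own right that points to its preferred term (the familiar USE / UF relationship), while in the SKOS view the same wording becomes an alternative label (skos:altLabel) attached to a concept alongside its preferred label (skos:prefLabel).

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """Term-based view: every term, preferred or not, is its own entry."""
    label: str
    use: "Term | None" = None                      # set only on non-preferred terms
    used_for: list = field(default_factory=list)   # non-preferred terms pointing here


@dataclass
class Concept:
    """Concept-based (SKOS-style) view: labels are attributes of the concept."""
    uri: str
    pref_label: str
    alt_labels: list = field(default_factory=list)


def to_concept(preferred, uri):
    """Fold a preferred term and its non-preferred terms into one concept."""
    return Concept(
        uri=uri,
        pref_label=preferred.label,
        alt_labels=[term.label for term in preferred.used_for],
    )


# Invented example terms and URI, for illustration only.
preferred = Term("UNEMPLOYMENT")
non_preferred = Term("JOBLESSNESS", use=preferred)
preferred.used_for.append(non_preferred)

print(to_concept(preferred, "http://example.org/concept/1"))
```

The conversion runs cleanly in this direction, which is in keeping with the point that the two standards are largely compatible.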

Taxonomies in practice

A practical session entitled ‘working with multidisciplinary teams – taxonomy tales from the trenches’ was led by experienced taxonomists and offered practical tips for how to negotiate with other stakeholders (project managers, IT specialists, etc.) in the construction of a taxonomy. Of particular relevance to ELSST, in the context of the evaluation work being undertaken in the SERISS project, was Lambe’s warning not to present your taxonomy to an ‘expert’ and ask them what they think of it. Experts should, he argued, be consulted about specific questions only.

Tools and technology

Many different software tools for constructing and managing taxonomies were presented and discussed. Common themes included using automatic indexing and/or crowd-sourcing to populate or improve taxonomies. Many speakers also stressed the importance of semantic technologies (SKOS, RDF, and Linked Open Data (LOD)) which allow mappings to other vocabularies and LOD. For example, Roger Press of Academic Rights Press Ltd described musicweb, a music portal that exploits LOD to ‘discover’ information about artists that is not otherwise available.

Andreas Blumauer of PoolParty described the advantages of graph databases over traditional databases, including their ability to map to data and documents stored in other systems and databases, and to support more complex queries.

In short, the need for taxonomies, and for people who know how to construct them, looks stronger than ever.

CESSDA-ELSST New Release 6 September 2016

The European Language Social Science Thesaurus (ELSST) is available in 12 languages: Czech, Danish, English, Finnish, French, German, Greek, Lithuanian, Norwegian, Romanian, Spanish and Swedish.

We are proud to announce that, as of September 2016, an average of 98% of all Preferred Terms (PTs), including all Broader Terms, Narrower Terms and Related Terms, have been translated. Most languages are fully up to date with the source language, with 100% of PTs translated. The number of updates and improvements made across the 12 languages ranges from 150 to over 700 per language, with an average of 350.

We are also delighted to announce that our Swiss colleagues at FORS have completely reviewed and consistently applied all diacritics to their French translations.

Finally, each preferred term (PT) and its translations have a link to their equivalent SKOS Concept.

Details on the types of changes made since the previous release in January 2015 can be found on Changes to ELSST.

Embracing the ‘Data Revolution’: IASSIST Conference 2016

Suzanne Barbalet

Summer began early for many of us who had travelled to Bergen to attend IASSIST 2016, held from 31 May to 3 June, and we were grateful for the wonderful hospitality provided by the Norwegian Centre for Research Data (NSD).

CESSDA colleagues were well-represented at this year’s conference. The opportunity to meet Taina Jääskeläinen, Gry Henriksen and Irena Vipavc Brvar was particularly welcome.

Photo: Bjorn Henrichsen, Director of NSD, with Heidi Tvedt and Gry Henriksen. Source: John Shepherdson.

The IASSIST Blog reports on the conference and provides an excellent summary of a selection of presentations in the parallel sessions. Some CESSDA work not mentioned in this blog includes a poster on a CESSDA Work Plan Task presented by Anne Etheridge and her co-authors, Wolfgang Zenk-Möltgen and Mari Kleemola, and John Shepherdson’s outline of the forthcoming CESSDA Research Infrastructure (CRI).

In the parallel sessions entitled ‘Big Data, Big Science’ Aidan Condron and I presented complementary papers. Aidan introduced the UK Data Service’s ‘big data’ architecture. The Service is in the process of designing an open data platform for social science which is implemented through a data lake. When complete it will enable social scientists to analyse resources ranging from large and complex datasets to combinations of data sources. Taking up a thread in Matthew Woollard’s IASSIST plenary address, in which he said that it is most useful now to talk about ‘new and novel’ forms of data, Aidan advocated taking an expansive view of how, when, and where data can be collected, stored, linked and analysed. Using a case study from the project Smarter Household Energy Data he demonstrated how exploratory data analysis methods could be employed to prepare the data for use within the social science research community.

My presentation, co-authored by Nathan Cunningham, focused on the requirements for a vocabulary service to augment this open data platform. Such a vocabulary service, it was proposed, could benefit from the use of a classification scheme to organise subject access for the purpose of exploratory data analysis. An application of the Universal Decimal Classification Scheme (UDC) had been trialled within the Archive as a tool to manage subject categories. Aida Slavic, the editor of UDC, has argued that ‘free text’ searching abated the interest in classification throughout the 1980s and 1990s (Slavic, 2008), but notes that the advent of subject gateways somewhat reversed this trend by using classification schemes to support mapping between different indexing systems (Slavic, 2006). These models inspired a trial application of UDC by the UK Data Archive.

The trial demonstrated that legacy classification is not a difficult task. For the same reasons that ‘free text’ searches are successful in the retrieval of important social science research concepts, a generic description of the research topic, via title and abstract, enables the subject content to be quickly classified. We reported on the trial and outlined an application for the use of UDC to support a vocabulary service to augment this open data platform.

Data librarians were well represented at this year’s conference and references were made to the forthcoming publication: Databrarianship: The Academic Data Librarian in Theory and Practice, which promises to be of interest to all colleagues.

SERISS project

Lorna Balkan

The UK Data Service has recently received funding to participate in an EU-funded project that complements the work that is being carried out within the CESSDA-ELSST project. The Synergies for Europe’s Research Infrastructures in the Social Sciences (SERISS) project is funded as part of the European Union’s Research and Innovation programme, Horizon 2020, and runs from 1 July 2015 to 30 June 2019. The kick-off meeting was held in London on 21-22 September 2015.

Overview

SERISS aims to address some of the key challenges for cross-national data collection, including facilitating greater harmonisation of data collection, analysis and curation across social science infrastructures. Work package 3, ‘Maximising equivalence through translation’, will investigate different approaches to the translation of questionnaires, and conduct a comparative empirical assessment of thesaurus keywords. This latter task, which the UK Data Service Thesaurus Team will be undertaking, will evaluate ELSST and compare the translation quality of ELSST concepts with that of questions in surveys conducted by the European Social Survey (ESS) or the Survey of Health, Ageing and Retirement in Europe (SHARE).

Part of the investigation of survey translation will look at the feasibility of applying computational linguistic methods to survey translation. It will be interesting to see if any of these tools can be applied to the translation of thesauri.

Methodology

Given time restrictions, the evaluation of ELSST will focus on the French and German translations only. The chosen evaluation method is back-translation – the target language translations will be back-translated, the results will be compared with the original Source Language (i.e. English) terms from which the translations were made, and any differences noted. The CESSDA portal will also be interrogated to identify multilingual surveys indexed with the translated terms, and the terms applied to these surveys will be compared with the Source Language terms and any differences identified. Ambiguous terms will be reviewed and corrected to make their meaning clear, as is happening already within the CESSDA-ELSST project.

Based on SERISS findings, ELSST translation guidelines will be reviewed and a set of best practice guidance established.

A further quality assurance process will be carried out later in the project when the preferred term labels and meanings of ELSST concepts will be compared with the translated question content of surveys to identify exact and partial matches in terms of meaning.

Other CESSDA partners are also involved in other parts of the project – ADP (Slovenia), CSDA (Czech Republic), FORS (Switzerland), GESIS (Germany), and NSD (Norway). We look forward to fruitful collaboration.

Feedback welcome

If anyone has any comments on the thesaurus evaluation task, we would be glad to hear them. Please either add your comments to this blog, or send them to the UK Data Service Thesaurus Team at thesaurus@ukdataservice.ac.uk
