On classification

Lorna Balkan

The University of Essex recently hosted two events of interest to the CESSDA-ELSST project. The first was the Big Data and Science Summer School, held by the Faculty of Science and Health on 7-8 July 2014; the second was the 50th anniversary meeting of the British Classification Society on 9 July. The subject of classification was raised at both events.

Paul Taylor of the Centre for Health Informatics and Multiprofessional Education, University College London, discussed the proposed linkage of administrative healthcare data in the UK, namely hospital discharge summaries and GP data. These kinds of data are used to inform managers and policy makers, so accuracy is important. Hospital records are coded by professional coders, using the International Classification of Diseases (ICD) and OPCS, while GPs code their own data. Taylor explained that inconsistent coding stems from imprecise code descriptions (e.g. ICD code Y34 is described as ‘unspecified event, undetermined intent’) and from plain human error. He argued that classification systems should not only have clear definitions but should also avoid unnecessary complexity. He gave as an example the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT), the new classification system being adopted by the NHS for healthcare records, which is proving unpopular with GPs due to its complexity.

ELSST developers have long recognised the need for clear scope notes and definitions of terms, and we are adding definitions where we can when updating HASSET and ELSST.
The importance of defining exactly what we mean when we use a code/term was raised again at the British Classification Society meeting.

The focus here was on automatic classification. Professor David Hand of Imperial College London opened the meeting with a review of the main challenges still facing the two main types of classification: unsupervised, where the set of classes to be found in the data is not predefined, and supervised, where the set of classes is known in advance. Automatic indexing, which the CESSDA-ELSST team worked on in the SKOS-HASSET project, is an example of supervised classification. The set of known classes in this case was the set of HASSET terms. Professor Hand cited uncertainty of class labels and definitions as one of the main problems for supervised learning, particularly where meanings vary over time.
Part of the work we are doing to update HASSET and ELSST is to make sure that terms and their definitions are current.
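
To make the distinction between supervised and unsupervised classification concrete, here is a minimal Python sketch. It is a toy illustration only, not the method used in the SKOS-HASSET project: the function names, the sample texts and the two labels (which merely stand in for thesaurus terms) are all assumptions made for the example.

```python
from collections import Counter


def tokens(text):
    """Split a text into lower-case words."""
    return text.lower().split()


def train(labelled):
    """Supervised: build one word-count profile per predefined class."""
    profiles = {}
    for text, label in labelled:
        profiles.setdefault(label, Counter()).update(tokens(text))
    return profiles


def classify(text, profiles):
    """Assign the predefined class whose profile shares most words with the text."""
    words = set(tokens(text))
    return max(sorted(profiles),
               key=lambda label: sum(profiles[label][w] for w in words))


def cluster(texts):
    """Unsupervised: group texts that share a word; no classes are given."""
    groups = []  # each entry: (set of words seen, list of member texts)
    for text in texts:
        words = set(tokens(text))
        members = [text]
        remaining = []
        for g_words, g_texts in groups:
            if g_words & words:
                # Overlapping vocabulary: merge this group into the new one.
                words |= g_words
                members = g_texts + members
            else:
                remaining.append((g_words, g_texts))
        groups = remaining + [(words, members)]
    return [g_texts for _, g_texts in groups]


# Hypothetical training data: the labels stand in for thesaurus terms.
labelled = [
    ("unemployment benefits claimants", "UNEMPLOYMENT"),
    ("hospital patients waiting lists", "HEALTH SERVICES"),
]
profiles = train(labelled)
predicted = classify("survey of hospital patients", profiles)
found = cluster(["hospital patients", "patients waiting", "unemployment benefits"])
```

In the supervised case the answer can only ever be one of the labels supplied in the training data, which is why unclear or shifting label definitions matter so much. In the unsupervised case the groups emerge from the texts themselves and carry no names until someone interprets them.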

A more fundamental question was also raised at this meeting, namely whether or not there is any such thing as a ‘natural’ class, i.e. a class that is found in nature, or whether all classes are merely ‘convenient’. Since ELSST is a social sciences thesaurus, I don’t think any of us would claim that its terms are anything other than convenient tools for studying society.
