
Gavan McCarthy, Australian Science and Technology Heritage Centre, University of Melbourne, Australia

"All is Digital: Scientists' records in a connected information universe"


In the abstract for this paper I said ‘The archiving of the records of scientists from the mid twentieth century onwards has generally proved to be problematic and this is usually associated with the introduction of new information technologies, in particular, digital technologies.’

There are various issues that I would like to tackle in this presentation which stem from three very different projects that are currently taking up most of our time at Austehc:

The Australian Dictionary of Biography Online project
The Radioactive Waste Information Preservation and Transfer safety report for the IAEA – covered earlier but I will briefly re-visit it.
The archiving of the personal records of the Aboriginal anthropologist and historian, Diane Barwick

All three projects have a major digital component that drives the work, and they reveal different aspects of the digital archival world.

When I first started drafting this paper I addressed the issues associated with the digital world from a conceptual viewpoint and thus attempted to isolate the underlying drivers or fundamental issues – as much as anything to help develop my own understanding. However, on further consideration, particularly in the context of this conference, I have also incorporated the opposite approach and come from the pragmatic angle, via real projects and real issues, to reveal those same fundamental issues and challenges that drive the day-to-day decisions we have to make at work.

However, the conceptual approach gets first go, so I would like to start with that reference I made in the abstract to the work of Hans Christian von Baeyer in Information: the new language of science (Weidenfeld & Nicolson, London, 2003), who argues convincingly, drawing on ideas from the landmark twentieth-century American physicist John Archibald Wheeler, that all information, at its most fundamental, is digital and that, furthermore, the physical world comes from information.

In other words, all information, and by extension the physical world, can be reduced to a sequence of 0s and 1s. Baeyer takes 234 pages to reveal the complexities and intricacies that this postulate raises, which I will not attempt to address here. Moreover, he extends this notion into the quantum world, and I will not pretend that I understand this argument, though at the time of reading it seemed to make sense. But for those of us grounded in the physicality of records, there are embedded issues in this assertion that should cause us to think more deeply about the work we do and how we approach the archival challenges of the digital world.

If all is digital, then what is it that we call "digital", and what is material-based information (analogue, print, paper, film, etc.) if it too is, at its essence, digital (i.e. 0s and 1s)? Perhaps the answer is simple: for information to be usable by human beings it must all be coded, abstracted, stylised, given form, structured and embodied in an object that exists in the human dimensions.

The total amount of information in the form of 0s and 1s in even 1 gram of paper (perhaps with some text – areas of coded black and white regions), where we are talking about the sub-atomic level, is beyond comprehension and certainly will never (my assertion) be within the bounds of computer technology to manage.

So when we compare the various codings of the human information content that may be present in a gram of paper we have at one extreme a “digital source” beyond “size” and at the other extreme a coded version of the content, say in ASCII, that might be as small as just a few bytes – a stream of 32 0s and 1s.
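The contrast can be made concrete with a small sketch. The word chosen and the page dimensions are illustrative assumptions only; the point is the gulf between a minimal coding of the human content and a richer coding of the same physical carrier.

```python
# A four-character word encoded in ASCII: one byte (8 bits) per character.
word = "gram"
encoded = word.encode("ascii")

bits = len(encoded) * 8
print(bits)  # 32 -- the "stream of 32 0s and 1s"

# Compare with an uncompressed scan of an A4 page at 300 dpi, 24-bit colour:
# roughly (8.27 * 300) by (11.69 * 300) pixels at 3 bytes per pixel.
scan_bytes = int(8.27 * 300) * int(11.69 * 300) * 3
print(scan_bytes)  # on the order of 26 million bytes for the same carrier
```

Both are codings of the same source; neither captures its full "digital" content in Baeyer's sense.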

On one side we have code that can be managed by contemporary digital technologies (for example: the ASCII code; digital image code in the form of jpeg or tiff or dng; or in proprietary code such as MS Word document code), and on the other a digital source beyond management.

However we now must jump very quickly – make a quantum leap – to the point I want to make, otherwise this level of inquiry will fill the whole time available. If we want to utilize digital technologies to assist us in our human information endeavours, particularly in relation to archives and records, we need to compromise. The questions, then, are: where are the thresholds in digital coding at which digital sources meet the same human requirements that, in the material world, evolved into records we find work quite well (though not perfectly)?

A practical example – a myth created by the technologists - is that it is much cheaper to store digital information than material-based information. However, this is like comparing apples and salt (not even oranges – they are too similar). We are comparing two codings of information and both need to be evaluated for the properties they contain before any comparison of cost and value can be made.

In practice, as you increase the level of digital coding of a source to meet additional requirements, the file size increases significantly in a non-linear fashion, and the cost of handling, storage and management also increases in a non-linear fashion. It would probably be possible, at least conceptually, to graph the various costs and look for the thresholds and crossover points at which the cost of keeping a digital version of a source – one with sufficient of the necessary properties of the material version – intersects the cost of keeping the material version itself.
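The crossover idea can be sketched numerically. The cost functions below are invented purely for illustration (a flat material-storage cost against a digital cost that grows non-linearly with coding richness); the point is the method of locating the threshold, not the figures.

```python
# Hypothetical annual cost (arbitrary units) of keeping one source,
# as a function of the richness level r of the digital coding.
def material_cost(r):
    return 100.0  # material storage: flat, independent of coding level

def digital_cost(r):
    return 5.0 * r ** 2  # digital: grows non-linearly with coding richness

# Find the threshold coding level at which digital becomes dearer.
crossover = next(r for r in range(1, 100) if digital_cost(r) > material_cost(r))
print(crossover)
```

Below the threshold the digital version is the cheaper coding; above it, the material version wins on cost, though other properties still need to be weighed.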

When you take into consideration all the aspects of working digitally, including bandwidth, speed, storage and maintenance, my feeling is that when you compare like with like (apples with apples) today, the cost of digital exceeds the cost of material-based preservation. There are of course other considerations, like machine and technology dependencies, that need to be traded off against added functionality and utility.

However, I do not think this will always remain so; at some point in the foreseeable future the crossover will occur, and from that point "all may be digital" from a preservation and access perspective – bearing in mind that we must always bring the sources back to the human interface, in material-based versions created as required for particular human uses.

The Einstein Information Equation

I ~ mc²

Although this may be entirely fanciful, the number of bits in a gram of paper may be a figure of the order given by the equation above.

I ~ 1 × 330,000 × 330,000 = 108,900,000,000

In bytes (at 8 bits per byte) this is 13,612,500,000 – roughly 13,600 megabytes, or about 13.6 gigabytes. This seems low and therefore represents a particular level of coding but not the full information quotient.
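Treating the fanciful figure above as bits, the unit conversion runs as follows. This is a back-of-envelope check only; the 330,000 is the playful stand-in used above, and one byte is 8 bits.

```python
# Back-of-envelope check of the fanciful equation above.
m = 1                # grams
c = 330_000          # the playful stand-in for c used above
bits = m * c * c
print(bits)          # 108,900,000,000 bits

bytes_ = bits // 8   # 8 bits per byte
gigabytes = bytes_ / 1_000_000_000
print(bytes_, gigabytes)  # 13,612,500,000 bytes, about 13.6 GB
```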

How much do I want to go into this? No further!!!

What are the pragmatic challenges?

I think it is time to leave the conceptual world and address some pragmatic issues.

1. The Australian Dictionary of Biography Online

In 2004 the Australian Science and Technology Heritage Centre commenced a major, multi-year project with the Australian National University to take the print version of Australia’s national biographical dictionary, put it into a database and make it available online. What we brought to the project was the database technology in the form of the Online Heritage Resource Manager, which was developed originally to run Bright Sparcs but had been further developed as a generic contextual information management system. The Australian National University brought to the project the Australian Dictionary of Biography itself: both the organization that had been producing the biographies since the mid-1960s and the final published versions that were to be the starting point for the online edition.

The key challenge in this project has proven to be the amount of variability that was found in the original text despite the appearances of uniform structures and practices. What looked to be elegant, well-structured information produced to the highest standards of scholarship has without fail in every aspect been found to contain hidden inconsistencies, unrecognized variables, kludges, and one-off compromises that have surprised all involved.

The point for this paper is that this experience reinforces the fundamental differences between the two information technologies and the compromises that must be made when using either or both. This applies particularly to the ADB, as both the print versions (existing and forthcoming) and the online version will co-exist for the foreseeable future.

2. Radioactive Waste Information Preservation and Transfer

The key challenge in this project was understanding why this process was failing and therefore the need to re-conceptualize the problem space to come up with new strategies and ideas.

There are two key issues related to the preservation and transfer to future generations of information important to the history of science, technology and medicine namely:

· Epistemic failure – where there has been inadequate preservation of the knowledge necessary to explain the context, structure and meaning of the information (known in some fields as metadata); and

· Physical loss – where physical changes in or destruction of either the medium or the supporting technology have rendered the information unreadable.

The report explores problems surrounding the effective use of information resources through time and the critical role played by contextual information in enabling them to be meaningfully understood.  The intrinsic properties of contextual information are examined and a strategy proposed whereby a global network of contextual information could provide the framework for integrating all information resources related to radioactive waste disposal activities in a meaningful way. It is noted that the concepts developed in this report could be applied equally to other areas where there is a need for the preservation and transfer of information to future generations.

3. The Records of Diane Barwick

Diane Barwick was a Canadian who came to Australia in 1960 to commence work on a PhD in Aboriginal anthropology at the Australian National University – a young female student, at a very new university (founded in the mid-1950s), tackling a subject that nobody else had attempted, namely an anthropological study of Victorian urban Aboriginal communities. She was without doubt a courageous and extraordinary individual. She was born in 1938, so was just 22 years old when she commenced her PhD. Her study became her life’s work, which was tragically cut short when she died, completely without warning, in April 1986 as the result of a brain haemorrhage.

Her records were subsequently moved from the various offices she had at the ANU and the Australian Institute of Aboriginal Studies to the family home, where they have remained ever since. Diane had spent 26 years utilising every possible information source she could locate, from existing government archives to interviews and personal knowledge from the people themselves, to piece together the family genealogies of a people that had been appallingly treated by successions of colonial and post-colonial administrations. Her records are vast, complex and culturally very sensitive, and the key to their structure, function and form was mostly in her head when she died.

The key challenge with this project was/is to find the key to a vastly complex information set and then re-present the materials in a form that would be manageable for future users.

The challenge then is to find structure and form in the records as left but also to find records that act as keys to understanding other records in the collection. For example: her field or operational notebooks which map the unfolding of her anthropological work and subsequently the complex web of materials that grew with her as she moved through the various phases of her life.

It is planned to digitise part of the collection (using digital photographic imaging) to improve accessibility. However, this will only work if we can come up with a functional structure that will provide future users with easy access to all the contextual information they will require to make sense of her notes, files, charts, and coded references. This is probably the most challenging archival project I have yet come across and I am glad of the 25 years of archival experience that I have had to prepare me for this.

Once I actually started the “accession” process – that is, the survey and mapping of the collection as found in the home (and for this part of the process digital photography is also extraordinarily valuable and useful) – I realised that the work on the inventory description (the file-by-file documentation and, where selected, imaging) should not be undertaken by a professional archivist but by a young person looking to make a career in Aboriginal history or anthropology, under professional archival supervision. It is too important that the knowledge that was in Diane Barwick’s mind be transferred to another mind (as far as possible) and further grown and utilised over another lifetime. It is not enough to just have that knowledge abstracted into an archival finding aid – no matter how good it might be.

Science – a connected information universe

Science is a connected information universe.

The primary output from scientific practice is typically preserved as stylised information condensed and abstracted in scientific publications and transferred to future users through the established library system.  Furthermore, the scientific practice of citation of related work captures part of the epistemic framework required for comprehension.

The publications create a connected information network, but essentially it still exists only conceptually.

Each only links (cites) those works it is required to cite to establish its position in scientific practice – it is therefore localised and creates its own context – but this context at the time of creation is only looking backwards. If the paper is cited sometime in the future – the original paper is not modified to reflect the historical network of connections that map through all directions in time.

Bibliometric studies have been undertaken to measure and map these linkages for particular purposes – within disciplines (the richer rather than the poorer ones) and for different reasons, such as measuring the quality and impact of scientific work for the purposes of allocating funding – but this information universe is so large that it has not been possible, with past technologies and practices (in particular print-based systems), to conceive of a means by which this objective may be reached.

However, the web and semantically rich markup languages fundamentally change the operating principles, and a fully dimensionalised network of citations becomes a possibility.

And it is not through a huge increase in work (and therefore cost) but through working smarter – being clever with the use of metadata – and understanding the fundamental information infrastructure that would be required to support such a venture.

Publications are not just an endpoint but part of an ongoing process.
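That ongoing process can be sketched with a toy citation index (all paper identifiers are hypothetical). Print fixes only the backward-looking links a paper makes at publication; a networked index can derive the forward-looking links – who later cited it – without ever modifying the original.

```python
# Backward-looking citations, as fixed at the moment of publication.
cites = {
    "paper_1962": [],
    "paper_1975": ["paper_1962"],
    "paper_1990": ["paper_1962", "paper_1975"],
}

# The forward-looking view is derived from the network,
# not written back into the original papers.
cited_by = {}
for paper, references in cites.items():
    for ref in references:
        cited_by.setdefault(ref, []).append(paper)

print(cited_by["paper_1962"])  # ['paper_1975', 'paper_1990']
```

The 1962 paper is untouched, yet the network now maps its connections in both directions in time.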

But publications are just one component of the information universe of science. Other components in which information is bound include:

Knowledge held in the minds of scientists
Transactional documents including such things as correspondence, reports, lab note books – the stuff that archivists generally deal with
Objects – samples, equipment, buildings, products, etc

The information embedded in all these components of the information universe (which is a particularly human thing) is, in some way, a record of past activities, and as such these components all play a part in the information universe in which archivists have an important role.

It is not sufficient for archivists to work on their component part in isolation – it must be done in a way that enables all the elements or components to be interconnected and utilised in a functional way.

The introduction of a public domain information network with universal acceptance, such as the web, fundamentally changes the nature of the public domain.

In the past, being published meant that a publisher (entity 1 – person/organisation) entered a contract with an author (entity 2 – person/organisation) to print and distribute a work (entity 3 – resource) at a certain time (entity 4 – event) from a certain place (entity 5 – location). The very act of publishing in this way creates a network of relationships between entities that imbues the product with various qualities, giving users of the product a substantive foundation for evaluating the meaning they may extract from the content of the resource.

Citations to other resources further extend this network.
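The five-entity view of publishing sketched above can be made explicit as a tiny data model. The class and field names are illustrative only; the sample values are drawn from the von Baeyer book cited earlier.

```python
from dataclasses import dataclass, field

@dataclass
class Publication:
    """The network of entities created by the act of publishing."""
    publisher: str  # entity 1 - person/organisation
    author: str     # entity 2 - person/organisation
    work: str       # entity 3 - resource
    date: str       # entity 4 - event
    place: str      # entity 5 - location
    citations: list = field(default_factory=list)  # links to other resources

p = Publication("Weidenfeld & Nicolson", "H. C. von Baeyer",
                "Information: the new language of science", "2003", "London")
p.citations.append("Wheeler on information and physics")
print(p.publisher, len(p.citations))
```

Each citation appended to the record extends the network of relationships beyond the original five entities.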

However, there are other links that are less explicit, such as links to language, especially links to glossaries and dictionaries that capture the meaning of the language at the time of creation. But there are also other assumed relationships that readers are presumed to possess so that they can decode the publication, such as: a fundamental knowledge of the technology of the time, the modes of practice, the political framework, the governance and review processes, the nature of funding, the limitations imposed by society in general, the influence of family and social affiliations, and the characteristics of ancillary communications technologies, to name just a few.

In keeping information for the future, how do we make sure that the user has enough of this contextual information to decode what we preserve?

So what changes in a digitally networked information universe like the web?

First off – every information resource becomes a potential node in the network.  A node being a place in the network with which relationships can be established.

This means that all traditional publications, all archival records, all objects – indeed any object that currently holds information – could either exist directly in the network if it is a born-digital object, or be represented in the network as a digital version of the original (albeit a lesser version), or be represented in its Platonic form, as metadata that describes the original, including information about where the original can be found.

So in that sense it is possible to conceptualise a networked information universe that not only mimics the real world – the actuality of what happens – but, I nearly went on to say, could indeed include more information. I wonder, though, whether this is really possible, given that everything we are talking about is really a human construct, and all born-digital information objects have some sort of reality, otherwise that link to the human mind is not made.

Indeed it is at that human–information interface that we should be focussing our attention. Perhaps the issue we face now with regard to digital technologies is that much effort and faith has been placed in the underlying technologies and their development, and not in the essentially unchanging variables that determine the nature and the limits of the systems that could be developed – human beings.

Currently the web contains billions of nodes, that is, information resources that are linkable. The main agents in the web network are the people (entity 1) and the domains in which they publish (entity 2). Other aspects, such as time and place, exist but are more problematic. However, because a network like the web continually evolves, there is a sense of a continuous present, and many users have failed to realise that this means they also have a dynamic past – one that is continually in a process of becoming.

Unlike the limitation imposed by print, an information resource in the web is not limited to expressing a context that is backwards looking in time but can also review itself in terms of how it was used and viewed after its moment of publication.

Perhaps the success of the web is best expressed in a clever piece of information technology that utilises human agency and action for its success – Google. Google is extremely successful at finding useful and perhaps authoritative resources on topics. It utilises the scale-free nature of the web as a complex network to identify those resources most cited. It works on a statistical basis which is why its strength is locating the most popular resources dealing with a topic (as expressed in the language of the resource) but its weakness is that it is not very good at locating specific or particular information resources.
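A toy version of that statistical principle is shown below. This is an illustration of simple link-counting, not Google's actual ranking algorithm, and all page names are hypothetical.

```python
from collections import Counter

# Hypothetical web: each page lists the pages it links to.
links = {
    "page_a": ["page_popular", "page_b"],
    "page_b": ["page_popular"],
    "page_c": ["page_popular", "page_niche"],
}

# Rank pages by how often they are linked to: the most-cited
# resource surfaces first, while the specific one is buried.
in_links = Counter(target for targets in links.values() for target in targets)
ranking = [page for page, _ in in_links.most_common()]
print(ranking)  # 'page_popular' comes first; 'page_niche' trails
```

The popular resource dominates the ranking regardless of whether the niche resource is the one a particular user actually needs – which is precisely the weakness noted above.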

For evidence-based endeavours like history, medicine, science, the law and government – and in some industries in particular, like radioactive waste management – locating the specific, authoritative resource is not only important, it is critical.

For archives, the evidence-based foundation is fundamental because that is what we are about – preserving evidence of past actions for those in the future to use.