Although the overall project started in the last quarter of 2011 and continued into 2012, the current phases kicked off in November 2015 and are due to end at the end of April 2018. By then the team will have digitised the contents of 3,576 archival boxes as well as museum objects, posters, banners and other materials. At one time we had five different workflows running concurrently: one each for bound manuscripts, fragile manuscripts, plain paper, photographic prints, and slides and negatives. In all we will have captured close to 2 million pages, of which over 1.3 million are plain paper, over 300,000 are fragile papers and over 300,000 are pages from bound manuscripts. We have also captured almost 82,000 photographic images and over 3,000 museum objects. The team worked exceptionally hard, often operating in two shifts, and at times in three shifts around the clock, to keep to targets. They not only digitised the archive but ordered and documented it as well.
Kulu Mushaka operating Africa Media Online’s modified Atiz machine on-site at NAHECS, University of Fort Hare Alice Campus. The machine is a v-cradle capture device that enables the capture of bound manuscripts. To get a higher quality output from the machine we customised the lights and put Zeiss lenses on the two Canon 5D Mk II cameras. We also added a screen that improved operator comfort.
Before we could even start digitising the collection, we needed to order it down to the item level. The physical archive was well structured, with various subcollections and, within those subcollections, various series. The series are made up of containers (mostly archival boxes), and within the containers are usually folders. Inside each folder are items. The challenge we faced when we arrived to do the digitisation was that the archive had been ordered down to the folder level, but not to the item level. So began the massive task of itemising the entire collection, which meant we did not actually start digitising until nine months after the start of the project. That itemising process then continued in parallel with digital capture until the end of 2017. To manage this, we had to grow the team after the first nine months from six to 14 members.
Sne Mkhize and Olu Bamigboye operate our Phase One XF camera with a Phase One 100 megapixel digital back. This setup, with Broncolor lights, was first used to capture photographic prints at 600 ppi, and was then moved on to assist in the capture of fragile manuscripts.
The manuscripts workflow started with the Itemising Team, where four or five team members sat assigning a number to each and every item in the archive. Next, the boxes were shifted to the Dividing Team, who went through every folder and assigned each item to one of three workflows – the bound manuscripts workflow, where the items were captured on a v-cradle capture device; the fragile papers workflow, captured using an overhead camera; and the plain paper workflow, captured on a form-feed scanner. The boxes then moved across to the Inventory Team, who recorded on a spreadsheet, for each item, its place in the archive, the number of pages that needed to be captured, and the workflow it had been assigned to. In this way, we built up an inventory against which we could check at each subsequent stage of the process. From the Inventory Team, the boxes moved to the Capture Team and the relevant workflow, where the items assigned to that particular workflow were captured and returned to the box before the box moved on to the next workflow. Mostly the boxes were returned to their place in the archival storage rooms at NAHECS in between their capture at each workflow station. When all the items in a box had been captured, the items were “de-divided” from the three workflows and recompiled in the right order in their folders and within the box, and the box was returned to the archival store.
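To make the inventory step concrete, here is a minimal sketch of what one row of such a spreadsheet might look like as a data record, and how page counts could be tallied per workflow. The field names, workflow labels and example paths are assumptions made for illustration; the article names the three workflows and the three recorded facts, but not how they were encoded.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical labels for the three manuscript workflows described above.
WORKFLOWS = ("bound", "fragile", "plain")


@dataclass(frozen=True)
class InventoryRow:
    archive_path: str  # place in the archive (assumed path-like notation)
    pages: int         # number of pages to be captured
    workflow: str      # one of WORKFLOWS

    def __post_init__(self):
        # Reject rows that could not have come from the Dividing Team.
        if self.workflow not in WORKFLOWS:
            raise ValueError(f"unknown workflow: {self.workflow!r}")
        if self.pages < 0:
            raise ValueError("page count cannot be negative")


def pages_per_workflow(rows):
    """Sum the expected page count for each workflow."""
    totals = Counter()
    for row in rows:
        totals[row.workflow] += row.pages
    return dict(totals)


# Made-up example rows, not real inventory data.
example_rows = [
    InventoryRow("SubA/Series1/Box001/Folder01/0001", 12, "plain"),
    InventoryRow("SubA/Series1/Box001/Folder01/0002", 3, "fragile"),
    InventoryRow("SubA/Series1/Box001/Folder02/0001", 40, "bound"),
    InventoryRow("SubA/Series1/Box001/Folder02/0002", 5, "plain"),
]
example_totals = pages_per_workflow(example_rows)
print(example_totals)
```

Totals like these give each capture station a target to check its output against before a box moves on.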
Nkanyiso Ngcobo and Phelelani Ntsikithi cleaning the Scamax machine. The belt-driven form-feed scanner is very gentle on paper and so is ideal when used in an archive with mixed manuscripts like the ANC Archive. Operated two shifts a day, this scanner captured over 1.3 million pages in about a year and a half.
To ensure that the digital archive reflected the physical archive in its structure, in Phases 1, 2 and 3 back in 2011 and 2012 we had developed a system of digital folders that could represent the arrangement of the physical archive. So when pages were captured using one of the capture devices, they were saved into this folder structure on an external hard drive. These hard drives were then sent up to our head office in Pietermaritzburg (we ended up with close to 70 hard drives of 1, 2, 3 or 4 TB in size that were rotated back and forth between Alice and Pietermaritzburg). There the Processing Team went to work ensuring that each and every digital file was up to standard and was cropped and colour corrected in line with the colour targets that were captured with each batch. Files that were rejected were recorded and the information sent back to the Digitisation Team for recapture. Maintaining the same folder structure, these processed files were then saved out from RAW to TIFF format, at which point the Quality Control Team checked each file and checked the folder path of each file against the inventory of each collection that had been compiled by the Inventory Team.
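The check of folder paths against the inventory can be sketched as follows. This is an illustrative sketch only, assuming that each item has its own folder, that every page is one file inside that folder, and that the inventory has been reduced to a mapping from item folder to expected page count. The function name and the example paths are hypothetical and do not describe the tooling actually used.

```python
from pathlib import PurePosixPath


def check_against_inventory(file_paths, expected_pages):
    """Compare captured files with the inventory.

    file_paths: iterable of file paths, one per captured page.
    expected_pages: dict mapping item folder -> expected page count.
    Returns (mismatched, unexpected), where mismatched maps an item folder
    to (expected, found) and unexpected lists folders not in the inventory.
    """
    found = {}
    for path in file_paths:
        item_folder = str(PurePosixPath(path).parent)
        found[item_folder] = found.get(item_folder, 0) + 1

    mismatched = {
        item: (expected, found.get(item, 0))
        for item, expected in expected_pages.items()
        if found.get(item, 0) != expected
    }
    unexpected = sorted(set(found) - set(expected_pages))
    return mismatched, unexpected


# Made-up example data.
files = [
    "SubA/Series1/Box001/Folder01/0001/page001.tif",
    "SubA/Series1/Box001/Folder01/0001/page002.tif",
    "SubA/Series1/Box001/Folder01/0002/page001.tif",
    "SubA/Series1/Box001/Folder09/0001/page001.tif",
]
inventory = {
    "SubA/Series1/Box001/Folder01/0001": 2,
    "SubA/Series1/Box001/Folder01/0002": 3,
}
mismatched, unexpected = check_against_inventory(files, inventory)
print(mismatched)  # items whose page count differs from the inventory
print(unexpected)  # folders that the inventory does not know about
```

Reporting both kinds of discrepancy separately distinguishes pages that still need recapture from files that were saved into the wrong place.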
We used a Nikon D800 with a sharp macro lens to capture negatives and positives, which produced higher quality reproductions than the high-end scanners we used in the past. We also employed a special workflow to deal with the colour cast on colour negative film.
Currently, the Quality Control Team is working long hours, sub-collection by sub-collection, to get the entire collection checked, compiled and submitted to the MEMAT Digital Vault. From there the files pass to the domain of the IT Team, who ingest them into the Digital Vault. As part of that, the files are processed to a JPEG 2000 format that meets specific archival standards for long-term preservation. Once each page of a manuscript is ingested into the Vault, it is run through an Optical Character Recognition (OCR) engine to make it searchable, and all pages are gathered into a PDF/A. Then the manuscript is made available on the web interface of the ANC Archives Research Website.
Africa Media Online photographer Scott Cronwright, assisted by Lunga Poni (NAHECS) and Sphelele Ntsikithi (AMO), captures oversized flat materials including posters and banners. The Phase One XF camera is suspended on the digitisation rig we designed and built. The height of the camera is set to give a capture resolution of 150 ppi, the correct setting to be Metamorfoze compliant for material of this size.
The work of aligning the folder path of each digital file with the structure of the physical archive, and with the record of that structure in the inventory spreadsheet, has also been supplemented by the team going beyond the call of duty to update the ANC Archive finding aids, so that all four match each other.
At the end of 2017, as part of our Christmas celebration, the team from NAHECS, the Africa Media Online team and representatives from the ANC and the funder had the privilege of engaging in a team building exercise in the Hogsback. Here Zalisile Victor Cakucaku, the NAHECS employee who faithfully provided security for the project, abseils off a tower watched by an adventure guide. Bra Z, as he was affectionately called, tragically passed away from an infection in his leg in October 2017. He is sorely missed.
The one aspect of the project that will be continuing in the coming months is the capturing of metadata, particularly against photographic images. That involves both a local team capturing information off the back of photographs and an experienced remote team that takes that information and fills out various metadata fields.
The completed material is already in the process of being ingested into the Digital Vault, passing through the OCR process and starting to show online. One step of that process takes 5 seconds per file and is the current bottleneck. The amount of data is so great that, at that rate and running 24 hours a day non-stop, it will take between 4 and 5 months for all the material to appear online!
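The arithmetic behind that estimate can be checked with a short calculation. The file count used here is an assumption: it simply adds the rounded figures quoted earlier (close to 2 million pages, almost 82,000 photographic images and over 3,000 museum objects) and treats each as one file. The true total is not stated and would also include posters, banners and other materials.

```python
SECONDS_PER_FILE = 5        # the bottleneck step described above
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30.44      # average calendar month

# Assumed total, built from the rounded figures quoted in this article.
approx_files = 2_000_000 + 82_000 + 3_000

total_seconds = approx_files * SECONDS_PER_FILE
days = total_seconds / SECONDS_PER_DAY
months = days / DAYS_PER_MONTH

print(f"{approx_files:,} files -> {days:.0f} days, about {months:.1f} months")
# prints: 2,085,000 files -> 121 days, about 4.0 months
```

With these rounded inputs the result sits at the lower end of the 4 to 5 month range; any additional files or interruptions push it higher.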
Main Picture: Early in the project, Phelelani Ntsikithi, Faith Marango and Sphelele Ntsikithi work on dividing manuscripts into the three workflows.