Archivismi: the day after the upload

This post was last updated 6 months ago.


The articles of Cassandra Crossing are licensed under CC BY-SA 4.0 | Cassandra Crossing is a column created in 2005 by Marco Calamari under the "nom de plume" of Cassandra.

A new episode of Archivismi, continuing yesterday's article!

This article was written on December 27, 2023 by Cassandra

Cassandra Crossing 561/ Archivismi: the day after the upload

Yesterday we made our first upload and saw the results. But has anything changed today?

In the last episode Cassandra tried to explain part of how the Internet Archive works. We only scratched the surface of its features and, to avoid boring you, we tried archiving the .pdf file of one of Cassandra's articles and described what happened.

We thus realized that we had set in motion a process as complex as it was slow, but fortunately completely automatic. So slow that after more than half an hour it still hadn't finished. Returning to the document page today, we find that the Internet Archive's object browser is active and the process has completed.

You can quickly flip through the pages, have a very robotic voice read them aloud, and select portions of text on any page. These seem like small things, considering that the source was a "modern" PDF exported directly from a LibreOffice document, but in fact the apparently "simple" PDF was broken down into a number of files, some of which we have not yet analyzed.

Even just from the names, we can easily tell that some optical character recognition (OCR) process was run automatically. These files, some of which are used by the Internet Archive's object browser, are what allow it to display the document.
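For readers who want to look at these files themselves, here is a minimal sketch in Python. It assumes the Internet Archive's public metadata endpoint (`https://archive.org/metadata/<identifier>`) and assumes that each entry of its `files` list carries a `name` and a `source` field distinguishing uploaded files from derived ones. The sample record and its file names are hypothetical, used only so that the sketch runs offline.

```python
import json
from collections import defaultdict
from urllib.request import urlopen


def fetch_item_metadata(identifier: str) -> dict:
    """Download the public metadata record of an item (needs network access)."""
    with urlopen(f"https://archive.org/metadata/{identifier}") as response:
        return json.load(response)


def group_files_by_source(metadata: dict) -> dict:
    """Group file names by their 'source' field (assumed: 'original' or 'derivative')."""
    groups = defaultdict(list)
    for entry in metadata.get("files", []):
        groups[entry.get("source", "unknown")].append(entry["name"])
    return dict(groups)


# Hypothetical record shaped like a metadata response; the names are illustrative.
sample = {
    "files": [
        {"name": "article.pdf", "source": "original"},
        {"name": "article_djvu.txt", "source": "derivative"},
        {"name": "article_hocr_searchtext.txt.gz", "source": "derivative"},
    ]
}

grouped = group_files_by_source(sample)
for source, names in grouped.items():
    print(source, names)
```

With a real item, `group_files_by_source(fetch_item_metadata("<identifier>"))` would give the same kind of overview, with the identifier of your own upload in place of the placeholder.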

At this point some of the 24 well-informed readers will blurt out “But all this is absolutely trivial, it could also be done with Acrobat Reader, without all this fuss.” The dear reader is right on the specific point, but wrong on the more general question. That is because, by archiving a modern 3-page PDF, we actually used a cannon to kill a mosquito, and a frail and sick one at that.

Now it's time to try to unleash the full archival power of the Internet Archive. To this end, Cassandra took advantage of an archiving job awaiting her alter ego Marco Calamari: archiving a hundred issues of a small magazine, published over the last 30 years and exclusively on paper.

The .pdf files generated by the various electronic layout programs used to create the magazine had already been collected, having fortunately been preserved as a by-product. The first paper issues had also been scanned, by hand and in various ways, again into PDF format, but obviously not searchable, since the pages were "photos".

All this material, even though already in digital format, would have taken a very long time to assemble, align and publish in a searchable and reusable form, particularly in "serious" archiving contexts.

In fact, the real, big problem was not creating a collection of PDF files, but archiving it in a useful, searchable and consultable way. Otherwise, as often happens, these files, although laboriously collected, would sooner or later end up forgotten on a flash drive at the bottom of a drawer, or in some ephemeral corner of the commercial cloud where no one (except GAFAM) would ever be able to find and use them.

Instead, all it took was to combine the 75 files, of various formats and contents, into a single PDF using the very useful free software Pdftk, thus creating a single PDF of almost 1 terabyte, and to upload it to the Internet Archive, exactly as we did for the 3-page article. This file too was taken in by the system and "shredded" throughout the night; this morning it was already available.
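The two steps just described can be sketched in Python as plain command lines. This is only an illustration, not the procedure Cassandra actually followed: it assumes that `pdftk` is installed, that the `ia` command-line tool from the `internetarchive` package is installed and configured with your account, and that the input files sort correctly by name. The file names and the item identifier are hypothetical.

```python
import subprocess


def build_merge_command(inputs: list[str], output: str) -> list[str]:
    """Build the pdftk call that concatenates the input PDFs, sorted by name."""
    return ["pdftk", *sorted(inputs), "cat", "output", output]


def build_upload_command(identifier: str, pdf_path: str) -> list[str]:
    """Build the ia call that uploads one file into an item of media type 'texts'."""
    return ["ia", "upload", identifier, pdf_path, "--metadata=mediatype:texts"]


def run_all(commands: list[list[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


# Hypothetical file names and identifier, for illustration only.
merge = build_merge_command(["issue_002.pdf", "issue_001.pdf"], "collection.pdf")
upload = build_upload_command("my-magazine-collection", "collection.pdf")
print(" ".join(merge))
print(" ".join(upload))

# Uncomment to actually merge and upload (requires pdftk, ia and the input files):
# run_all([merge, upload])
```

Building the commands as lists and printing them first makes it easy to check the order of the issues before anything is merged or uploaded.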

All anomalies and differences had been resolved automatically, and a 662-page document containing the entire magazine collection was available: quickly browsable, selectable, searchable and listenable, all for an effort of just a few minutes.

If we add to this the fact that the document has been archived redundantly in multiple data centers, and sits in a digital library that makes it available to anyone, freely searchable and viewable, the result becomes almost astonishing, even before adding that it is also available in ebook format (.epub) and can, if necessary, be further "worked" for other purposes.

Just to describe in general terms what was produced during archiving: the original PDF was split into pages, first of all to speed up viewing. Each page is a PDF file in a particular format, made of a background image, which is the scan of the original page, plus a selectable text layer overlaid on the page and generated by running OCR on the scan itself.
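To make the idea of a layered page concrete, here is a small hypothetical model in Python. It is not the Internet Archive's actual format: the class names, fields and sample words are invented for illustration, and the sketch only shows why a text layer placed over a scan makes the page selectable and searchable.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    """One OCR word and the box it occupies on the scanned page."""
    text: str
    x: float
    y: float
    width: float
    height: float


@dataclass
class LayeredPage:
    """A page made of a background scan plus an invisible text layer."""
    number: int
    image_file: str
    words: list[Word] = field(default_factory=list)

    def plain_text(self) -> str:
        """Join the OCR words in reading order to get the selectable text."""
        return " ".join(word.text for word in self.words)


def search(pages: list[LayeredPage], term: str) -> list[int]:
    """Return the numbers of the pages whose text layer contains the term."""
    needle = term.lower()
    return [page.number for page in pages if needle in page.plain_text().lower()]


# Hypothetical two-page document.
pages = [
    LayeredPage(1, "page_0001.jp2", [Word("Cassandra", 10, 10, 80, 12),
                                     Word("Crossing", 95, 10, 70, 12)]),
    LayeredPage(2, "page_0002.jp2", [Word("Internet", 10, 10, 60, 12),
                                     Word("Archive", 75, 10, 55, 12)]),
]
print(search(pages, "archive"))
```

The coordinates are what let a viewer highlight a search hit on top of the scan, while the joined text is what gets searched, copied or read aloud.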

What's truly remarkable is that the system was able to correctly handle a mix of PDF files with different internal structures, from simple scans to structured PDFs, and reduce them all to a common denominator: the layered PDFs of the individual pages.

Well, if all this doesn't seem like much to you, then this series of articles isn't for you; it is meant instead for the future digital librarians who, by chance or luck, have ended up on these pages. But you might still change your mind.

Stay tuned for the next episode of “Archivismi”.

Marco Calamari

Write to Cassandra — Twitter — Mastodon
Video column “A chat with Cassandra”
Cassandra's Slog (Static Blog).
Cassandra's archive: school, training and thought
