Archivismi: let's archive Cassandra, part one



This post was last updated 6 months ago.

This is a text automatically translated from Italian. If you appreciate our work and like reading it in your language, consider a donation to allow us to continue doing it and improving it.

The articles of Cassandra Crossing are published under the CC BY-SA 4.0 license | Cassandra Crossing is a column created by Marco Calamari under the "nom de plume" of Cassandra, born in 2005.

We are nearing the conclusion of "Archivismi": let's archive Cassandra

This article was written on December 31, 2023 by Cassandra

Cassandra Crossing 564/ Archivismi: let's archive Cassandra, part one

Today we flip to the other side of the coin: no technicalities; let's tell a true story.

In the last three episodes we worked on the Internet Archive, but only with simple examples.

However, archiving often means archiving a quantity of different materials, with a final purpose. In these cases no simple example suffices; the devil is always in the details, and the most useful information is learned by listening to real stories and experiences.

So today Cassandra will tell you a true story, still unfinished, and will talk only about details that do not directly concern the Internet Archive, but rather the preliminary stages of a generic archiving campaign, in which the longest job is finding, collecting and above all preparing the material for the actual archiving.

And what better story than that of the Cassandra Crossing archiving campaign itself? Yes, Cassandra had been putting aside pieces destined to be archived for some time. But let's go in order.

The origins of Cassandra Crossing date back to 2003, while regular (well, almost regular...) publication began in 2005 on Punto Informatico. It then continued in other publications such as Zeusnews.it, sometimes in parallel, and also extended to paper and video.

The materials available were of the most varied types: text files with and without accents, word processor files of different kinds, PDF files and so on and so forth. Many files, obviously, have simply been lost.

So it was that, several years ago, Cassandra looked for a way to recover, homogenize and centralize the entire Cassandra corpus.

As in all things, it is better not to throw yourself headlong into a job, but to think, plan, do, and then look for an even better way. After several attempts, Cassandra tried Medium.com, a social network specialized for writers and aspiring writers. In addition to providing a single place in which to write, with a decent online editor, and to store articles, Medium.com has an excellent feature for importing text from any site, even from pages full of advertisements or assorted effects.

It also offers a user data export function, which saves individual articles in HTML format.

This is how Cassandra centralized the archive on Medium.com, not without having dedicated a lot of time to finding, with search engines, the links to old articles never archived locally, or in any case lost.

But the solution was not satisfactory for various reasons, starting with the fact that the articles were in a cloud, and worse still in what was essentially a social network, with all the harmful aspects that Cassandra hates and often tells you about.

And so Cassandra decided to start archiving Cassandra Crossing on the Internet Archive. And since the starting point was a complete archive in a homogeneous format, it seemed like it should be a walk in the park. "Huge mistake", as Jack Slater says.

In fact, the necessary homogeneity is not just a question of format, but above all of the internal structure and homogeneity of the information stored in the article files.

Let's start with the simplest thing: file names. Obviously Medium.com follows its own philosophy, and builds the name from the publication date (not the original one, but the date on Medium.com), adding a hexadecimal identifier and a slug derived from the title.

Something like

2023-12-29_Cassandra-Crossing--Archivismi--the-organization-of-documents-in-Internet-Archive-e83b9e3b9cca.html

Now, it is true that files can also be renamed by hand, but this is a daunting job when there are hundreds or thousands of files. Automating becomes essential. Fortunately, Linux offers powerful scripting languages and utilities that work miracles.

You can therefore rename files quite easily by removing, adding and rearranging information. Paradoxically, the most difficult thing was to automatically insert the article number at the beginning of the file name.

Luckily Cassandra, who is sometimes methodical, had the habit of writing the article number at the beginning of the subtitle, in round brackets. With a little regular-expression alchemy it was thus possible to extract it automatically and use it to build a more "human" file name, such as

562_Cassandra-Crossing--Archivismi--the-organization-of-documents-in-Internet-Archive.html
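
In concrete terms, the heart of that renaming step can be sketched like this (a minimal sketch: the file name follows the example above and the file is assumed to contain the article number in round brackets; the full script is at the end of the article):

# Minimal sketch of the renaming step; the file name is hypothetical.
f="2023-12-29_Cassandra-Crossing--Archivismi--the-organization-of-documents-in-Internet-Archive-e83b9e3b9cca.html"
# Take the first "(NNN)" occurrence in the file and strip the brackets:
n=$(grep -Eo -m 1 '\([0-9]+\)' "$f" | tr -d '()')
# Left-pad the number to three digits (e.g. "42" becomes "042"):
n=$(printf '%03d' "$n")
# Drop the leading date and the trailing hexadecimal identifier:
slug=$(echo "$f" | cut -d '_' -f2- | rev | cut -d '-' -f2- | rev)
mv "$f" "${n}_${slug}.html"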

Then it was necessary to process the files, clean them up and convert them into archivable formats.

The first necessary step was to clean the HTML files of a huge quantity of hidden tags, totally useless for defining the text but necessary to guarantee the functionality of the Medium.com site. In fact, like all social networks, Medium.com implements its export functions at the bare minimum required by the (ever-praised) GDPR, and therefore produces data that is complete, yes, but not suitable for easy reuse.

The best solution Cassandra found was to convert the HTML to markdown format, filter out the lines that did not contain useful information, and convert it back to HTML. This little miracle was possible thanks to the Pandoc document converter, aided by normal Unix utilities such as grep.
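
On a single file, the round trip can be sketched as follows (file names like article.html are placeholders; the ":::" lines are the fenced divs that Pandoc produces in markdown for Medium's wrapper <div> tags):

# Sketch of the cleanup round trip on one file (names are placeholders).
pandoc -f html -t markdown article.html > article.md
# Drop Pandoc's ":::" fenced divs (Medium's wrapper <div>s) and the
# "{#...}" attribute blocks attached to headings:
grep -v '^:::' article.md | sed -e 's|{#.*}||g' > article-clean.md
pandoc -f markdown -t html article-clean.md > article-clean.html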

Now that the files have been cleaned up and given a human name, there is still the problem of the images included in the files. In fact, the images are not exported with the other data, and their URLs all point to the Medium.com servers, which therefore, despite all the work done, still hold an important part of the articles.

It is therefore necessary to convert the remote images into inline images within the HTML itself, encoding them in base64. This process, conceptually simple, would usually have to be done by hand for each individual file and URL; luckily there is a way to do it automatically, via the --self-contained option added to the Pandoc HTML rewrite command.
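
For a single file the command looks like this (placeholder names again; a caveat for anyone reusing this today: recent Pandoc releases, 2.19 and later, deprecate --self-contained in favor of the equivalent pair --embed-resources --standalone):

# Fetch remote images and embed them as base64 data: URIs in the HTML.
pandoc --self-contained -f markdown -t html article-clean.md > article.html
# On Pandoc 2.19 and later the same thing is spelled:
pandoc --embed-resources --standalone -f markdown -t html article-clean.md > article.html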

For archiving, the main format chosen is PDF, which does not have this problem: when converting HTML to PDF, the images are inserted directly into the file.

In order not to miss anything, again thanks to the miracles of Pandoc, Cassandra was able to convert all the formats already produced, the starting HTML, the markdown and the simplified HTML, into PDF in a very simple way, and then choose the best one.
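
A single conversion can be sketched like this (placeholder file names; one caveat: Pandoc refuses to write binary PDF to standard output, so the output file must be given with -o):

# Convert the cleaned markdown to PDF via XeLaTeX; -o is mandatory for PDF.
pandoc --pdf-engine=xelatex article-clean.md -o article.pdf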

For now, you can find the result here.

In conclusion, a couple of "full" days of work led to this 39-line bash script, certainly not optimal or error-free, which we will comment on here anyway, just to give you an idea. Understanding it in broad terms is enough; but if you need it, reusing it could be a great time saver.

#!/bin/bash
# Procedure for preparing to archive articles
# by Cassandra Crossing
#
# various initializations
_base="./tuttocassandra_processing/"
_base2="./posts/"
_base3="./markdown/"
_base4="./temp/"
_base5="./html/"
_base6="./pdf/"
_temp="temp.txt"
#
# changing working directory, creating directories and cleaning files
cd "${_base}"
mkdir -p markdown html temp pdf
rm -f ./markdown/* ./html/* ./temp/* ./pdf/*
cd "${_base2}"
rm -f "${_temp}"
_dfiles="*"
#
# main loop start
for f in $_dfiles
do
rm -f "${_temp}"
#
# extraction of the article number: first "(NNN)" in the file, brackets
# stripped, left-padded with zeros and trimmed to three digits
g=`grep -Eo -m 1 '\([0-9]+\)' $f | tr -d '()'`
g="000"$g
g=`echo $g | rev | cut -c 1-3 | rev`
# drop the leading date and the trailing hexadecimal identifier
h=`echo $f | cut -d '_' -f2- | rev | cut -d '-' -f2- | rev`
#
# formation of new file name and copy with new name
i=$g"_"$h
echo "---> Identifier: $i"
cp $f "../$_base4${i}.html"
#
# conversion to markdown format, cleanup and conversion back to html
pandoc -f html -t markdown "../"$_base4$i".html" > "${_temp}"
grep -v "^:::" "${_temp}" | sed -e 's|{#.*}||g' > "../"${_base3}$i".md"
pandoc --self-contained -f markdown -t html "../"${_base3}$i".md" > "../"${_base5}$i".html"
# pandoc writes PDF only to a file passed with -o, not to standard output
pandoc --pdf-engine=xelatex -f markdown "../"$_base3$i".md" -o "../"${_base6}$i".pdf"
#
# cleaning and end of cycle
done
rm -rf "${_temp}" "../$_base4"

(A warning if you copy this procedure directly from Medium.com: replace the curly double quotes with straight ones, the curly single quotes with straight ones, and the long dash with two normal minus signs. Medium.com doesn't allow you to write as you want...)

And that's all for today too. Stay tuned for the next episode of "Archivismi".

Marco Calamari

Write to Cassandra — Twitter — Mastodon
Video column “A chat with Cassandra”
Cassandra's Slog (Static Blog).
Cassandra's archive: school, training and thought

This tag @lealternative is used to automatically send this post to Feddit and allow anyone on the fediverse to comment on it.





