During my undergraduate degree, I tended to view PDFs as a necessary evil. Like most of us, I loved printed books even for their most superficial and sentimental trappings and I was in no way willing to give them up for the dubious convenience of a PDF. Moreover, the PDFs circulated for class ranged widely in quality from illegible scans of damaged books to digital type directly issued from the publisher. This disparity led me to use them largely as a means to an end – useful for filling in the gaps in my library, but in no way capable of replacing it. My attitude has changed significantly for two reasons which I will discuss below.
Politics of Digitization
While the digitization of the public record has been underway for quite some time, it remains one of the more divisive issues because the formatting of the digital text has the power to create and eliminate levels of accessibility and functionality. This gives rise to a myriad of approaches and redundant labor even within disciplines where the general standards of fidelity to the original text are more or less clear.
Hosting all of our texts on the web in HTML format seems, at first, the most logical choice. But when we consider how susceptible HTML versions of scholarly editions are to copyright law, we can see how semi-private digital editions like PDFs do, in fact, have a considerable strategic value. As I have argued, our mnemotechnology is most likely to advance if it does not deviate too rapidly and drastically from the technology of the book. By preserving the facsimile of the printed page, PDFs do not presume to “revolutionize” the book nor do they restrict us to a methodology or workflow tacitly encoded in file types of more limited compatibility. They are clunkier than other digital formats, but this remains a valuable protective mechanism that has not yet been rendered vestigial in the slow evolution of text.
While many academic publishers offer digital editions in forms other than PDF (e.g., EPUB, MOBI, AZW), we should not be so quick to view these as legitimate alternatives. Not only do many of them have digital rights management (DRM) restrictions that would interfere with our ability to extract and share larger selections of text, they also tend to compromise (if not butcher) the typography and aesthetic of the page. Many seem to have been converted into digital text in a rather desultory manner by some third party with little regard for the layout choices made by the publisher. Their ISBNs are sometimes missing or inaccurate and the page divisions which, for now at least, remain the fundamental unit of academic citation, have often been done away with entirely. Even if we eventually start cataloging and citing digital texts in some other way, this lack of compatibility with the current academic methodology does little to facilitate the shift.
The poor quality of digital editions, and the low opinion in which they are held, benefits the academic publishers who still depend, to a great extent, on the sale of individual copies of works. It’s important to realize, however, that this precarious situation cannot be sustained for much longer; the rather sad attempts to come up with a viable alternative to the facsimile PDF should be seen as a last-ditch effort to cling to a univocal copyright (often at the expense of the individuals and institutions whose intellectual property is nominally protected). While, visually speaking, facsimile PDFs appear faithful to this model of the book, they are, in fact, quite treacherous; the full-resolution page images of which they are composed can always, eventually, be ripped from the DRM structures that contain them, making it more and more difficult to price publications based on the quality of the text itself.
The other reason for my shift to a PDF-based library is far more pragmatic. I really had no idea just how many high-quality, full-text PDFs were available on illicit digital libraries hosted on servers abroad. These were so expansive that, upon discovering them, I found myself binge-downloading books for hours (sometimes days) on end. I knew that what I was doing was in some way impacting humanities publishing houses and, thus, myself. But, as a typically destitute English graduate, I couldn’t resist the temptation to get while the getting was good. I was ever wary that someone would blow the whistle and stanch this seemingly endless flow of free text.
I remember that when one of the earliest and largest libraries was taken down, a blogger likened it to the burning of the library of Alexandria – a statement we should not immediately dismiss as overblown, especially when we consider how revolutionary the digital liberation of the entire print archive would be. There really is something rather epic (and epoch-defining) about the battle between pirates and publishers. Were it not for all of the priceless, original manuscripts purportedly enveloped by the blaze at Alexandria, the loss of such a massive and free digital library would have to be regarded as greater in magnitude (at least as far as the sheer quantity of information is concerned).
Although numerous sites have risen and fallen over the years under legal pressure from the academic presses, the overall breadth of the free digital archive has only continued to grow. It shows the same tenacity as the popular torrent site, The Pirate Bay, which, interestingly, has contributed to its own modern-day mythopoiesis by adding the figure of the hydra to its original pirate ship insignia. Each of the beast’s many heads stands for a mirrored server in a different country outside the jurisdiction of Western copyright law, emblematizing the reality that, even as some instances of the free digital archive are inevitably cut down, many more will rise to replace them. I will not name any of these directly here, but suffice it to say that there are at least a few people in every department capable of pointing curious readers in the right direction.
In a matter of months, I had amassed over one thousand digital texts and was beginning to struggle to manage them. It was not as if I had PDFs strewn haphazardly across my hard drive either. The directory of subfolders I created to organize them was robust – too robust, in fact. The various branching file paths were, in some cases, long enough to overload the processing power of my computer (especially during large-scale backup and file transfer operations). One might ask why, with the availability of indexed searching, I would even bother to create such a folder tree. Why not just place all texts in the same folder and search for them by name? As easy as it might be to recall a text by its title or the name of its author, this requires that these names be readily accessible within our biological memory. Searching by name would work well enough for all of the texts that had formerly resided on my physical bookshelf – the texts I had consulted so frequently that their very position on this shelf had its own mnemonic value – but when my library grew several times larger than what could reasonably be shelved on my mnemonic bookshelf, this manner of searching became far less practical. I needed to keep track of texts I had considered reading for only a few moments while skimming and downloading others.
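For readers facing the same problem of overgrown file paths, the following is a minimal sketch of one way to find the offending files before a backup or transfer. It assumes that the trouble comes from a path-length limit, and it uses 260 characters (the traditional Windows MAX_PATH value) as a default threshold; both the threshold and the function name `find_long_paths` are illustrative assumptions rather than a description of my own setup.

```python
import os

# Assumption: 260 characters is the traditional Windows MAX_PATH limit.
# Other operating systems and backup tools enforce different limits.
MAX_PATH = 260


def find_long_paths(root, limit=MAX_PATH):
    """Return (length, path) pairs for files whose absolute path is at least `limit` characters long."""
    too_long = []
    for folder, _subfolders, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.abspath(os.path.join(folder, name))
            if len(full_path) >= limit:
                too_long.append((len(full_path), full_path))
    # Longest paths first, so the worst offenders are easy to spot.
    return sorted(too_long, reverse=True)
```

Calling `find_long_paths` on the top folder of a library returns the longest paths first, which makes it easier to decide which subfolders to rename or flatten.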
The folder tree I kept on my hard drive, while far from ideal, was an attempt to deal with the dramatic expansion of my library. It was divided into three major branches: ‘Literature,’ ‘Theory-Philosophy,’ and ‘Assorted.’ This last category is of particular interest because it marks the failure of the two more dominant categories and archives the traces of an organizational problem that I will return to later in the discussion of the personal knowledgebase. It was in this ‘Assorted’ folder that I found it useful to group texts according to the more contingent, thematic interests of my classes and research projects (rather than by the names of the authors as I had in the other two categories). But this meant that I could either keep duplicate copies of texts in the author folder and in the assorted folder or decide which of these locations was most relevant. I eventually compromised by placing shortcuts to files in the assorted folder and keeping the original texts in the author folders, but not before I realized how the locations of texts in this system could be neither restricted nor duplicated without introducing structural tensions – how, on the one hand, populating these subfolders inconsistently with shortcuts would never fully preserve the associative pathways and thematic designations that helped me remember them and, on the other, discarding the shortcuts would erase all of the pathways they did store, however imperfectly.
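The compromise described above (originals in the author folders, shortcuts in the thematic folders) can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses symbolic links, which are not the same thing as Windows shortcut files and may require extra permissions on some systems, and the function names `add_shortcut` and `themes_for` are hypothetical.

```python
import os


def add_shortcut(original, theme_folder):
    """Place a symbolic link to `original` inside `theme_folder` and return the link's path."""
    os.makedirs(theme_folder, exist_ok=True)
    link_path = os.path.join(theme_folder, os.path.basename(original))
    # Leave an existing link (or file) of the same name untouched.
    if not os.path.lexists(link_path):
        os.symlink(os.path.abspath(original), link_path)
    return link_path


def themes_for(original, assorted_root):
    """List the thematic folders under `assorted_root` that hold a link to `original`."""
    target = os.path.realpath(original)
    themes = []
    for folder, _subfolders, filenames in os.walk(assorted_root):
        for name in filenames:
            candidate = os.path.join(folder, name)
            if os.path.islink(candidate) and os.path.realpath(candidate) == target:
                themes.append(os.path.relpath(folder, assorted_root))
    return sorted(themes)
```

The second function recovers, for any one text, the thematic designations stored as links. It can only return the associations that were actually recorded, which is the limitation noted above.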
As I accumulated the oeuvres of nearly every author of major and minor importance to me and my Amazon wish list was halved and halved again, I began to entertain the possibility of abandoning the printed book entirely. Being something of an absolutist, I didn’t want half of my library vaulting into the 21st century with the other half lagging behind in the Gutenberg era. At the time, I was facing the brutal impracticality of lugging my entire physical library across the country for graduate school. While I wasn’t quite sold on the idea of reading everything on a screen, after considering how much time and effort I already spent transcribing citations from printed texts and how much time I might save by copying and pasting them, I decided just to go for it.
A large part of this decision was based on the realization that roughly 80% of the texts that I owned or wanted to own were available as high-resolution PDFs of small file size from the digital libraries I mentioned above. So it was only the small fraction of my library for which I couldn’t find a decent, preexisting PDF copy that would need to be converted manually. How hard could it be to make a decent PDF? Wouldn’t the time I spent digitizing print books be paltry when compared with the time saved by copying and pasting quotations?
The whole process turns out to be remarkably simple if not particularly cheap. We must first be willing to spend several hundred dollars on a high-speed, sheet-fed scanner, which entails the willingness to cut some of our beloved books to pieces. (I imagine that for many of us the affective “cost” of the latter far outweighs the monetary cost of the former.) The Fujitsu ScanSnap sheet-fed document scanner is capable of scanning books at ~50 pages/minute once their binding has been removed with a paper slicer (provided that the pages themselves are not problematically thin, warped or glued). Once the book is scanned, optical character recognition (OCR) enables us to convert any high-quality scan of any conventionally formatted book into digital text in a matter of minutes with more than 95% accuracy. Any errors or artifacts introduced in the scanning process can then be corrected in Acrobat. This means that, with the necessary equipment, it’s possible for anyone with a modicum of experience to generate a professional-grade PDF of a full-length book in anywhere from 20 minutes to an hour.
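To put the figures above in perspective, here is a back-of-the-envelope sketch of the arithmetic. The scanning rate of 50 pages per minute and the 95% accuracy floor come from the paragraph above; the 2,000 characters per page and the reading of "accuracy" as a per-character rate are assumptions made purely for illustration, as are the function names.

```python
def scanning_minutes(pages, pages_per_minute=50):
    """Time spent feeding pages through the scanner, in minutes."""
    return pages / pages_per_minute


def max_characters_to_correct(pages, chars_per_page=2000, accuracy=0.95):
    """Upper bound on misrecognized characters, assuming accuracy is measured per character.

    chars_per_page is a hypothetical average, not a measured value.
    """
    return round(pages * chars_per_page * (1 - accuracy))


# A hypothetical 300-page book:
print(scanning_minutes(300))           # 6.0
print(max_characters_to_correct(300))  # 30000
```

On these assumptions the scanning itself accounts for only a few minutes of the total; the remainder of the 20 minutes to an hour goes to removing the binding, running OCR, and correcting errors.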
I’ve come to think of digitization as a kind of mummification: after the book is gutted – its vital organs, excised and scanned, their digital immortality, assured – the pages are reinserted into the outer cover. The process is almost undetectable until some unknowing bibliophile plucks a book from the shelf only to have its pages fall out and scatter across the floor which, for better or worse, hasn’t really been much of a problem. (I’ve even begun plastic-wrapping stacks of digitized books I don’t want to display in order to prevent them from falling apart.) Another thing worth noting is that it often is not even necessary to purchase new copies at retail value because of the abundance of decommissioned library copies available for a fraction of the price on Amazon. While I am not without qualms about the violence of the digitization process, I think that the very existence of decommissioned library books which, so far as I can tell, were never even read is a more tragic reality than the need to dismember them in order to expedite the digitization process. At least their digital spirit actually gets read. It lives on in an infinitely reproducible, intrinsically shareable form. Despite the mummification, the digital copy is really less mum than ever before.
In retrospect, I can confidently say that the benefits of digitizing my library have more than outweighed the difficulties of learning how to digitize texts. The time saved by keyword searching and extracting citations allows me greater depth of coverage and annotation of each individual text. I have attempted, in the supplemental videos, to capture the nuances of this entire process from start to finish in order to reduce the learning curve for anyone interested in migrating from print to digital text. While it might seem rather mundane, I believe that our lack of awareness about the relative ease of this digitization process is one of the greatest impediments to actualizing some of the utopian visions of collective scholarship in the humanities that still remain ‘theoretical’ almost a century after they were first articulated. These tutorials should, thus, be seen as a practical and political intervention in the mnemotechnical infrastructure that prevails in many of our institutions.