From Audience Research to Web Archives: What a Media Studies PhD Learned at the Early Scholars Spring School - Pôle Bibliothèques et Archives de la MMSH

By Seyf-el Islem Nader

The Early Scholars Spring School on Web Archives was held for the second time at the KBR Museum in Brussels on April 20th, 2026, after a successful first session in Paris in 2024. Both editions took place during the annual Web Archiving Conference (WAC) organized by the International Internet Preservation Consortium (IIPC). This year’s edition was organized by digital historians Valérie Schafer from the University of Luxembourg and Ian Milligan from the University of Waterloo, as well as Emmanuelle Bermes from the École nationale des chartes PSL (Université Paris Sciences et Lettres). The Spring School aims to engage early scholars interested in web archives in various discussions about the field. As a 2nd-year PhD student at Aix-Marseille Université, I was one such early scholar, albeit with an indirect connection to web archiving, so I signed up for the event. The insights that this Spring School provided complemented a series of answers to questions that, at times, I was not even aware I had throughout my PhD research.

In this blog post, I will share several reflections from my experience at the Early Scholars Spring School, as well as the questions it pushed me to consider exploring by the end. But first, it is important to lay out the perspective from which I approached this Spring School. My research is not on web archives, nor within history more broadly. My PhD thesis is in the field of Anglophone media and cultural studies, specifically on the impact of American science-fiction adaptations on audiences’ perceptions of ecological issues. I only learned about the existence of this Spring School through a doctoral course I enrolled in, titled “Le Web comme source ou terrain en SHS : apprendre à utiliser le web présent ou archivé dans le cadre de son doctorat” (The Web as a Source or Field in the Social Sciences and Humanities: Learning to Use the Live or Archived Web in Doctoral Research). This series of workshops is organized by Sophie Gebeil, a contemporary historian from the TELEMMe laboratory, as part of the courses offered by the École Doctorale 355 (Doctoral School 355: Spaces, Cultures, Societies), within the broader activities of the WebLab, a space co-created by Sophie Gebeil and Jean-Christophe Peyssard and dedicated to the study of the archived web and new media at Aix-Marseille Université. The wide scope of the course’s title is what attracted me to it: for my audience reception analysis, I had decided to computationally analyze a large corpus of online reviews, and these workshops offered an opportunity to train in, and gain familiarity with, the knowledge and methods involved in using the web in research.

Thus, my perspective within this Spring School and the conference in general aligns with that of researchers with a connection to the web more broadly, not web history or web archives specifically. As with the WebLab course, I was impressed by the range and variety of profiles among those who attended this Spring School. Attendees worked on projects with various relationships to web archives, from web historians and those who actively participate in web archiving to researchers, myself included, who work more on the analysis of the live web. One of the activities in the program even included discussions within subgroups that were thematically arranged around:

Archival practices
Sensitive and sensible archives
Data, context, and reception (the group within which I participated)

Beyond introducing us to thematically similar projects, these discussions allowed us to raise questions that directly relate to our particular connection to web archives. As my focus was on computational analysis and data handling, a question that particularly interested me concerned the technical skills required to work with web archives. This echoed a similar question I was asked in the WebLab course, since I had produced a corpus (a collection of online film reviews) that I could only compile using tools I developed in Python. My answer to this question was tentatively affirmative, as I can only be certain about the necessity of developing technical skills in my specific engagement with web archiving, not in the field as a whole. In my case, there was no real way around learning the basics of coding that would allow me to scrape the reviews I needed from the film review website I was studying, namely IMDb (Internet Movie Database). The process involved creating a script that can reliably scrape all of the reviews found on a film’s webpage. I designed it so that it could also capture the metadata that accompanies the reviews and that my research requires (usernames, display names, number of likes, date of publication, date of collection). This required studying the website’s HTML to target the data I needed, as well as Python libraries (like Selenium) that helped me collect the data in real time by simulating browser activity. The tool functions similarly to Browsertrix, which was introduced in the WebLab course, except that it collects specific data from a webpage (in CSV form) instead of the entire page. The script can work with any film on IMDb and is shared on GitLab via this link.
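To give a sense of what such a collection step can look like, here is a minimal sketch, not my actual script. It assumes the page's HTML is already available as a string (with Selenium, this could come from the driver's `page_source`), and it uses only Python's standard library. The markup and class names (`review`, `username`, `likes`, and so on) are invented for illustration and do not reflect IMDb's real page structure, which is precisely the kind of thing that changes over time.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical markup: these class names are made up for this example.
SAMPLE_HTML = """
<div class="review">
  <span class="display-name">Ada</span>
  <span class="username">ada_l</span>
  <span class="likes">12</span>
  <span class="published">2025-11-02</span>
  <p class="text">A thoughtful adaptation.</p>
</div>
<div class="review">
  <span class="display-name">Sam</span>
  <span class="username">sam99</span>
  <span class="likes">3</span>
  <span class="published">2025-12-20</span>
  <p class="text">The ecological themes felt rushed.</p>
</div>
"""

# Maps the (hypothetical) CSS class of an element to a CSV column name.
FIELDS = {
    "display-name": "display_name",
    "username": "username",
    "likes": "likes",
    "published": "published",
    "text": "text",
}
COLUMNS = ["username", "display_name", "likes", "published", "collected", "text"]


class ReviewParser(HTMLParser):
    """Collects one dict per <div class="review"> block."""

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._current = None  # review currently being read, if any
        self._field = None    # column the incoming text belongs to

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class", "")
        if tag == "div" and css_class == "review":
            self._current = {}
        elif self._current is not None and css_class in FIELDS:
            self._field = FIELDS[css_class]

    def handle_data(self, data):
        if self._current is not None and self._field:
            previous = self._current.get(self._field, "")
            self._current[self._field] = previous + data.strip()

    def handle_endtag(self, tag):
        if tag == "div" and self._current is not None:
            self.reviews.append(self._current)
            self._current = None
        elif tag in ("span", "p"):
            self._field = None


def extract_reviews(html, collected_on):
    """Return a list of review dicts, each stamped with the collection date."""
    parser = ReviewParser()
    parser.feed(html)
    parser.close()
    for review in parser.reviews:
        review["likes"] = int(review.get("likes") or 0)
        review["collected"] = collected_on
    return parser.reviews


def to_csv(reviews):
    """Serialize the reviews to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(reviews)
    return buffer.getvalue()
```

Calling `to_csv(extract_reviews(SAMPLE_HTML, "2026-01-15"))` produces one CSV row per review. Recording the collection date alongside the publication date is a deliberate choice: it documents when the snapshot of the page was taken.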

The collective discussion we had next in the Spring School helped complement the answer. The organizers of the event agreed that regardless of the specific needs of a project, a minimum level of technical proficiency seems to be inevitable as long as we work with the web. This, it was argued, is primarily imposed by the unstable nature of the web that makes it challenging to provide accessible tools that can be used universally and reliably. This was indeed behind my earlier hesitant conclusion on the matter, as the challenges I have faced in my own collection process—wide variety of web structures, dynamic mechanisms, platform changes, protection measures, and more—make it nearly impossible to apply my tools beyond my specific use case and web source, at this particular time. Not all academic engagement with the web requires technical proficiency at this level, but we agreed that there needs to be a minimum and, more importantly, a willingness to adapt to the web’s instability.

On this note, the question of artificial intelligence, particularly LLMs, was also raised. Professor Valérie Schafer argued that such tools are seriously limited as “substitutes” for acquiring the technical skills required in the field. She cited the aforementioned instability factor but also the vulnerability, and even danger, of relying on tools whose mechanisms we do not fully understand. This echoed the perspective I already had on the issue. From my own experience, AI tools can at best serve as assistants in developing the technical skills required in the field, but such a learning process is absolutely fundamental and cannot be replaced by AI.

The question of artificial intelligence had already come up earlier, in our roundtable discussion following the BelgicaWeb Symposium that took place in the morning. This concluding symposium presented the results of a two-year project that aims to make Belgium’s digital heritage accessible and FAIR (Findable, Accessible, Interoperable, and Reusable). The panel consisted of experts from different fields (Digital Humanities, Computer Science, Law, and Library Practice) who walked us through the different aspects of the project as well as their workflow, from the selection of the archived content all the way to data enrichment.

In contrast to our earlier point of discussion, this symposium made a point of going beyond the technical dimension of the project, emphasizing that it is also a legal, institutional, and user-oriented archiving process. This, as stated by the project’s legal expert Elodie Lecroit, was mainly driven by two factors: first, unlike France, Belgium lacks a clear legal-deposit framework for digital content in which a more mature web-archiving regime can be anchored; second, the European copyright exception for preservation by cultural heritage institutions cannot be invoked if the protected content is not already part of the institution’s collection. This has apparently pushed the project to create collections with a controlled scope and metadata (only around 2,000 URLs). The question of artificial intelligence arose as the team explained their use of LLMs to enrich the collections’ metadata. This invited a discussion on the role of AI in web archiving, in which attendees suggested using it to describe images as another form of metadata, or as a tool on the user’s end to help navigate the collected archives.

These discussions helped me put into perspective what I have learned so far from my courses at the WebLab. On the methodological front, I could see up close how a web archiving project is handled in practical terms. For example, the panel explained that they used Browsertrix to preserve the websites needed for their collections. This illustrated to me the utility of such tools in the work of archiving organizations on a global level, well beyond their more limited use in a small PhD research project. That is not to say the panel was not insightful for my own project as well. The next step in my thesis is to analyze the textual data of the film reviews I have gathered, and this panel’s workflow brought to my attention that enriching the gathered data should be part of the process.

Conversely, the legal and institutional web archiving scene in Belgium helped me develop a fuller understanding of the challenges that the field faces on an international level. The stark contrast in scope between the collections of this project and web archiving practices in France, mainly the legal deposit carried out by the Bibliothèque nationale de France (BnF) and the Institut national de l’audiovisuel (INA), clearly showed that we are not there yet in terms of a solid web archiving infrastructure across different countries. We could indeed see wide variation and inconsistency in the maturity of archiving frameworks, not only through this France/Belgium comparison but also among the other countries represented at the Spring School, namely Canada, Luxembourg, and Switzerland.

This difference in scope also raised the question of selection in our later discussions. The small number of 2,000 URLs may have been defined by what BelgicaWeb considered highly relevant to Belgian cultural heritage, but we all agreed that the criteria for deciding which websites or webpages meet that “high relevance” standard are difficult to establish fairly and are likely to result in flawed or incomplete data. In fact, through this discussion, we realized that striving for completeness may inevitably be challenging regardless of scope. For instance, Marina Hervieu, a research engineer working on the SkyTaste project at the École nationale des chartes, highlighted the limitations imposed on the BnF in its Skyblog collections. Skyblog is a well-known French blogging platform that shut down in 2023, after which the BnF archived 12.6 million blogs. It is a substantial archive that was informed by prior knowledge of the platform’s shutdown and aimed at producing collections that are as complete as possible. However, SkyTaste researchers, who aim to study the role of the platform in the development of digital culture in France, underline the importance of taking into account the choices made by archiving institutions during the collection process, such as limiting the web elements to be compiled, prioritizing certain content, or reconfiguring the structure of webpages to technically facilitate the collection process. All of these, Marina argues, are choices that affect the degree of faithfulness to the live web. In this light, we agreed that it is nearly impossible to claim exhaustiveness for web archives, regardless of how massive they seem, as in the end they represent only a specific part of the web space and almost never its entirety.

At first glance, these selection issues did not seem to concern my specific project. After all, I have already collected all of the reviews that I needed on the specific films I am studying, from one platform at least. This gave me the impression that my collection was nearly complete. However, since the collection took place, more reviews have been written, some have been altered, and some have even been deleted from the platform. The webpages I scraped a few months ago are not exactly the same now and will increasingly change with time. Again, we agreed that there is probably no solution to this problem, and we should accept that collections cannot be perfect and complete. Instead, we must work on identifying and acknowledging the weaknesses and limitations of our criteria and methods. This interaction with different use cases of web archives and the challenges they face has put the insights I have learned at the WebLab more clearly into perspective, particularly on the necessity of rigorously framing our methodologies, scope, and limitations as well as ensuring adherence to reference and stabilization standards.
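One modest way to document this drift, rather than solve it, is to compare two collection snapshots of the same film's reviews. The sketch below is only an illustration under simple assumptions: each snapshot is a dictionary mapping a review identifier to the review text, and the identifiers and texts shown are invented.

```python
def compare_snapshots(old, new):
    """Report which reviews were added, deleted, or altered between two snapshots.

    Both arguments map a review identifier to the review text.
    """
    old_ids, new_ids = set(old), set(new)
    return {
        "added": sorted(new_ids - old_ids),
        "deleted": sorted(old_ids - new_ids),
        "altered": sorted(
            rid for rid in old_ids & new_ids if old[rid] != new[rid]
        ),
    }


# Hypothetical snapshots taken a few months apart.
snapshot_january = {
    "r1": "A thoughtful adaptation.",
    "r2": "The ecological themes felt rushed.",
    "r3": "Visually stunning.",
}
snapshot_june = {
    "r1": "A thoughtful adaptation.",
    "r2": "The ecological themes felt rushed. Edit: better on rewatch.",
    "r4": "Closer to the novel than I expected.",
}

changes = compare_snapshots(snapshot_january, snapshot_june)
```

Here `changes` lists `r4` as added, `r3` as deleted, and `r2` as altered. A report like this does not make the corpus complete, but it lets the limitations of a given collection date be stated explicitly.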

These reflections helped us transition toward the final activity of the Spring School, which was to collectively create the first sketch of a poster that would then be refined throughout the conference. The theme of the poster was sustainability, so we had to design it with that in mind. Inspired by our earlier discussions, we decided to integrate questions that may remain unanswered into our poster. One of these questions was the possibility of developing sustainable (in the sense of stable and reproducible) web archiving tools that take into account questions of selection, representation, and the sensitivity and sensibility of archives, among others. Another dimension we considered was working conditions and how we can make them sustainable for all of the actors involved in web archiving. A final axis of sustainability was the environmental dimension, though it was perhaps the least developed in our discussions beyond the ecological implications of using AI in web archiving. Aesthetically, this environmental dimension inspired the attendees to design the poster in the form of various plants, with questions underneath to mirror their roots and possible answers mirroring the blossoming flowers and trees that were drawn.

Thus, for an early scholar with the perspective detailed above, this Spring School provided two different kinds of insight. The first is broad, and perhaps less immediately “useful” to my thesis, but nonetheless intellectually stimulating. Thanks to our collective discussions and direct contact with different actors in the field, I have gained a better understanding and a fuller picture of how web archiving operates on international, institutional, legal, and technical levels. The second is more personal: I have confirmed that researchers such as myself can benefit greatly from engaging with the field of web archives, owing to the shared questions and frameworks that come with working on the web, even remotely. In my specific case, I have learned that my field of media and cultural studies, particularly reception studies, needs to engage with web archiving, as its subject of study (popular culture and the discourse around it) is increasingly moving to the digital realm. It is thus in the field’s interest that existing web archiving structures mature and expand in light of these needs. On this note, one important question I had coming out of the Spring School concerned the role of different scholars in archiving the web. At some point in our discussion, Emmanuelle Bermes described those of us working on the live web as doing our own kind of localized and controlled web archiving. Indeed, this is a question I will explore in the future, using my PhD research as a case study: to what extent can individual researchers who engage with the web and, along the way, conserve different parts of it for purposes other than archiving be considered contributors to the process of web archiving? This question pushes me to seriously consider establishing a system that allows me to conserve my corpus, perhaps both as data and as entire webpages that would visualize and demonstrate the full configuration of the platforms I worked on.
This also compels me to explore the options available for depositing my corpus and making it accessible for future research. In short, my entire journey from the WebLab courses to this Spring School brought to my attention possibilities that I would not otherwise have considered: solidifying my research methods and potentially contributing to a worldwide effort to conserve the web.
