Snapshot

In this portfolio project, I re-developed a web crawler using the Scrapy framework. The goal was to create a tool that could navigate Drupal sites built on the Edweb distribution and report page details such as structure, age, and cookies. The crawler also generated a site map connecting all pages within the site.
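
As a rough illustration of the site map idea, the sketch below builds a map from each crawled page to the internal pages it links to. It is not the project's actual code: the function name `build_site_map` and the input shape (a dict of page URL to the raw `href` values found on that page) are assumptions made for the example.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def build_site_map(pages):
    """Map each page URL to the sorted list of internal pages it links to.

    `pages` is a hypothetical input: {page_url: [href, href, ...]}.
    """
    site_map = {}
    for page_url, hrefs in pages.items():
        site_host = urlparse(page_url).netloc
        # Register the page even if it turns out to have no internal links.
        links = site_map.setdefault(page_url, set())
        for href in hrefs:
            # Resolve relative links and drop #fragments so each page has one key.
            target, _fragment = urldefrag(urljoin(page_url, href))
            # Keep only links that stay on the same host and are not self-links.
            if urlparse(target).netloc == site_host and target != page_url:
                links.add(target)
    return {page: sorted(links) for page, links in site_map.items()}
```

In a Scrapy spider, the `href` values would typically come from the response of each crawled page; here they are passed in directly so the example stays independent of any framework.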

One significant task involved updating the existing scraper from Python 2 to Python 3. I also replaced a PHP script responsible for inserting CSV data into an SQL database. The original process wrote scraped data to an intermediate CSV file; I replaced this step with Scrapy item pipelines that insert the data directly into the SQL database.
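
A minimal sketch of what such a pipeline might look like is shown below. It assumes SQLite via the standard library, a hypothetical `pages` table, and hypothetical item fields (`url`, `title`, `last_modified`); the write-up does not say which SQL database or schema the project used. The method names `open_spider`, `process_item`, and `close_spider` are the hooks Scrapy calls on an item pipeline.

```python
import sqlite3


class SqlitePipeline:
    """Scrapy-style item pipeline that writes each item straight to SQL."""

    def __init__(self, db_path="snapshot.db"):
        # Hypothetical default database file; adjust to the real target.
        self.db_path = db_path

    def open_spider(self, spider):
        # Called once when the spider starts: open the connection and
        # make sure the (assumed) table exists.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            "url TEXT PRIMARY KEY, title TEXT, last_modified TEXT)"
        )

    def process_item(self, item, spider):
        # Called for every scraped item: insert it, replacing any older
        # row for the same URL, instead of writing a CSV line.
        self.conn.execute(
            "INSERT OR REPLACE INTO pages (url, title, last_modified) "
            "VALUES (?, ?, ?)",
            (item["url"], item.get("title"), item.get("last_modified")),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()
```

A pipeline like this is switched on through the `ITEM_PIPELINES` setting in the Scrapy project, for example `ITEM_PIPELINES = {"myproject.pipelines.SqlitePipeline": 300}`, where the module path is a placeholder.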

The project also included building a Flask API to retrieve data from the SQL database, and a PHP front-end to serve as the Snapshot application interface. I refined the Flask application to use Jinja2 templates for rendering, eliminating the need for separate API endpoints.
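
The template-based approach can be sketched roughly as follows: the view queries the database and renders HTML through Jinja2 in one step, so no separate JSON endpoint is needed. This is an illustrative sketch only. The `pages` table, the `DATABASE` config key, and the inline template are assumptions, and SQLite stands in for whichever SQL database the project used.

```python
import sqlite3

from flask import Flask, render_template_string

app = Flask(__name__)
app.config["DATABASE"] = "snapshot.db"  # hypothetical database path

# Inline Jinja2 template to keep the example in one file; a real app
# would more likely use render_template() with a file in templates/.
PAGES_TEMPLATE = """
<h1>Snapshot</h1>
<ul>
{% for page in pages %}
  <li><a href="{{ page.url }}">{{ page.title or page.url }}</a></li>
{% else %}
  <li>No pages crawled yet.</li>
{% endfor %}
</ul>
"""


@app.route("/")
def index():
    # Query the database and render the result directly as HTML.
    conn = sqlite3.connect(app.config["DATABASE"])
    conn.row_factory = sqlite3.Row
    try:
        rows = conn.execute("SELECT url, title FROM pages ORDER BY url").fetchall()
    finally:
        conn.close()
    pages = [dict(row) for row in rows]
    return render_template_string(PAGES_TEMPLATE, pages=pages)
```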

To enhance reliability, I implemented additional error handling, making the crawler more resilient to crashes and failures during scraping. I also extended it to cover additional University of Edinburgh domains, such as the School of Engineering and the Business School, broadening its scope.