REQUIREMENTS
------------

python2.7
mongodb

INSTALLATION
------------

It is probably best to run it in a virtual environment:

    virtualenv venv
    . venv/bin/activate
    pip install -r requirements.txt

To fill the database:

    ./fill_db.py

USAGE
-----

To run the scraper:

    ./scraper.py

The scraper should avoid inserting the same content twice by hashing the scraped HTML.

To run the 'app':

    ./app.py

Visit the 'app' with the browser:

    http://localhost:7575           Overview of HTML in the DB
    http://localhost:7575/p         Overview of all the p-tags in the DB
    http://localhost:7575/texts     Overview of all the text-strings in the DB
    http://localhost:7575/images    Only the images from the DB
    http://localhost:7575/combined  Text overlaid on the first image

DB
--

For now I've used MongoDB, but there's no real reason for it, except that its schema-less-ness makes it pretty flexible for now.

There are two collections: pages and snippets. Pages contains the pages to scrape and the selectors to use on each page. All 'snippets' (perhaps 'chunks' is a better name?) found by the selectors are stored in the snippets collection. By hashing the content of each snippet I try to avoid double insertion.
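
The duplicate check described in the DB section could look roughly like the sketch below. It is an illustration, not the code in scraper.py: the names content_hash, InMemorySnippets and insert_once, the choice of SHA-1, and the document fields are all assumptions, and a plain dict stands in for the snippets collection so the example runs without MongoDB.

```python
import hashlib


def content_hash(html):
    """Return a SHA-1 hex digest of a snippet's content (assumed hash choice)."""
    # hashlib needs bytes, so encode text first.
    if not isinstance(html, bytes):
        html = html.encode("utf-8")
    return hashlib.sha1(html).hexdigest()


class InMemorySnippets(object):
    """Hypothetical stand-in for the snippets collection, keyed by content hash."""

    def __init__(self):
        self.docs = {}

    def insert_once(self, page_url, html):
        """Store the snippet unless identical content was stored before.

        Returns True if the snippet was inserted, False if it was a duplicate.
        """
        digest = content_hash(html)
        if digest in self.docs:
            return False
        # Field names here are illustrative only.
        self.docs[digest] = {"hash": digest, "page": page_url, "content": html}
        return True


if __name__ == "__main__":
    store = InMemorySnippets()
    print(store.insert_once("http://example.com", "<p>hello</p>"))
    print(store.insert_once("http://example.com", "<p>hello</p>"))
```

With a real MongoDB collection, one option would be a unique index on the hash field (pymongo's create_index with unique=True), so that a second insert of the same content is rejected by the database itself.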