OSP (Open Source Publishing) → you’re traveling towards brol/scraper/README in le75

REQUIREMENTS
------------
Python 2.7
MongoDB

INSTALLATION
------------
Probably best to run it in a virtual environment:

virtualenv venv
. venv/bin/activate
pip install -r requirements.txt

To fill the database:
./fill_db.py

USAGE
-----
To run the scraper:
./scraper.py

The scraper should avoid inserting the same content twice by hashing the scraped HTML.
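
A minimal sketch of that idea, not the code in scraper.py: it assumes a SHA-1 digest of the HTML serves as the duplicate key, and it keeps the seen digests in a plain set where the real scraper would presumably query MongoDB. The names content_hash and insert_once are hypothetical.

```python
import hashlib


def content_hash(html):
    """Return a hex digest identifying this exact HTML string."""
    return hashlib.sha1(html.encode("utf-8")).hexdigest()


def insert_once(html, seen, store):
    """Append html to store unless an identical string was stored before."""
    digest = content_hash(html)
    if digest in seen:
        return False  # duplicate, nothing inserted
    seen.add(digest)
    store.append(html)
    return True


# Assumed sample data: the same fragment scraped twice.
seen, store = set(), []
first = insert_once("<p>hello</p>", seen, store)
second = insert_once("<p>hello</p>", seen, store)
```

Only byte-identical HTML counts as a duplicate here; a page that changes a single character yields a new digest and is stored again.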

To run the 'app':
./app.py

Visit the 'app' with the browser:
http://localhost:7575
Overview of HTML in the DB

http://localhost:7575/p
Overview of all the p-tags in the DB

http://localhost:7575/texts
Overview of all the text-strings in the DB

http://localhost:7575/images
Only the images from the DB

http://localhost:7575/combined
Text overlaid on the first image

DB
--
For now I've used MongoDB, but there's no real reason for it, except that its schemalessness makes it pretty flexible.

There are two collections: pages, snippets.

Pages contains the pages to scrape and the selectors to use on each page. All 'snippets' (perhaps 'chunks' is a better name?) found by the selectors are stored in the snippets collection. By hashing the content of each snippet I try to avoid inserting it twice.
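
To make the two collections concrete, here is a rough sketch of what the documents might look like. Every field name (url, selectors, page, selector, content, hash) is an assumption for illustration, not taken from fill_db.py or scraper.py, and a dict keyed by hash stands in for the snippets collection.

```python
import hashlib

# Hypothetical document in the pages collection:
# one page to scrape plus the selectors to apply to it.
page = {
    "url": "http://example.org/",
    "selectors": ["p", "img"],
}


def make_snippet(page_url, selector, content):
    """Build a snippet document; 'hash' is derived from the content only."""
    digest = hashlib.sha1(content.encode("utf-8")).hexdigest()
    return {
        "page": page_url,
        "selector": selector,
        "content": content,
        "hash": digest,
    }


def store_snippet(snippets, snippet):
    """Insert into a dict keyed by hash (stand-in for the snippets collection)."""
    if snippet["hash"] in snippets:
        return False  # same content already stored
    snippets[snippet["hash"]] = snippet
    return True


snippets = {}
a = make_snippet(page["url"], "p", "<p>Some text</p>")
b = make_snippet(page["url"], "p", "<p>Some text</p>")
c = make_snippet(page["url"], "img", '<img src="a.png">')
results = [store_snippet(snippets, s) for s in (a, b, c)]
```

With MongoDB itself, a unique index on the hash field could serve the same purpose as the dict key used here.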
