Installation (running your own crawlers)
- Make a copy of the
- Add your own crawler YAML configurations into the
- Add your Python extensions into the
src directory (if applicable).
- Update setup.py with the name of your project and any additional dependencies.
- If you need to (e.g. if your database connection or directory structure is different), update any environment variables in the
docker-compose.yml, although the defaults should work fine.
- Run docker-compose up -d. This might take a while when it’s building for the first time.
You can access the Memorious CLI through the
docker-compose run --rm worker /bin/bash
To see the crawlers available to you:
And to run a crawler:
memorious run my_crawler
See Usage (or run
memorious --help) for the complete list of Memorious commands.
Note: you can use any directory structure you like,
config are not required, and nor is separation of YAML and Python files. So long as the
MEMORIOUS_CONFIG_PATH environment variable points to a directory containing, within any level of directory nesting, your YAML files, Memorious will find them.
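The nested lookup described in the note can be illustrated with a short sketch. This is not Memorious's own loader; the function name `find_crawler_configs` and the choice to match both `.yml` and `.yaml` extensions are assumptions made for illustration. It only shows what "within any level of directory nesting" means in practice, which can help when checking that your files sit somewhere under the configured path.

```python
import os
from pathlib import Path


def find_crawler_configs(config_path):
    """Return every YAML file below config_path, at any nesting depth.

    Hypothetical helper: the extensions matched here are an assumption,
    not a statement about what Memorious itself accepts.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(config_path):
        for name in filenames:
            if name.lower().endswith((".yml", ".yaml")):
                found.append(Path(dirpath) / name)
    return sorted(found)


if __name__ == "__main__":
    # MEMORIOUS_CONFIG_PATH is the variable named in the text;
    # the "." fallback exists only so this demo runs without it.
    for path in find_crawler_configs(os.environ.get("MEMORIOUS_CONFIG_PATH", ".")):
        print(path)
```

Running the script with MEMORIOUS_CONFIG_PATH set prints each YAML file it can see, which is a quick way to confirm that the directory layout you chose is reachable.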
Your Memorious instance is configured by a set of environment variables that control database connectivity and general principles of how the system operates. You can set all of these in the
ALEPH_HOST, default is
https://data.occrp.org/, but any instance of Aleph 2.0 or greater should work.
ALEPH_API_KEY, a valid API key for use by the upload operation.
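As a rough illustration of how these two variables behave, the sketch below reads them from the environment and applies the default host given above. The function name `aleph_settings` and the dictionary layout are hypothetical; Memorious's actual settings code may be organised differently.

```python
import os

# Default taken from the text above.
DEFAULT_ALEPH_HOST = "https://data.occrp.org/"


def aleph_settings(environ=None):
    """Collect the Aleph-related settings from an environment mapping.

    Hypothetical helper for illustration only. Pass a dict to inspect
    a configuration without touching the real process environment.
    """
    environ = os.environ if environ is None else environ
    return {
        "host": environ.get("ALEPH_HOST", DEFAULT_ALEPH_HOST),
        # No default: the upload operation needs a valid key from you.
        "api_key": environ.get("ALEPH_API_KEY"),
    }


if __name__ == "__main__":
    settings = aleph_settings()
    print("Aleph host:", settings["host"])
    print("API key set:", settings["api_key"] is not None)
```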
Shut it down
To gracefully exit, run
Files which were downloaded by crawlers you ran, Memorious progress data from the Postgres database, and the RabbitMQ queue, are all persisted in the
build directory, and will be reused next time you start it up. (If you need a completely fresh start, you can delete this directory.)
Building a crawler
When you’re working on your crawlers, it’s not convenient to rebuild your Docker containers all the time. To run without Docker:
- Copy the environment variables from the
If you leave
MEMORIOUS_DATABASE_URI unset, it will use SQLite. Otherwise you need to set it to match a local Postgres database.
MEMORIOUS_CONFIG_PATH points to your crawler YAML files, wherever they may be.
pip install memorious. If your crawlers use Python extensions, you’ll need to run
pip install in your crawlers directory as well;
- or clone the Memorious repository and run
make install (this will also install your crawlers for you).
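Before running crawlers outside Docker, it can help to confirm that the two variables discussed above are set the way you expect. The following is a minimal preflight sketch, assuming only what the text states: an unset MEMORIOUS_DATABASE_URI means SQLite is used, and MEMORIOUS_CONFIG_PATH should point at a directory. The function name `check_dev_environment` and its return shape are hypothetical.

```python
import os


def check_dev_environment(environ=None):
    """Return (backend, problems) for a local, non-Docker setup.

    Hypothetical helper: it inspects environment variables only and
    never connects to a database.
    """
    environ = os.environ if environ is None else environ
    problems = []

    # Per the text: leaving MEMORIOUS_DATABASE_URI unset falls back to SQLite.
    db_uri = environ.get("MEMORIOUS_DATABASE_URI")
    backend = "sqlite" if not db_uri else "custom"

    config_path = environ.get("MEMORIOUS_CONFIG_PATH")
    if not config_path:
        problems.append("MEMORIOUS_CONFIG_PATH is not set")
    elif not os.path.isdir(config_path):
        problems.append("MEMORIOUS_CONFIG_PATH is not a directory: " + config_path)

    return backend, problems


if __name__ == "__main__":
    backend, problems = check_dev_environment()
    print("Database backend:", backend)
    for problem in problems:
        print("Problem:", problem)
```

The check is deliberately limited to what can be verified locally; whether a custom database URI actually works is only known once Memorious connects to it.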