Bathyscaphe dark web crawler

Bathyscaphe is a fast, highly configurable, cloud-native dark web crawler written in Go.

How to start the crawler

To start the crawler, simply run the following command:

$ ./scripts/docker/start.sh

and wait for all containers to start.

Notes

  • You can start the crawler in detached mode by passing --detach to start.sh (see the example after this list).
  • Make sure you have at least 3 GB of memory available, as the Elasticsearch container alone requires 2 GB.
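
For example, assuming start.sh forwards extra flags to the underlying docker-compose invocation (as the scaling example below suggests), you can start everything in the background like this:

$ ./scripts/docker/start.sh --detach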

How to initiate crawling

Use the RabbitMQ dashboard available at localhost:15003 and publish a new JSON object to the crawlingQueue.

The object should look like this:

{
  "url": "https://facebookcorewwwi.onion"
}
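
Alternatively, you can publish the message from the command line through the RabbitMQ management HTTP API. This is only a sketch: it assumes the API is reachable on the same port as the dashboard (15003) and that the default guest/guest credentials are in use; adjust both to match your setup.

$ curl -u guest:guest \
    -H 'Content-Type: application/json' \
    -X POST 'http://localhost:15003/api/exchanges/%2F/amq.default/publish' \
    -d '{"properties": {}, "routing_key": "crawlingQueue", "payload": "{\"url\": \"https://facebookcorewwwi.onion\"}", "payload_encoding": "string"}'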

How to speed up crawling

To speed up crawling, you can scale the crawler component to increase performance. This can be done by running the following command after the crawler is started:

$ ./scripts/docker/start.sh -d --scale crawler=5

This sets the number of crawler instances to 5.
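
To check that the scale-out took effect, you can list the matching containers. This assumes docker-compose includes the service name (crawler) in the container names, which is its default behaviour:

$ docker ps --filter 'name=crawler' --format '{{.Names}}'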

How to view results

You can use the Kibana dashboard available at http://localhost:15004. You will need to create an index pattern named 'resources', and when asked for the time field, choose 'time'.
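
If you prefer not to click through the UI, the same index pattern can be created with Kibana's saved objects API. This is a sketch that assumes a Kibana version exposing the /api/saved_objects endpoint (6.x/7.x); adapt it if the bundled version differs:

$ curl -X POST 'http://localhost:15004/api/saved_objects/index-pattern/resources' \
    -H 'kbn-xsrf: true' -H 'Content-Type: application/json' \
    -d '{"attributes": {"title": "resources", "timeFieldName": "time"}}'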

How to hack the crawler

If you've made a change to one of the crawler components and wish to use the updated version when running start.sh, you just need to run the following command:

$ goreleaser --snapshot --skip-publish --rm-dist

This rebuilds all images using your local changes. After that, just run start.sh again to have the updated version running.
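
To double-check that the snapshot images were built, you can list your local Docker images. The exact repository names depend on the .goreleaser.yaml configuration, so the grep pattern below is just a guess:

$ docker images | grep -i bathyscaphe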

Architecture

The architecture details are available here.