# Bathyscaphe dark web crawler
![CI](https://github.com/creekorful/bathyscaphe/workflows/CI/badge.svg)
Bathyscaphe is a fast, highly configurable, cloud-native dark web crawler written in Go.
# How to start the crawler
To start the crawler, run the following command:
```sh
$ ./scripts/docker/start.sh
```
Then wait for all containers to start.
## Notes
- You can start the crawler in detached mode by passing `--detach` to `start.sh`.
- Ensure you have at least 3 GB of memory, as the Elasticsearch stack will require 2 GB.
# How to initiate crawling
You can use the RabbitMQ dashboard available at http://localhost:15003 to publish a new JSON object to the **crawlingQueue**.
The object should look like this:
```json
{
  "url": "https://facebookcorewwwi.onion"
}
```
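When queueing several URLs, it can be convenient to generate the objects from a script and paste them into the dashboard one by one. The following is only a sketch: `crawl_payloads` is a hypothetical helper that is not part of Bathyscaphe, `http://example.onion` is a placeholder, and the helper assumes the URLs contain no double quotes or backslashes, since it performs no JSON escaping.

```sh
# crawl_payloads is a hypothetical helper, not part of Bathyscaphe.
# It prints one JSON object per URL, in the shape expected on the crawlingQueue.
# Assumption: URLs contain no double quote or backslash (no JSON escaping is done).
crawl_payloads() {
    for url in "$@"; do
        printf '{"url": "%s"}\n' "$url"
    done
}

# http://example.onion is a placeholder, replace it with the URLs you want to crawl.
crawl_payloads "https://facebookcorewwwi.onion" "http://example.onion"
```

Each printed line is a complete object that can be published on its own.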
## How to speed up crawling
To speed up crawling, you can scale the number of instances of the crawler component.
This can be done by issuing the following command after the crawler has started:
```sh
$ ./scripts/docker/start.sh -d --scale crawler=5
```
This sets the number of crawler instances to 5.
# How to view results
You can use the Kibana dashboard available at http://localhost:15004. You will need to create an index pattern named
'resources', and when it asks for the time field, choose 'time'.
# How to hack the crawler
If you've changed one of the crawler components and want to use the updated version when running `start.sh`,
issue the following command:
```sh
$ goreleaser --snapshot --skip-publish --rm-dist
```
This rebuilds all images using your local changes. After that, run `start.sh` again to start the updated version.
# Architecture
The architecture details are available [here](docs/architecture.png).