You cannot select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
45 lines
802 B
Markdown
45 lines
802 B
Markdown
# Crawler
|
|
|
|
The crawler is the central process of Trandoshan.
|
|
It consumes URL, crawl them and publish the page body while following redirects etc...
|
|
|
|
## Consumes
|
|
|
|
- URL (url.todo)
|
|
|
|
## Produces
|
|
|
|
- Resource (resource.new)
|
|
|
|
# Extractor
|
|
|
|
The extractor is the data extraction process of Trandoshan.
|
|
It consumes crawled resource, extract data (urls, metadata, etc...) from it,
|
|
store them into an ES instance (by calling the API), & publish found URLs.
|
|
|
|
## Consumes
|
|
|
|
- Resource (resource.new)
|
|
|
|
## Produces
|
|
|
|
- URL (url.found)
|
|
- Metadata
|
|
- Body
|
|
|
|
# Scheduler
|
|
|
|
The scheduler is the process responsible for crawling schedule part.
|
|
It determinates which URL should be crawled and publish them.
|
|
|
|
## Consumes
|
|
|
|
- URL (url.found)
|
|
|
|
## Produces
|
|
|
|
- URL (url.todo)
|
|
|
|
# API
|
|
|
|
The API process is mainly used to get data from ES. |