Crawler

The crawler is the central process of Trandoshan. It consumes URLs, crawls them, and publishes the body of each page, following redirects along the way.

Consumes

URL (url.todo)

Produces

Resource (resource.new)
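
As an illustration of the consume/produce contract above, here is a minimal sketch in Python. It is not Trandoshan's implementation: the message shape (a dict with `url` and `body` keys), the `crawl` and `http_fetch` names, and the injectable `fetch` parameter are all assumptions made for this example. Only the subject names `url.todo` and `resource.new` come from this document.

```python
import urllib.request
from typing import Callable, Dict, Tuple

# Hypothetical fetcher type: takes a URL, returns (final_url, body).
Fetcher = Callable[[str], Tuple[str, str]]


def http_fetch(url: str) -> Tuple[str, str]:
    """Fetch a page with the standard library.

    urlopen follows HTTP redirects by default, and geturl() reports
    the URL that was finally served.
    """
    with urllib.request.urlopen(url, timeout=10) as response:
        body = response.read().decode("utf-8", errors="replace")
        return response.geturl(), body


def crawl(message: Dict[str, str], fetch: Fetcher = http_fetch) -> Tuple[str, Dict[str, str]]:
    """Turn one url.todo message into one resource.new message.

    The message layout is assumed for illustration only.
    """
    final_url, body = fetch(message["url"])
    return "resource.new", {"url": final_url, "body": body}
```

Passing the fetcher as a parameter keeps the sketch testable without network access; a real deployment would read from and publish to the message broker instead of returning a tuple.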

Extractor

The extractor is the data extraction process of Trandoshan. It consumes crawled resources, extracts data (URLs, metadata, etc.) from them, stores that data in an ES (Elasticsearch) instance by calling the API, and publishes the URLs it finds.

Consumes

Resource (resource.new)

Produces

URL (url.found)
Metadata
Body
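
The extraction step could look roughly like the sketch below. It assumes the resource is a dict with `url` and `body` keys and treats the page title as the only metadata; both choices, and the names `LinkAndTitleParser` and `extract`, are hypothetical. Storing the returned document through the API is left out because this document does not describe the API's endpoints.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTitleParser(HTMLParser):
    """Collect href values of <a> tags and the text of <title>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract(resource):
    """Return (document to store, list of url.found messages)."""
    parser = LinkAndTitleParser()
    parser.feed(resource["body"])
    # Resolve relative links against the URL of the crawled page.
    urls = [urljoin(resource["url"], href) for href in parser.links]
    document = {
        "url": resource["url"],
        "title": parser.title.strip(),  # metadata (assumed: title only)
        "body": resource["body"],
    }
    return document, [("url.found", url) for url in urls]
```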

Scheduler

The scheduler is the process responsible for scheduling crawls. It determines which URLs should be crawled and publishes them.

Consumes

URL (url.found)

Produces

URL (url.todo)
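
This document does not say which rules the scheduler applies, so the following sketch uses two assumed ones purely for illustration: accept only `http`/`https` URLs with a host, and never schedule the same URL twice. The `Scheduler` class and its in-memory `_seen` set are hypothetical.

```python
from urllib.parse import urlparse


class Scheduler:
    """Decide whether a url.found message becomes a url.todo message."""

    def __init__(self):
        # URLs already scheduled; kept in memory for this sketch only.
        self._seen = set()

    def schedule(self, found_url):
        parsed = urlparse(found_url)
        # Assumed rule 1: only crawl http(s) URLs that have a host.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            return None
        # Assumed rule 2: do not schedule the same URL twice.
        if found_url in self._seen:
            return None
        self._seen.add(found_url)
        return ("url.todo", found_url)
```

Together with the crawler and the extractor, this closes the loop: `url.todo` feeds the crawler, `resource.new` feeds the extractor, and `url.found` feeds the scheduler, which publishes `url.todo` again.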

API

The API process is mainly used to get data from ES; the extractor also calls it to store extracted data.
