StreamSets Data Collector: StreamSets Data Collector is a lightweight and powerful engine that streams data in real time. It allows you to build continuous data pipelines, each of which consumes record-oriented data from a single origin, optionally operates on those records in one or more processors, and writes data to one or more destinations.

StreamSets Origin Stage: To define the flow of data for Data Collector, you configure a pipeline. A pipeline consists of stages that represent the origin and destination of the pipeline and any additional processing that you want to perform. An origin stage represents the source for the pipeline. For example, a pipeline based on the SDC taxi data tutorial (https://streamsets.com/documentation/datacollector/latest/help/#Tutorial/Overview.html) uses the Directory origin, four processors, and the Hadoop File System destination.

StreamSets comes bundled with many origin stage components that connect to almost all commonly used data sources, and if you don’t find one for your source system, don’t worry: StreamSets APIs are…
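The origin, processor, and destination flow described above can be sketched in a few lines of plain Python. This is only an illustration of the pipeline model, not the StreamSets API: `run_pipeline`, `to_numbers`, `add_total`, and the sample taxi-style records are all hypothetical names and data made up for this sketch.

```python
from typing import Callable, Iterable, List

Record = dict


def run_pipeline(
    origin: Iterable[Record],
    processors: List[Callable[[Record], Record]],
    destinations: List[Callable[[Record], None]],
) -> int:
    """Read records from a single origin, apply each processor in order,
    then hand the result to every destination. Returns the record count."""
    count = 0
    for record in origin:
        for process in processors:
            record = process(record)
        for write in destinations:
            write(record)
        count += 1
    return count


# Hypothetical sample data standing in for an origin such as a directory of files.
trips = [{"fare": "12.50", "tip": "2.00"}, {"fare": "7.00", "tip": "0.00"}]


def to_numbers(record: Record) -> Record:
    # Processor 1 (hypothetical): convert string fields to floats.
    return {key: float(value) for key, value in record.items()}


def add_total(record: Record) -> Record:
    # Processor 2 (hypothetical): derive a new field from existing ones.
    return {**record, "total": record["fare"] + record["tip"]}


# Two in-memory lists stand in for two destinations.
sink_a: List[Record] = []
sink_b: List[Record] = []

processed = run_pipeline(
    iter(trips), [to_numbers, add_total], [sink_a.append, sink_b.append]
)
```

After the run, both `sink_a` and `sink_b` hold the same processed records, which mirrors the idea of one origin feeding one or more destinations.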
Month: December 2016
Kafka – A great choice for large scale event processing

Kafka is a highly scalable, highly available queuing system, built to handle huge message throughput at lightning-fast speeds. The Clairvoyant team has used Kafka as a core part of the architecture in a production environment, and overall we were quite satisfied with the results, but there are still a few caveats to bear in mind.

The backdrop: For one of the largest health insurance companies in America, we had to build an archival system to house electronic copies of all documents produced for their members, and provide on-demand search and retrieval of those documents based on metadata and text content.

The solution: We built a system where events (which correspond to new documents produced) move through various queues, so that the system can process them without getting overwhelmed and can scale to handle the huge number of documents produced daily for all their members. Our messaging system of choice…
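The idea of events moving through a chain of queues, so that no stage gets overwhelmed, can be illustrated with a small in-process sketch. It uses Python's standard `queue.Queue` and threads rather than Kafka, and the stage functions `extract_metadata` and `index_document` are hypothetical stand-ins, not the system described in the post.

```python
import queue
import threading

STOP = object()  # sentinel that tells each stage to shut down


def stage(inbox, outbox, work):
    """Take events from inbox, apply work, and pass results to outbox."""
    while True:
        event = inbox.get()
        if event is STOP:
            outbox.put(STOP)  # propagate shutdown downstream
            break
        outbox.put(work(event))


def run_stages(events, works, capacity=2):
    """Push events through one bounded queue per stage and collect results."""
    queues = [queue.Queue(maxsize=capacity) for _ in range(len(works) + 1)]
    workers = [
        threading.Thread(target=stage, args=(queues[i], queues[i + 1], work))
        for i, work in enumerate(works)
    ]
    results = []

    def drain():
        # Collect output from the last queue until the sentinel arrives.
        while True:
            item = queues[-1].get()
            if item is STOP:
                break
            results.append(item)

    collector = threading.Thread(target=drain)
    for thread in workers + [collector]:
        thread.start()

    for event in events:
        queues[0].put(event)  # blocks while the first queue is full
    queues[0].put(STOP)

    for thread in workers + [collector]:
        thread.join()
    return results


def extract_metadata(event):
    # Hypothetical stage: attach metadata to a "new document" event.
    return {**event, "member": "member-%d" % (event["doc_id"] % 2)}


def index_document(event):
    # Hypothetical stage: mark the document as searchable.
    return {**event, "indexed": True}


archived = run_stages(
    [{"doc_id": i} for i in range(5)], [extract_metadata, index_document]
)
```

Because each queue is bounded, a fast producer blocks instead of flooding a slow stage. A broker such as Kafka plays a similar buffering role across processes and machines, with durability and partitioning that this sketch does not attempt to model.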