Data Adapters

Making Data Accessible for Machine Learning

Obtaining large, high-quality, labeled datasets to train algorithms is one of the biggest challenges in machine learning. Access alone can be hard! Data scientists working in the enterprise are often confronted with inconsistent data siloed across multiple legacy databases and stored in various formats, and they face technical and political constraints in accessing that data. Skymind helps solve this problem by making it easy to plug ML pipelines into many common data sources, providing a common interface for sharing and validating data for downstream training and modeling.

For Data Engineers


val myData = Data.getByName("my_data_name")
val fullData: RDD[DataSet] = myData.getRDD()  // Full production data
... (Build Production ETL Pipeline)

For Data Scientists


myData = Data.getByName("my_data_name")
dataFrame = myData.sample(rate=0.1, seed=42) # Get 10% of data for local experimentation
... (Train Model, etc.)
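The seeded sampling call above implies that the same rate and seed always yield the same subset. A minimal sketch of that behavior in plain Python, assuming `rate` is a per-record keep-probability and `seed` fixes the random stream (this is an illustration, not Skymind's actual implementation):

```python
import random

def sample(records, rate=0.1, seed=42):
    """Return a reproducible subset: each record is kept with probability `rate`."""
    rng = random.Random(seed)  # fixed seed -> identical subset on every run
    return [r for r in records if rng.random() < rate]

# Roughly 10% of the records, reproducible across machines and runs
subset = sample(range(10_000), rate=0.1, seed=42)
```

Because the seed pins down the subset, a data scientist's local experiment can be re-run later against exactly the same sample.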

Large organizations are under increasing pressure to transform their businesses digitally. They face competition from fast-moving startups and well-funded technology giants like Amazon. But many companies find it difficult to re-architect legacy systems, slowing their ability to leverage new technologies like machine learning. As a result, their data science teams struggle to access the data necessary to train machine learning models, and their organizations are unable to achieve the accuracy, performance, and innovation possible with the latest AI.

Skymind helps solve this problem with database connectors: tools that tap into existing data on legacy stacks, consolidate it from various legacy databases, apply transformations, and expose it in a format that machine learning algorithms can ingest and learn from.
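A rough sketch of how such a connector might be structured: each source-specific connector reads from one legacy system and normalizes rows into a common schema, and a consolidation step merges them into one training-ready stream. The interface and names here are assumptions for illustration, not Skymind's actual API:

```python
from abc import ABC, abstractmethod

class Connector(ABC):
    """Reads from one legacy source and yields rows in a common schema."""
    @abstractmethod
    def read(self):
        ...

class CsvConnector(Connector):
    """Hypothetical connector for a CSV-like source."""
    def __init__(self, rows):
        self.rows = rows  # stand-in for a real file or database handle

    def read(self):
        for row in self.rows:
            # Transform: normalize column names and types for ML ingestion
            yield {"feature": float(row["f"]), "label": int(row["y"])}

def consolidate(connectors):
    """Merge rows from every source into one stream an ML pipeline can ingest."""
    for c in connectors:
        yield from c.read()
```

Adding a new legacy source then only requires a new `Connector` subclass; downstream training code never changes.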

Key Features

  • A common interface.
    • Developers don't need to learn how to access every different storage system and manually translate data to something useful for their tools.
    • Changes to backend infrastructure, database schema or storage location (local to cloud, for example) no longer disrupt Data Science teams.
  • Access control: provide access only to approved teams/personnel and monitor access patterns.
  • Bindings to integrate with common machine learning, data science, and big data tools.
  • Metadata tracking for datasets for reproducibility and compliance.
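Taken together, the named lookup, access control, and access monitoring above might be sketched as a toy dataset registry (all names here are illustrative, not Skymind's actual API):

```python
class DatasetRegistry:
    """Toy registry: named datasets, per-team access control, access logging."""
    def __init__(self):
        self._data = {}       # name -> dataset
        self._acl = {}        # name -> set of teams allowed to read it
        self.audit_log = []   # (team, name) pairs, for monitoring access patterns

    def register(self, name, dataset, allowed_teams):
        self._data[name] = dataset
        self._acl[name] = set(allowed_teams)

    def get_by_name(self, name, team):
        if team not in self._acl.get(name, set()):
            raise PermissionError(f"{team} may not read {name}")
        self.audit_log.append((team, name))  # recorded for compliance review
        return self._data[name]
```

For example, `registry.get_by_name("my_data_name", team="data-science")` succeeds only for an approved team, and every read is recorded in the audit log.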