The infrastructure needs to match the use-cases and characteristics of the data. At one end of the spectrum, we find use-cases with small, non-sensitive data and batch processing. At the other end, we see use-cases with big data, parallel processing, GDPR concerns, and real-time requirements. Data ingestion, storage, pipelines, model training, and prediction serving need to be set up accordingly. Cloud and managed services have simplified operations. We take advantage of this, but still consider the day-to-day management of the systems. Important topics when we build data infrastructure include:

  • Small and big data. Align infrastructure with the amounts of data to be stored and processed, without overcomplicating things. Carefully consider when you require parallel processing and the complexity that comes with it.
  • Technology choices. Select the tech stack and re-evaluate technologies as the use-cases mature. Choose technologies that are broadly used and supported.
  • Machine learning systems. Novel challenges accompany these systems. Handle model training and serving with care, as well as model management, including validation, qualification, deployment, and monitoring.
  • Security and privacy. Set the correct level of security and manage access to data.