
Posts by: jcsladcik

Retuning the Heavens: Machine Learning and Ancient Astronomy

What can we learn about machine learning from ancient astronomy?

When thinking about machine learning, it is easy to be model-centric and get caught up in the details of getting a new model up and running: preparing a dataset, partitioning the training and test data, engineering and selecting features, finding an appropriate metric, choosing a model, and tuning the hyperparameters. This model-centric view is reinforced by the fact that we don’t always have control over the data or how it was collected. In most cases, we are presented with a dataset collected by someone else and asked what we can make of it. As a result, it is easy to accept the data as given and overfit your thinking about machine learning to the specifics of your own modeling process and experience. Sometimes it is a good idea to step away from these details and remind yourself of the basic components of a model and its data, how they interact with each other, and how they evolve.
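For concreteness, here is a minimal sketch of that model-centric workflow, assuming scikit-learn and one of its built-in toy datasets; the pipeline steps and parameter grid are illustrative choices, not taken from the post itself.

# A minimal sketch of the model-centric workflow described above:
# split the data, select features, choose a model, tune hyperparameters.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Partition the data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Chain feature scaling, feature selection, and a model into one pipeline.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif)),
    ("model", LogisticRegression(max_iter=5000)),
])

# Tune hyperparameters against a chosen metric (accuracy here).
search = GridSearchCV(
    pipeline,
    param_grid={"select__k": [5, 10, 20], "model__C": [0.1, 1.0, 10.0]},
    scoring="accuracy",
)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))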

Read More

Extracting Target Labels from Deep Learning Classification Models

In the blog post Configuring a Neural Network Output Layer, we highlighted how to correctly set up an output layer for deep learning models. Here, we discuss how to make sense of what a neural network actually returns from its output layer. If you are like me, you may have been surprised when you first encountered the output of a simple classification neural net.
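As a preview of the idea, here is a minimal NumPy sketch: a classification network typically returns one probability (or logit) per class for each sample rather than the labels themselves, and the predicted label is recovered with argmax. The probabilities and class names below are made up for illustration.

import numpy as np

# Each row is one sample; each column is the network's score for a class.
probabilities = np.array([
    [0.10, 0.85, 0.05],   # sample 1: most likely class 1
    [0.70, 0.20, 0.10],   # sample 2: most likely class 0
])

# The predicted label is the index of the largest value in each row.
predicted_labels = np.argmax(probabilities, axis=1)
print(predicted_labels)  # [1 0]

# If the class indices map to names, translate them back.
class_names = np.array(["cat", "dog", "bird"])
print(class_names[predicted_labels])  # ['dog' 'cat']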

Read More

Choosing the Right Number of Clusters

Introduction

When I first started my machine learning journey, K-means clustering was one of the first algorithms I was introduced to, and it is still one of my favorites to this day. I was amazed at how elegant yet comprehensible the procedure was. There is something oddly satisfying about watching the cluster assignments and centroids update with each iteration. While K-means clustering has stood the test of time since its inception in the 1950s, it comes with one foundational requirement: choosing the correct number of clusters, the K in K-means. In this month’s newsletter, we’ll explore a technique known as the elbow method to help determine the ideal number of clusters for a given clustering task. To conclude, we will explore another type of clustering algorithm, Affinity Propagation, that does not require a predetermined number of clusters.
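As a taste of both techniques, here is a minimal sketch using scikit-learn on synthetic data; generating blobs with four centers is an illustrative assumption, not an example from the post.

# Elbow method: fit K-means for a range of K and record the inertia
# (within-cluster sum of squared distances); the bend suggests K.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in range(1, 11)
]
plt.plot(range(1, 11), inertias, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("Inertia")
plt.show()

# Affinity Propagation infers the number of clusters on its own.
ap = AffinityPropagation(random_state=0).fit(X)
print("Clusters found:", len(ap.cluster_centers_indices_))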

Read More

Prospecting for Data on the Web

Introduction

At Enthought we teach a lot of scientists and engineers about using Python and the ecosystem of scientific Python packages for processing, analyzing, and visualizing data. Most of what we teach involves nice, clean data sets: collections of data that have been carefully collected, scrubbed, and prepared for analysis. While we mention in passing the idea of collecting data from the web, work a few examples of general data cleanup, and show our students each of the tools needed, we seldom have enough time in class to follow a complete, practical example of web data prospecting from end to end. This newsletter should help remedy that.
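As a quick taste of what that prospecting looks like, here is a minimal sketch using pandas to pull a table straight from a web page; the URL is an illustrative example (not from the original post), and pd.read_html needs an HTML parser such as lxml installed.

import pandas as pd

# Fetch and parse every <table> element on the page.
url = "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)"
tables = pd.read_html(url)

# Pick the first table; real-world cleanup usually follows
# (renaming columns, coercing types, handling footnote markers, etc.).
df = tables[0]
print(df.head())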

The Problem

While the internet is a great resource for many things, including data, the web’s wild and tangled nature presents a few problems:

Read More