Machine Learning Basic – Calculate Euclidean Distance by PowerShell

The core of Machine Learning is to find out rules in a set of data. One basic operation of Machine Learning is “Cluster”. Or simply call it “classify data”. For example, there are 1000 records of toy sales data. It will be useful for proactive new incoming customer’s behavior if we can classify the data to multiple groups (Such as “buy stuffed toys” group and “buy electronic toys” group).

So we need to leverage partition methods to classify the sales data. There are multiple ways to do so. One simple method call “K-Means“. It calculates the distance between each data point and centroids ( Center point of a group ). And then assign data points to the closest centroids. Wikipedia has a detail description of the method.

Hence, as you can see, the key to “K-Means” is to calculate distance. There are several ways of calculation. “Euclidean Distance” is one way. Please refer to Wikipedia for deep dive. Long to short, you need to distribute data to a 2D axis. Each data point has x and y value. “Euclidean Distance” between two data points is:

The formulation is simple, but you have to calculate the distance between each data point to every centroids. Following is a super simple PowerShell code to help calculate Euclidean Distance of a 3 clustered data.

$K1 = (3.67, 9)
$K2 = (7, 4.33)
$K3 = (1.5, 3.5)

$input = Import-Csv "c:\temp\input.txt"

$i = 1
foreach ($seed in $input){
    $K1Result= [math]::Sqrt([math]::pow(($K1[0]-$seed.x),2)+[math]::pow(($K1[1]-$seed.y),2))
    $K2Result= [math]::Sqrt([math]::pow(($K2[0]-$seed.x),2)+[math]::pow(($K2[1]-$seed.y),2))
    $K3Result= [math]::Sqrt([math]::pow(($K3[0]-$seed.x),2)+[math]::pow(($K3[1]-$seed.y),2))
    Write-Host "K1 to  A$i distance is $K1Result"
    Write-Host "K2 to  A$i distance is $K2Result"
    Write-Host "K3 to  A$i distance is $K3Result"
    $i++
}
PowerShell

This script assumes you want to partition records to a set of 3 clusters (K1, K2, K3). $K1, $K2, and $K3 are centroids of each cluster (group). You can adjust it according to your purpose.

The script loads records in “input.txt” file then calculates Euclidean Distance of each record. Each record in “input.txt” only has x and y value. Following is a sample of “input.txt“. You can copy it for testing.

x,y
2, 10
2, 5
8, 4
5, 8
7, 5
6, 4
1, 2
4, 9

Following is the result:

K1 to A1 distance is 1.9465096968677
K2 to A1 distance is 7.55968914704831
K3 to A1 distance is 6.51920240520265
K1 to A2 distance is 4.33461647669087
K2 to A2 distance is 5.04469027790607
K3 to A2 distance is 1.58113883008419
K1 to A3 distance is 6.61429512495474
K2 to A3 distance is 1.05304320899002
K3 to A3 distance is 6.51920240520265
K1 to A4 distance is 1.66400120192264
K2 to A4 distance is 4.17958131874474
K3 to A4 distance is 5.70087712549569
K1 to A5 distance is 5.20469979921993
K2 to A5 distance is 0.67
K3 to A5 distance is 5.70087712549569
K1 to A6 distance is 5.5162396612185
K2 to A6 distance is 1.05304320899002
K3 to A6 distance is 4.52769256906871
K1 to A7 distance is 7.49192231673554
K2 to A7 distance is 6.43652856748108
K3 to A7 distance is 1.58113883008419
K1 to A8 distance is 0.33
K2 to A8 distance is 5.55057654663009
K3 to A8 distance is 6.04152298679729