I need some clarification. When classifying data in Python, the dataset is split before training into features and a label, and the label column (the class we want to predict) is removed from the features. For example, for the yeast data, where "class" is the label, I do it this way:
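import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
headers = ["name", "mcg", "gvh", "alm", "mit", "erl", "pox", "vac", "nuc", "class"]
df = pd.read_csv("yeast.data", header=None, names=headers, na_values="?")
X = np.array(df.drop(columns=['class']))  # features: everything except the label
y = np.array(df['class'])  # label column
knn = NearestNeighbors(n_neighbors=6, algorithm='ball_tree', metric='euclidean')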
knn.fit(X)
However, in R the label class also seems to be passed to the distance calculation. For the same dataset, the distance calculation in R goes like this:
df <- read.table(file="~/yeast.txt", header=TRUE, sep=",")
names(df) <- c("name", "mcg", "gvh", "alm", "mit", "erl", "pox", "vac", "nuc", "class")
dist <- distances("class", df, "Euclidean")
Here we are required to pass the label class as well. Could someone explain the reason? Am I doing something wrong?
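To make the Python side concrete, here is a minimal sketch of what I understand the scikit-learn step to be doing. It uses a made-up four-row toy dataset (the values and the labels "A"/"B" are my own assumptions, not taken from the yeast file): the distances are computed from the feature columns only, and the labels are only looked up afterwards through the neighbor indices.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy data, not from yeast.data: 4 samples, 2 numeric features.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
# Labels are kept in a separate array and never passed to fit().
y = np.array(["A", "A", "B", "B"])

knn = NearestNeighbors(n_neighbors=2, algorithm="ball_tree", metric="euclidean")
knn.fit(X)  # only the features are used to build the index

# Querying with the training points themselves: the first neighbor of each
# point is the point itself (distance 0), the second is its closest other point.
distances, indices = knn.kneighbors(X)

# Labels are attached after the distance computation, via the indices.
neighbor_labels = y[indices]
print(distances)
print(neighbor_labels)
```

So in this sketch the label never enters the Euclidean distance at all, which is why the R call that takes "class" as an argument confuses me.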