Machine learning Interactive Decision Tree Classifier

Can anyone recommend a decision tree classifier implementation, in either Python or Java, that can be used incrementally? All the implementations I've found require you to provide all the features to the classifier at once in order to get a classification. However, in my application, I have hundreds of features, and some of the features are functions that can take a long time to evaluate. Since not all branches of the tree may use all the features, it doesn't make sense to give the classifier a
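A minimal sketch of the lazy evaluation being asked for, assuming the tree is already trained and each feature is supplied as a zero-argument callable. `Node`, `classify`, and the feature names are hypothetical and do not come from any library; the point is only that a feature function runs when a visited node actually tests it.

```python
class Node:
    """A binary decision-tree node; leaves carry a label, internal nodes a test."""

    def __init__(self, feature=None, threshold=None, left=None, right=None, label=None):
        self.feature = feature      # name of the feature tested at this node
        self.threshold = threshold  # go left when value <= threshold
        self.left = left
        self.right = right
        self.label = label          # set only on leaf nodes


def classify(node, feature_fns, cache=None):
    """Walk the tree, calling a feature function only when a node needs it."""
    cache = {} if cache is None else cache
    while node.label is None:
        name = node.feature
        if name not in cache:
            cache[name] = feature_fns[name]()  # the expensive call happens here
        node = node.left if cache[name] <= node.threshold else node.right
    return node.label


calls = []  # records which feature functions were actually evaluated


def cheap():
    calls.append("cheap")
    return 0.2


def slow():
    calls.append("slow")
    return 7.0


tree = Node(
    feature="cheap", threshold=0.5,
    left=Node(label="A"),
    right=Node(feature="slow", threshold=3.0,
               left=Node(label="B"), right=Node(label="C")),
)

label = classify(tree, {"cheap": cheap, "slow": slow})
print(label, calls)
```

In this example the path ends at a leaf before the `slow` feature is ever needed, so it is never computed.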

Machine learning RMSE in Naive Bayes Classifier

I have a very basic question about calculating RMSE in an NB classification scenario. My training data X has some 1000-odd reviews with ratings in [1,5], which are the class labels Y. So what I am doing is something like this: model = nb_classifier_train(trainingX,Y) Yhat = nb_classifier_test(model,testingX) My testing data has some 400-odd reviews with missing ratings (whose labels/ratings I need to predict). Now, to calculate RMSE: RMSE = sqrt(mean((Y - Yhat).^2)) What is the Y in this scen
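For illustration, RMSE compares predictions against known true labels, so it can only be computed on reviews whose ratings are available. A minimal sketch, assuming a labelled hold-out subset is split off from the training reviews; the ratings below are made-up example values, not data from the question.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted ratings."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical hold-out reviews with known ratings and the model's predictions.
y_holdout = [5, 3, 1, 4]
yhat_holdout = [4, 3, 2, 4]

score = rmse(y_holdout, yhat_holdout)
print(round(score, 4))
```

For the 400 reviews with missing ratings there is no Y to compare against, so the hold-out score serves as the estimate of how well the predictions will do.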

Machine learning Bayes Net classifier for a large set

So here's another difficult question. I'm looking for the equivalent of Weka's Bayes Net classifier. Notice that it is different from Naive Bayes. The problem with Weka is that it uses too much memory and so cannot handle large data sets. The replacement needs to handle a data set of a few million examples and work on Windows.

Machine learning Is there a machine learning algorithm which successfully learns the parity function?

The parity function takes a vector of n bits and outputs 1 if the sum is odd and 0 otherwise. This can be viewed as a classification task, where the n inputs are the features. Is there any machine learning algorithm which would be able to learn this function? Clearly random decision forests would not succeed, since any strict subset of features has no predictive power. Also, I believe no neural network of a fixed depth would succeed, since computing the parity function is not in the
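To make the setup concrete, here is a small sketch that builds the full parity dataset for n bits and checks the claim about single features: conditioning on any one bit leaves the label at a 50/50 split. The helper names are hypothetical and n = 4 is chosen only to keep the enumeration small.

```python
from itertools import product


def parity(bits):
    """Return 1 if the number of set bits is odd, else 0."""
    return sum(bits) % 2


def positive_rate_given(dataset, index, value):
    """Fraction of examples labelled 1 among those whose bit `index` equals `value`."""
    labels = [y for bits, y in dataset if bits[index] == value]
    return sum(labels) / len(labels)


n = 4
# Enumerate all 2**n bit vectors together with their parity label.
dataset = [(bits, parity(bits)) for bits in product((0, 1), repeat=n)]

# Knowing a single bit tells us nothing: the label is still 1 half of the time.
rates = [positive_rate_given(dataset, i, 1) for i in range(n)]
print(rates)
```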

Machine learning libsvm returning a trivial solution

I'm trying to use libsvm to classify data as seen in the following picture: You can see "by eye" there is a soft separation between blue and red, but some blue samples exist throughout the entire area that I would say "should be tagged red". I can't get libsvm to return a meaningful classification and keep getting the trivial one: all dots tagged blue. This happens with various kernels and parameter values. I think playing with the cost variable doesn't solve this, because there are 10-fold more b

Machine learning Machine Learning Algorithm selection

I am new to machine learning. My problem is to make a machine select a university for a student according to their location and area of interest, i.e. it should select the university in the same city as the student's address. I am confused about the choice of algorithm. Can I use the Perceptron algorithm for this task?

Machine learning The confidence level of each specific instance in WEKA?

I'm new to WEKA and machine learning in general. I have a test set with about 6500 instances. I have a model that has already been trained with a training set. Once I run the test set through the saved model, is there a way I can extract the confidence level of each specific instance? By confidence level, I mean a numerical value that expresses the probability that the classifier has classified a specific instance correctly. I want this confidence number for each instance in the file. Is there

Machine learning Weka GUI tool output in a Java snippet

I am using the Weka tool for SMO in machine learning. How could I generate the predictions in a Java snippet? I get the following output when I use the GUI tool with "buildLogisticModels" set to true. Which method in the Evaluation class generates the output? Please see below the sample where col1 - instance #; col2 = actual; col3 = predicted; col4 = error; col5 = prediction: 1 1:1 1:1 0.781 2 1:1 1:1 0.644 3 1:1 1:1 0.742 4

Machine learning File format for classification using SVM light

I am trying to build a classifier using SVM light which classifies a document into one of two classes. I have already trained and tested the classifier and a model file is saved to the disk. Now I want to use this model file to classify completely new documents. What should be the input file format for this? Could it be a plain text file (I don't think that would work) or could it be just a plain listing of the features present in the text file without any class label and feature weights (in that case

Machine learning How to find the relation between input parameters and an output parameter with machine learning?

I have 20 numeric input parameters (or more) and a single output parameter, and I have thousands of these data points. I need to find the relation between the input parameters and the output parameter. Some input parameters might not relate to the output parameter, or none of the input parameters might relate to it. I want some magic system that can statistically calculate the output parameter when I provide all the input parameters, and it would be much better if this system also provided a confidence rate with the output result.

Machine learning How to train neural network on large training set and small memory

I wrote my own neural net library with backpropagation using GPU computing. I want to make it universal, so that I don't have to check whether the training set fits in GPU memory. How do you train a neural net when the training set is too large to fit in GPU memory? I assume that it fits in the RAM of the host. Must I do the training iteration on the first piece, then deallocate it on the device and send the second piece to the device and train on that, and so on, and then sum up the gradient results? Is it n
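The chunk-and-accumulate idea described above can be sketched as follows. This is a plain-Python illustration on a one-parameter linear model with squared error, not GPU code; `grad_sum` and `chunked_gradient` are hypothetical names, and the slice taken in each iteration stands in for the piece that would be copied to the device.

```python
def grad_sum(w, xs, ys):
    """Sum over examples of d/dw (w*x - y)**2 for the model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys))


def chunked_gradient(w, xs, ys, chunk_size):
    """Mean gradient over the whole set, computed one chunk at a time."""
    total = 0.0
    for start in range(0, len(xs), chunk_size):
        # Only this slice needs to live in device memory at any one time.
        chunk_x = xs[start:start + chunk_size]
        chunk_y = ys[start:start + chunk_size]
        total += grad_sum(w, chunk_x, chunk_y)
    return total / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]
w = 1.5

full = grad_sum(w, xs, ys) / len(xs)
chunked = chunked_gradient(w, xs, ys, chunk_size=2)
print(full, chunked)
```

Because the loss is a mean over examples, summing the per-chunk gradients and dividing by the total count reproduces the full-batch gradient. A common alternative is to update the weights after every chunk, which turns this into mini-batch gradient descent.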

Machine learning enlarging a text corpus with classes

I have a text corpus of many sentences, with some named entities marked within it. For example, the sentence: what is the best restaurant in wichita texas? which is tagged as: what is the best restaurant in <location>? I want to expand this corpus, by taking or sampling all the sentences already in it, and replacing the named entities with other similar entities from the same types, e.g. replacing "wichita texas" with "new york", so the corpus will be bigger (more sentences) and more c

Machine learning Binary classification of dated documents with seasonal class variation

I have a collection of training documents with publication dates, where each document is labeled as belonging (or not) to some topic T. I want to train a model that will predict for a new document (with publication date) whether or not it belongs to T, where the publication date might be in the past or in the future. Assume that I have decomposed each training document's text into a set of features (e.g., TF-IDF of words or n-grams) suitable for analysis by an appropriate binary classification a

Machine learning Which Machine Learning technique is most valid in this scenario?

I am fairly new to Machine Learning and have recently been working on a new classification problem to which I'm giving the link below. Since cars interest me, I decided to go with a dataset that deals with the classification of cars based on several attributes. Now, I understand that there might be a number of ways to go about this particular case, but the real issue here is - Which particular algorithm might be most effective? I am consid

Machine learning Increase or decrease learning rate for adding neurons or weights?

I have a convolutional neural network of which I modified the architecture. I do not have time to retrain and perform a cross-validation (grid search over optimal parameters). I want to intuitively adjust the learning rate. Should I increase or decrease the learning rate of my RMS (SGD-based) optimiser if: I add more neurons to the fully connected layers? on a convolutional neural network, I remove a sub-sampling (average or max pooling) layer before the full connections, and I increase the

Machine learning Finetune a Caffe model in Torch

I want to finetune a neural net that has been pretrained. However, this model was made in Caffe and I would like to work in Torch. I have tried loadcaffe, but this does not seem focused on finetuning. Is there another tool that makes this possible? Or can the Caffe model be converted to a Torch net?

Machine learning tensorflow distributed training hybrid with multi-GPU methodology

After playing with the current distributed training implementation for a while, I think it views each GPU as a separate worker. However, it is common now to have 2~4 GPUs in one box. Isn't it better to adopt the single box multi-GPU methodology to compute average gradients in a single box first and then sync up across multiple nodes? This way it eases the I/O traffic a lot, which is always the bottleneck in data parallelism. I was told it's possible with the current implementation by having all GPU
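The two-level averaging proposed here can be illustrated with a small sketch. It uses scalar "gradients" and plain Python rather than any TensorFlow API, and `hierarchical_average` is a hypothetical name; it only shows that averaging inside each box first and then combining the boxes, weighted by GPU count, gives the same result as averaging over every GPU directly.

```python
def mean(values):
    return sum(values) / len(values)


def hierarchical_average(grads_per_box):
    """Average within each box first, then combine boxes weighted by GPU count."""
    local = [(mean(box), len(box)) for box in grads_per_box]  # one value per box
    total_gpus = sum(count for _, count in local)
    return sum(avg * count for avg, count in local) / total_gpus


# Two boxes with 2 and 3 GPUs; each number is one GPU's gradient for a parameter.
grads_per_box = [[1.0, 3.0], [2.0, 4.0, 6.0]]

two_level = hierarchical_average(grads_per_box)
flat = mean([g for box in grads_per_box for g in box])
print(two_level, flat)
```

Only one averaged value per box has to cross the network in the two-level scheme, which is where the reduction in I/O traffic would come from.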

Machine learning better results from simple linear regression than multivariate/multiple reg

I have an existing model that predicts house prices using simple linear regression. The input is the date and the output is the price. I wanted to improve the overall results, so I added one more feature: the distance from the estimated property. The problem is that the multiple/multivariate regression performs a bit worse than the simple regression. (All the data are normalised.) Do you have any idea why this is happening and how I can approach it?

Machine learning What is the most optimal approach for training Inception v3 on a multi-label dataset like CIFAR-100?

I am trying to train Inception v3 on CIFAR-100 dataset, where each image belongs to a class and a superclass as indicated by its "fine" label and "coarse" labels. My goal is to get the model to infer the image class and with greater accuracy its superclass. For example, predict with 95% certainty that a photo of an apple contains "fruit and vegetables" and with 80% certainty that it contains "apples". What would be the most optimal approach assuming I already have generated a TFRecords file wit

Machine learning What machine learning paradigm / algorithm can I use to select from a pool of possible choices?

I have a large question bank and students. The goal is to select questions for an exam for a student. Questions have various properties: Grade Level Subjects (could be multiple: fractions, word problems, addition) How other students did on this question (percent right, wrong, etc) Has the student seen this question before or those like it? So I want to choose questions for a student based on how the student is doing. My feedback for whether or not it's a "good" exam is the following: Huma

Machine learning Backpropagation (through convolutional layer) and gradients in CNN

I am learning about using Convolutional Neural Networks and I went on to write my own framework for those. I am stuck at the part where I have to backpropagate errors (deltas) through the network and calculate gradients. I am aware that the filters in CNNs are 3D, so we have width, height and depth of some filter. Feed forward is fine. Let's look at the formula for calculating the output of some layer in feed forward step: The depth of the filter in layer l should be the same as the number of

Machine learning Should I use the same training set from one epoch to another (convolutional neural network)

From what I know about convolutional neural networks, you must feed the same training examples each epoch, but shuffled (so the network won't remember some particular order while training). However, in this article, they're feeding the network 64000 random samples each epoch (so only some of the training examples were "seen" before): Each training instance was a uniformly sampled set of 3 images, 2 of which are of the same class (x and x+), and the third (x−) of a different class. Each

Machine learning Can a dataframe processed with StandardScaler contain values >1 or <-1?

i scale my feature dataframe as follows: flattened_num_f.head() num_features_test = flattened_num_f.fillna(flattened_num_f.mean()) from sklearn.preprocessing import StandardScaler std_scaler = StandardScaler() num_train_std = pd.DataFrame(std_scaler.fit_transform(num_features_test.loc[y_train_IDs]), \ columns=num_features_test.loc[y_train_IDs].columns, \ index=num_features_test.loc[y_train_IDs].index) test_for_stdness(num_train_std) the la
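A quick way to see what to expect: standardization rescales each column to zero mean and unit standard deviation, which does not bound the values to [-1, 1]. The sketch below reproduces the z-score by hand with the population standard deviation, which is what scikit-learn's StandardScaler uses; the column values are made up for illustration.

```python
import statistics


def standardize(column):
    """Z-score a column using the population standard deviation (ddof=0)."""
    mu = statistics.fmean(column)
    sigma = statistics.pstdev(column)
    return [(x - mu) / sigma for x in column]


# A hypothetical column with one large value.
scaled = standardize([1.0, 2.0, 3.0, 4.0, 100.0])
print([round(v, 3) for v in scaled])
```

The scaled column has mean 0 and standard deviation 1, yet the largest entry is close to 2, so values outside [-1, 1] are normal after this kind of scaling.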

Machine learning generative adversarial network generating image with some random pixels

I am trying to generate images using Generative Adversarial Networks(GANs) on CelebA aligned data set with each image resized to 64*64 in .jpeg format. My network definition is like this def my_discriminator(input_var= None): net = lasagne.layers.InputLayer(shape= (None, 3,64,64), input_var = input_var) net = lasagne.layers.Conv2DLayer(net, 64, filter_size= (6,6 ),stride = 2,pad=2,W = lasagne.init.HeUniform(), nonlinearity= lasagne.nonlinearities.LeakyRectify(0.2))#64*32*32 net =

Machine learning Machine Learning - Derive information from a text

I'm a newbie in the field of Machine Learning and Supervised learning. My task is the following: from the name of a movie file on a disk, I'd like to retrieve some metadata about the file. I have no control on how the file is named, but it has a title and one or more additional info, like a release year, a resolution, actor names and so on. Currently I have developed a rule heuristic-based system, where I split the name into tokens and try to understand what each word could represent, either a

Machine learning Normalizing Input for Machine Learning Algorithm

I would like to normalize (z-score, minmax etc.) my predictor variables for a number of Machine Learning algorithms (Neural Network) and a Log Regression and I am wondering: 1) Should I normalize the entire predictor variables, that is training AND Test data? 2) Should normalize my predicted variables, y?
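On the first point, the usual recommendation is to compute the scaling parameters on the training data only and then apply those same parameters to the test data. A minimal min-max sketch under that assumption follows; `fit_minmax` and `apply_minmax` are hypothetical helper names and the numbers are made up.

```python
def fit_minmax(train_column):
    """Learn the scaling parameters from the training data only."""
    return min(train_column), max(train_column)


def apply_minmax(column, lo, hi):
    """Scale a column with parameters learned elsewhere (here: on the training set)."""
    return [(x - lo) / (hi - lo) for x in column]


train = [10.0, 20.0, 30.0]
test = [15.0, 40.0]

lo, hi = fit_minmax(train)
train_scaled = apply_minmax(train, lo, hi)
test_scaled = apply_minmax(test, lo, hi)  # reuses the training min and max
print(train_scaled, test_scaled)
```

A test value larger than anything seen in training ends up above 1, which is expected: the test set is treated as unseen data and does not influence the scaling.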

Machine learning GlobalAveragePooling with Masking when mask is not equal to zero

We are implementing a GlobalAveragePooling layer on top of a masking layer, which causes a Not Supported error. We saw this solution, but unfortunately it doesn't fit our situation. We use a custom embedding algorithm that causes some samples to be all zeros, so we cannot mask with zero vectors, e.g. self.model.add(Masking(mask_value=-9999., input_shape=(max_length, nr_in))) And some layers afterwards: avged = GlobalAveragePooling1D()(result, mask=max_len) maxed = GlobalMaxPooling1D()(result, mask=max_le

Machine learning Can we fit a model using the same dataset applied during cross validation process?

I have the following method that performs Cross Validation on a dataset followed by a final model fit: import numpy as np import utilities.utils as utils from sklearn.model_selection import cross_val_score from sklearn.neural_network import MLPClassifier from sklearn.metrics import accuracy_score from sklearn.model_selection import train_test_split import pandas as pd from sklearn.utils import shuffle def CV(args, path): df = pd.read_csv(path + 'HIGGS.csv', sep=',') df = shuffle(df)

Machine learning Concepts about data distributions

I'm a little confused here. Is there a connection between data distribution and novelty detection? I mean, can the data distribution differ between novelty, noise, and outliers, so that it can be used to detect them? Another point that needs answering: "training data and test data are drawn from the same distribution or the same feature space", so when exactly does the data distribution change? And when the data distribution changes, which set am I supposed to focus on? Where and when can this happen?

Machine learning Data Science Case Study

Please help me find the right answer to the question below, which was asked in an interview. There is a bank. A number of users visit the bank for different services, but most of the users give a bad rating and leave unsatisfied. What should the bank do to identify the reasons for the bad ratings? The bank captures data such as user info, info on the agent who deals with the users, the services offered, and a number of other things. How to identify rules or reasons which are playing an important role in bad rating using mac

Machine learning How to squish a continuous cosine-theta score to a discrete (0/1) output?

I implemented a cosine-theta function, which calculates the relation between two articles. If two articles are very similar then the words should contain quite some overlap. However, a cosine theta score of 0.54 does not mean "related" or "not related". I should end up with a definitive answer which is either 0 for 'not related' or 1 for 'related'. I know that there are sigmoid and softmax functions, yet I should find the optimal parameters to give to such functions and I do not know if these
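One simple option, assuming a handful of article pairs have been labelled by hand as related (1) or not related (0), is to pick the cut-off on the cosine score that classifies those labelled pairs best. The sketch below is a hypothetical illustration; the scores and labels are invented and `best_threshold` is not a library function.

```python
def best_threshold(scores, labels):
    """Try each observed score as a cut-off and keep the most accurate one."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(scores)):
        preds = [1 if s >= t else 0 for s in scores]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:  # ties keep the lowest threshold found first
            best_t, best_acc = t, acc
    return best_t, best_acc


# Hypothetical cosine scores for labelled article pairs.
scores = [0.10, 0.35, 0.54, 0.62, 0.80]
labels = [0, 0, 1, 1, 1]

threshold, accuracy = best_threshold(scores, labels)
is_related = 1 if 0.58 >= threshold else 0
print(threshold, accuracy, is_related)
```

With more labelled pairs, the threshold would normally be chosen on one subset and checked on another, so that it is not tuned to the same pairs it is evaluated on.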

Machine learning How to specify the input_shape(input_dim) in the Keras Sequential model dynamically while using pipeline?

I create a Keras Sequential model by passing a list of layer instances to the constructor. For that, I need to pass an input_shape argument to the first layer to create_model() function. Generally, I can get a shape tuple like this: input_shape=(len(X_train.keys()),) Meanwhile, I am using the pipeline to take care of my preprocessing steps such as Imputing, Scaling, Encoding, Feature Selection, and etc. As a result, the number of variables/features after preprocessing is not the same as befor

Machine learning Neural Network not recognizing basic input pattern

I've built a recurrent neural network to predict time series data. I seemed to be getting reasonable results, it would start at 33% accuracy for three classes as expected and would get better up to a point, also as expected. I wanted to test the network just to make sure it's actually working so I created a basic input / output as follows: in1 in2 in3 out 1 2 3 0 5 6 7 1 9 10 11 2 1 2 3 0 5 6 7 1 9 10 11 2 I copied this pattern

Machine learning My image segmentation Model gives very high accuracy on train and validation but outputs blank masks

I used Dice Loss and binary_crossentropy. Whenever I train my model, it shows very high train and validation accuracy but always outputs blank images. My masks are black and white binary images where 0 corresponds to black and 1 corresponds to white. In my output image, almost all pixels have value 0. Please tell me where I am going wrong. def train_generator(): while True: for start in range(0, len(os.listdir('/gdrive/My Drive/Train/img/images/')), 16): x_batch = np.emp

Machine learning How to increase accuracy of model using catboost

I am trying to build a model for binary classification using catboost for an employee salary dataset. I have tried my utmost at tuning, but I am still getting only 87% accuracy. How can I increase it to ~98% or more? The goal is to predict the class. Here is the dataset and code: Dataset: Code: from catboost import CatBoostClassifier import pandas as pd import numpy as np from numpy import arange from tqdm import tqdm_notebook as tqdm

Machine learning training fasttext models with social generated content

I am currently learning about text classification using Facebook's fastText. I have found some data from Kaggle that contains special characters as well as Twitter usernames and hashtags. I tried searching the web, but there is no clear explanation of how you really need to clean/pre-process your text before training a model. In some blogs I've seen authors writing about tokenisation; however, it's not mentioned in fastText. Another point is that the fastText git repository has examples of clean data, such as stackoverf

Machine learning How K-Fold Prevents overfitting in a model

I am training a multi-layer perceptron and have two questions. The first is: how can K-fold prevent overfitting? A train-test split does the same thing, training on one part and validating the model on the other; K-fold is the same except that there are multiple folds. There is a chance of overfitting with train_test_split, so how does K-fold prevent it? As I see it, the model could also overfit to the training part of each fold. What do you think? The second question is that I am getting
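For reference, here is a sketch of what K-fold actually does with the data, written without any library so the mechanics are visible. `k_fold_indices` is a hypothetical helper. As far as I understand, K-fold does not stop a model from overfitting its training part; what it adds over a single split is that every sample is used for validation exactly once, so the averaged score depends less on one lucky or unlucky split.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print(len(train), validation)
```

A model would be trained from scratch on each `train` part and scored on the matching `validation` part; a large gap between training and validation scores across folds is the sign of overfitting.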

Machine learning Catboost - prediction intervals

I'm trying to predict a price and for each prediction I need to have a range in which that price should be or get the percentage of how confident that prediction is. I've tried to do that with Quantile Regression as a loss function ( as an algorithm I'm using catboost), but the ranges were really wide. Is there any other way to solve that?
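For context, the quantile objective mentioned above is the pinball loss, and an interval comes from fitting one model for a low quantile and one for a high quantile. The sketch below only illustrates the loss and an empirical coverage check in plain Python; it does not call CatBoost, the function names are hypothetical, and the prices are made up.

```python
def pinball_loss(y_true, y_pred, alpha):
    """Mean quantile (pinball) loss for the alpha-quantile, 0 < alpha < 1."""
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # Under-prediction costs alpha per unit, over-prediction (1 - alpha).
        total += alpha * diff if diff >= 0 else (alpha - 1) * diff
    return total / len(y_true)


def coverage(y_true, lower, upper):
    """Fraction of true values that fall inside [lower, upper]."""
    inside = sum(lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper))
    return inside / len(y_true)


under = pinball_loss([10.0], [8.0], alpha=0.9)   # prediction too low
over = pinball_loss([10.0], [12.0], alpha=0.9)   # prediction too high

prices = [100.0, 120.0, 90.0, 150.0]
lower = [95.0, 100.0, 92.0, 130.0]
upper = [110.0, 125.0, 99.0, 160.0]
print(under, over, coverage(prices, lower, upper))
```

Checking coverage on held-out data shows whether the interval is wider than the chosen quantiles require; if it covers far more than the nominal share, less extreme quantiles would give a narrower range.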

Machine learning Genetic Algorithm in Maximum Likelihood Estimation

I was trying to compute a maximum likelihood estimate, which is very complicated; even the log-likelihood seems tough for me to compute. Basically I need to estimate it for a parameter which is a vector. A resource suggested that I tune this parameter theta by Stochastic Gradient Descent, but it seems very tough for me to compute the gradient of the log-likelihood function. So I'm asking: is there any efficient Python library to compute the MLE by SGD? If not, then can we use a simple genetic Algorithm

Machine learning Using continuous values for hyperparameters in grid search (ANN)

I'm trying to tune hyperparameters in a neural network (regression problem), and I have a few questions: in which order should I use the automatic optimisation methods (grid, random, Bayesian, genetic, ...)? I started with grid search to get an idea of the learning. I know grid search gives an optimal result but is time-consuming; I don't have a problem with the time, so I want to try the best search space, but I only know how to choose discrete values for a hyperparameter and I don't know how to give
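Grid search needs a finite list of values, so continuous ranges are usually handled by sampling, as in random search. A minimal sketch, assuming two hypothetical hyperparameters (`learning_rate` searched on a log scale and `dropout` on a linear scale); the ranges are illustrative and not taken from the question.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_config(rng):
    """Draw one hyperparameter configuration from continuous ranges."""
    return {
        "learning_rate": sample_log_uniform(rng, 1e-5, 1e-1),
        "dropout": rng.uniform(0.0, 0.5),
    }


rng = random.Random(0)  # fixed seed so the search is reproducible
configs = [sample_config(rng) for _ in range(20)]
for config in configs[:3]:
    print(config)
```

Each sampled configuration would then be trained and scored like a grid point. Log-scale sampling is the usual choice for quantities such as the learning rate that span several orders of magnitude.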

Machine learning How to add training end callback to AllenNLP config file?

Currently training models using AllenNLP 1.2 and the commands api: allennlp train -f --include-package custom-exp /usr/training_config/mock_model_config.jsonnet -s test-mock-out I'm trying to execute a forward pass on a test dataset after training is completed. I know how to add an epoch_callback, but am not sure about the syntax for the end_callback. In my config.json, I have the following: { ... "trainer": { ... "epoch_callbacks": [{"type": 'log_metrics_to_wandb',
