4 min read · from Data Science

I wrapped a random forest in a genetic algorithm for feature selection due to unidentifiable, group-based confounding variables. Is it bad? Is there better?

No tldr for this one, folks.

I had initially posted about my issue in another sub, but didn’t get much feedback. I then read up on genetic algorithms for feature selection, and decided to give it a shot. Let me acknowledge beforehand that there’s a serious processing cost problem.

I’m trying to create a classification model with clearly labeled data that has thousands of features. The data was obtained in a laboratory setting, and I’ll simplify the process and just say that the condition (label/class) was set and then data was taken once per minute for 100 minutes. Let’s say we had three conditions (C1, C2, C3), and went through the following rotation in the lab: C1, C2, C1, C3, C1, C2, C1, C3, C1. C1 was a control group. Glossary moment: I call each section of time dedicated to a condition an “implementation” of that condition.

After using exploratory data analysis (EDA) to eliminate some data points as well as all but 1000 features, I created a random forest model. The test set had nearly 100% accuracy. However, I've been burned before by data leakage and confounding variables, so I then performed leave-one-group-out (LOGO) validation: I removed each group (e.g., the first implementation of C1), created a model with the rest of the data, and used the removed group as a test set. The idea is that if I removed one implementation of a condition, training on its other implementation(s) should be enough to accurately classify it.
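The LOGO procedure maps directly onto scikit-learn's `LeaveOneGroupOut` splitter. This is a hedged sketch, assuming sklearn and stand-in random data (the real dataset had ~1000 post-EDA features and lab-measured values), not the OP's implementation:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut

def logo_accuracies(X, y, groups, n_estimators=100):
    """Hold out one implementation (group) at a time, train a random
    forest on the rest, and report accuracy on the held-out block."""
    results = {}
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        clf = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
        clf.fit(X[train_idx], y[train_idx])
        g = groups[test_idx][0]
        results[g] = clf.score(X[test_idx], y[test_idx])
    return results

# Stand-in data following the rotation: 9 blocks of 100 minutes each.
rng = np.random.default_rng(0)
X = rng.normal(size=(900, 20))
y = np.repeat(["C1", "C2", "C1", "C3", "C1", "C2", "C1", "C3", "C1"], 100)
groups = np.repeat(np.arange(9), 100)
scores = logo_accuracies(X, y, groups)  # one accuracy per implementation
```

With real data, a large gap between the random test-set accuracy and these per-group scores is the signature of group-level leakage the post describes.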

Results were bad. Most C1s achieved 70–100% accuracy, both C2s achieved 0%, and the two C3s achieved 10% and 40%. So even though, as far as I knew, each implementation of a condition was the same, they clearly weren't. Something was happening; I assume some sort of confounding variable based on the time of day or the process of changing the condition.

My belief is that the original model was accurate because it contained separate models for each implementation "under the hood". So one part of each decision tree handled the first implementation of C2, a separate part of the tree handled the second implementation of C2, but both ended in a vote for the C2 class, making it seem like the model could identify C2 anytime, anywhere.

I then hypothesized that while some of my thousand features were specific to the implementation, there might also be some features that were implementation-agnostic but condition-specific. The problem is that the features that were implementation-specific were also far more attractive to the random forest algorithm, and I had to find a way to ignore them.

I created a genetic algorithm where each chromosome was a binary array indicating whether each feature would be included in the random forest. The scoring had a brutal processing cost. For each implementation (so nine times) I would create a random forest (using only the features selected by the chromosome) with the remaining groups and use the held-out implementation as a test set. I would take the minimum accuracy for each condition (the minimum of the five C1 test results, of the two C2 results, and of the two C3 results) and feed those minima into NSGA-II for multi-objective optimization (which I admit I am still working on fully understanding).
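The fitness function described above can be sketched as follows. This is an illustrative reconstruction, not the OP's code: one chromosome (binary feature mask) is scored by running the full LOGO loop and keeping the worst accuracy per condition, and those per-condition minima are the objectives an NSGA-II ranking would then optimize:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def chromosome_fitness(mask, X, y, groups, n_estimators=50):
    """Score one GA chromosome (binary feature mask).

    For each held-out implementation, train a forest on the remaining
    groups using only the masked-in features, then record the *minimum*
    accuracy per condition across its implementations. Returns a dict of
    per-condition minima -- the multi-objective fitness vector.
    """
    cols = np.flatnonzero(mask)  # indices of features switched on
    worst = {}
    for g in np.unique(groups):
        train, test = groups != g, groups == g
        clf = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
        clf.fit(X[train][:, cols], y[train])
        acc = clf.score(X[test][:, cols], y[test])
        cond = y[test][0]  # condition of the held-out block
        worst[cond] = min(worst.get(cond, 1.0), acc)
    return worst
```

This also makes the cost visible: each chromosome evaluation trains nine forests, which is why a population evolved over many generations ran overnight.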

I’ve never had hyperparameters matter so much as when I was setting up the genetic algorithm. But it was *so* costly. I’d run it overnight just to get 30 generations done.

The results were interesting. Individually, C1s scored about 95%, C2s scored about 5%, and C3s scored about 60%. I then used the selected features to create a single random forest as I had done originally, and was disappointed to achieve nearly 100% accuracy again. *However*, when I performed my leave-one-group-out approach, I was pretty consistently getting 95% for C1, 0% for C2, and 60% for C3. So I was getting what the genetic algorithm said I'd be getting, *which was better and much more consistent than my original LOGO*, and I feel that is a more accurate description of how good my model is than the test set's confusion matrix.

For those who have made it this far, I pulled that genetic algorithm wrapper idea out of thin air. In hindsight, do you think it was interesting, clever, a waste of time, seriously flawed? Is there a better approach for dealing with unidentifiable, group-based, confounding variables?

submitted by /u/wex52

