Machine learning is a subfield of artificial intelligence in which computer systems learn from data without being explicitly programmed. Companies like Google, Yelp, Facebook, and HubSpot use machine learning for various applications. For instance, Yelp uses machine learning to help its human employees categorize millions of photographs. Google’s famous neural network program DeepDream, the machine that “dreams” and produces psychedelic images, is an example of machine learning. Facebook uses machine learning in its Messenger service to filter out spam. In short, it is safe to say that machine learning is pretty hot right now. If you are into technology and are looking for some open source tools in an enterprise-level language like Java to get you started, read on.

1. Weka

Developed by the University of Waikato, New Zealand, the Waikato Environment for Knowledge Analysis (Weka) is an open source suite of machine learning software. While the original version of Weka was intended as a tool for the agricultural domain, the Java-based version (Weka 3) is intended for education, research, and data mining. It supports the following data mining tasks: data preprocessing, clustering, classification, regression, visualization, and feature selection.
Weka can access SQL databases via Java Database Connectivity (JDBC) and process the results returned by a database query. The YouTube channel WekaMOOC hosts a number of video playlists on data mining with Weka. You can download Weka 3 here; the download page also links to other resources like online courses.

2. Massive Online Analysis (MOA)

Also developed by the University of Waikato, New Zealand, MOA is free, open source software that allows you to build and run machine learning or data mining experiments on data streams. If your use case involves real-time data streams, you’re in luck, because that’s exactly what MOA was designed for! MOA’s simple UI and easy integration with Weka land it in our second spot. If you’re looking to download MOA, click here. MOA is particularly popular in the data mining field because of its community. Its extensive documentation is useful for beginners, or you can just watch this video to get started.
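
To get a feel for what an experiment on a data stream looks like, here is a minimal sketch in plain Java of test-then-train (prequential) evaluation, a common way to score a learner on a stream: each instance is first used to test the model and only then to train it. This sketch does not use the MOA API. The class PrequentialSketch, the toy MajorityClassLearner, and the sample labels are all made up for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: test-then-train evaluation of a toy learner on a label stream. */
final class PrequentialSketch {

    /** Hypothetical online learner: predicts the label it has seen most often so far. */
    static final class MajorityClassLearner {
        private final Map<String, Integer> counts = new HashMap<>();

        String predict() {
            String best = null;
            int bestCount = -1;
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                if (e.getValue() > bestCount) {
                    best = e.getKey();
                    bestCount = e.getValue();
                }
            }
            return best; // null until the first label has been seen
        }

        void train(String label) {
            counts.merge(label, 1, Integer::sum);
        }
    }

    /** Returns the fraction of instances predicted correctly before being learned from. */
    static double prequentialAccuracy(Iterable<String> labelStream) {
        MajorityClassLearner learner = new MajorityClassLearner();
        int seen = 0;
        int correct = 0;
        for (String label : labelStream) {
            if (label.equals(learner.predict())) {
                correct++;        // test first ...
            }
            learner.train(label); // ... then train on the same instance
            seen++;
        }
        return seen == 0 ? 0.0 : (double) correct / seen;
    }

    public static void main(String[] args) {
        // Assumed toy stream; a real stream would arrive one instance at a time.
        List<String> stream = List.of("spam", "spam", "ham", "spam");
        System.out.println(prequentialAccuracy(stream)); // prints 0.5
    }
}
```

The model is never scored on an instance it has already learned from, so the accuracy reflects how it would have performed live, and no separate held-out test set is needed.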

3. Environment for Developing KDD-Applications Supported by Index-Structures (ELKI)

ELKI was developed at the Ludwig Maximilian University of Munich, Germany, and is used for developing advanced data mining algorithms. The ELKI framework is written in Java and has a modular architecture, which makes it a good fit for researchers and students. The ELKI library offers a large collection of algorithms with configurable parameters, which is particularly useful for benchmarking. ELKI has uses in data science, spaceflight, and traffic prediction. The sky’s the limit with this open source software suite; download it from here and check out a few tutorials here.

4. RapidMiner

RapidMiner is a cross-platform data science platform that provides an integrated environment for data preparation, machine learning, deep learning, text mining, and predictive analytics. It has a great graphical user interface and a Java API for developing your own applications. RapidMiner is offered in free, small, medium, and large editions. The free edition is limited to 10,000 data rows and 1 logical processor, whereas the large edition, which costs $10,000 per user per year, gives you unlimited data rows and logical processors, plus features like RapidMiner’s Turbo Prep, which accelerates data preparation by visually blending and enriching data so that analytics teams can work with data faster.

5. Java Statistical Analysis Tool (JSAT)

JSAT is a popular library for getting into machine learning quickly. It was developed by Edward Raff and is completely open source; Raff humbly boasts that JSAT is usually faster than Weka! The best part about JSAT is that it’s pure Java: it has no external dependencies and all of its code is self-contained, so it’s worth a look.


6. Machine Learning for Language Toolkit (MALLET)

MALLET was developed by Andrew McCallum and students from UMass Amherst and UPenn. MALLET is a Java-based toolkit for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text. MALLET also includes a set of routines for transforming text into numerical representations that can be processed efficiently. MALLET is open source and can be downloaded here, and its documentation can be found here.
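
To illustrate what “transforming text into numerical representations” can mean, here is a minimal bag-of-words sketch in plain Java: it builds a vocabulary from a few documents and turns a new document into a vector of word counts. This is not MALLET code and does not reflect how MALLET is implemented. The class name BagOfWordsSketch, its methods, and the sample sentences are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: maps text to word-count vectors over a learned vocabulary. */
final class BagOfWordsSketch {

    /** Each distinct token gets a column index, in order of first appearance. */
    private final Map<String, Integer> vocabulary = new LinkedHashMap<>();

    /** Lowercases the text and splits it on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Builds the vocabulary from a collection of documents. */
    void fit(List<String> documents) {
        for (String doc : documents) {
            for (String token : tokenize(doc)) {
                vocabulary.putIfAbsent(token, vocabulary.size());
            }
        }
    }

    /** Counts known tokens in the document; tokens outside the vocabulary are ignored. */
    int[] transform(String document) {
        int[] counts = new int[vocabulary.size()];
        for (String token : tokenize(document)) {
            Integer index = vocabulary.get(token);
            if (index != null) {
                counts[index]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        BagOfWordsSketch bow = new BagOfWordsSketch();
        bow.fit(List.of("The cat sat", "the dog sat down"));
        // Vocabulary order: the, cat, sat, dog, down
        System.out.println(Arrays.toString(bow.transform("The dog and the cat")));
        // prints [2, 1, 0, 1, 0]
    }
}
```

Once text is in this form, any classifier or clustering algorithm that works on numeric vectors can consume it.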

7. Deeplearning4j

Deeplearning4j is a deep learning library for Java and the Java virtual machine (JVM). It serves as a framework for deep learning algorithms. Deeplearning4j is open source under the Apache License 2.0 and can be used for topic modeling and vector space modeling. Some of the most common applications for Deeplearning4j include cybersecurity, anomaly detection, recommendation systems for e-commerce sites, and image recognition. Deeplearning4j is an Eclipse Foundation project and has an active community where resources and support are easy to find.

8. Google BigQuery

Google BigQuery is Google’s cloud-based service for interactive analysis of large datasets, and it works in conjunction with Google Storage. Because BigQuery is a fully managed, serverless service, you don’t have to invest in setting up a data warehouse or in the staff needed to keep it running. Google bills BigQuery as “rocket fast”, fast enough to analyze terabytes of data in seconds and petabytes of data in minutes. The pricing is based on the number of bytes processed. For large queries, Google allocates the necessary compute resources for you, so you don’t have to worry about scaling. Google BigQuery encrypts and replicates your data across multiple data centers for maximum durability and service uptime.
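
Since pricing depends on bytes processed, a back-of-the-envelope estimate is simple arithmetic. The sketch below is plain Java, not part of any Google SDK. The billing unit (TiB) and the rate passed in are placeholder assumptions, not actual prices, so check the current BigQuery pricing page before relying on any number.

```java
/** Illustrative only: estimates the cost of a query billed by bytes processed. */
final class QueryCostSketch {

    private static final double BYTES_PER_TIB = 1024.0 * 1024.0 * 1024.0 * 1024.0;

    /**
     * @param bytesProcessed number of bytes the query scans
     * @param usdPerTiB      assumed price per TiB processed (placeholder, not a real rate)
     */
    static double estimateCostUsd(long bytesProcessed, double usdPerTiB) {
        if (bytesProcessed < 0 || usdPerTiB < 0) {
            throw new IllegalArgumentException("inputs must be non-negative");
        }
        return bytesProcessed / BYTES_PER_TIB * usdPerTiB;
    }

    public static void main(String[] args) {
        long halfTiB = 1L << 39;      // 0.5 TiB scanned
        double assumedRate = 4.0;     // hypothetical USD per TiB
        System.out.println(estimateCostUsd(halfTiB, assumedRate)); // prints 2.0
    }
}
```

Because the bill scales with the data scanned, selecting only the columns you need generally makes a query cheaper.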

9. Mahout

Mahout is a distributed linear algebra framework and mathematically expressive Scala DSL designed to let data scientists quickly implement their own algorithms. Mahout comes with Java libraries for common math operations and primitive Java collections. Its core algorithms include distributed linear algebra, preprocessors, regression, clustering, and recommenders. Support for MapReduce-based algorithms is gradually being phased out.