Colt - a set of Open Source Libraries for High Performance Scientific and Technical Computing in Java
Colt provides a set of Open Source Libraries for High Performance Scientific and Technical Computing in Java.
Scientific and technical computing, as, for example, carried out at CERN, is characterized by demanding problem sizes and a need for high performance at a reasonably small memory footprint. Many perceive the Java language as unsuited for such work. However, recent trends in its evolution suggest that it may soon be a major player in performance-sensitive scientific and technical computing. For example, IBM Watson's Ninja project showed that Java can indeed perform BLAS matrix computations up to 90% as fast as optimized Fortran. The Java Grande Forum Numerics Working Group provides a focal point for information on numerical computing in Java. With the performance gap steadily closing, Java has recently found increased adoption in the field. The reasons include ease of use, cross-platform nature, built-in support for multi-threading, network-friendly APIs and a healthy pool of available developers. Still, these efforts are to a significant degree hindered by the lack of foundation toolkits of the kind that are broadly available and conveniently accessible in C and Fortran.

JAMA : A Java Matrix Package
JAMA is a basic linear algebra package for Java. It provides user-level classes for constructing and manipulating real, dense matrices. It is meant to provide sufficient functionality for routine problems, packaged in a way that is natural and understandable to non-experts. It is intended to serve as the standard matrix class for Java, and will be proposed as such to the Java Grande Forum and then to Sun. A straightforward public-domain reference implementation has been developed by the MathWorks and NIST as a strawman for such a class. We are releasing this version in order to obtain public comment. There is no guarantee that future versions of JAMA will be compatible with this one.

A sibling matrix package, Jampack, has also been developed at NIST and the University of Maryland.
The two packages arose from the need to evaluate alternate designs for the implementation of matrices in Java. JAMA is based on a single matrix class within a strictly object-oriented framework. Jampack uses a more open approach that lends itself to extension by the user. As it turns out, for the casual user the packages differ principally in the syntax of the matrix operations. We hope you will take the time to look at Jampack along with JAMA. There is much to be learned from both packages.

Capabilities. JAMA comprises six Java classes: Matrix, CholeskyDecomposition, LUDecomposition, QRDecomposition, SingularValueDecomposition and EigenvalueDecomposition. The Matrix class provides the fundamental operations of numerical linear algebra. Various constructors create Matrices from two-dimensional arrays of double-precision floating-point numbers. Various gets and sets provide access to submatrices and matrix elements. The basic arithmetic operations include matrix addition and multiplication, matrix norms and selected element-by-element array operations. A convenient matrix print method is also included.

Five fundamental matrix decompositions, which consist of pairs or triples of matrices, permutation vectors, and the like, produce results in five decomposition classes. These decompositions are accessed by the Matrix class to compute solutions of simultaneous linear equations, determinants, inverses and other matrix functions. The five decompositions are the Cholesky, LU, QR, singular value and eigenvalue decompositions.
The design of JAMA represents a compromise between the need for pure and elegant object-oriented design and the need to enable high performance implementations.
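As an illustration of how a decomposition can be turned into a matrix function such as a determinant, here is a minimal, self-contained sketch in plain Java. It does not use or reproduce JAMA's CholeskyDecomposition class: the class name CholeskySketch and its two methods are hypothetical, chosen only for this example, and the input is assumed to be symmetric.

```java
public class CholeskySketch {

    /**
     * Computes the lower-triangular factor L with A = L * L^T.
     * Assumes A is symmetric; returns null if A is not positive definite.
     */
    public static double[][] cholesky(double[][] a) {
        int n = a.length;
        double[][] l = new double[n][n];
        for (int j = 0; j < n; j++) {
            // Diagonal entry: what remains of a[j][j] after removing earlier columns.
            double d = a[j][j];
            for (int k = 0; k < j; k++) {
                d -= l[j][k] * l[j][k];
            }
            if (d <= 0.0) {
                return null; // not positive definite
            }
            l[j][j] = Math.sqrt(d);
            // Entries below the diagonal in column j.
            for (int i = j + 1; i < n; i++) {
                double s = a[i][j];
                for (int k = 0; k < j; k++) {
                    s -= l[i][k] * l[j][k];
                }
                l[i][j] = s / l[j][j];
            }
        }
        return l;
    }

    /** det(A) = det(L)^2, i.e. the product of the squared diagonal entries of L. */
    public static double determinant(double[][] l) {
        double det = 1.0;
        for (int i = 0; i < l.length; i++) {
            det *= l[i][i] * l[i][i];
        }
        return det;
    }

    public static void main(String[] args) {
        double[][] a = {{4., 2.}, {2., 3.}}; // small symmetric positive definite example
        double[][] l = cholesky(a);
        System.out.println("det(A) = " + determinant(l));
    }
}
```

The point of the sketch is the division of labour described above: the factorization is computed once, and quantities such as the determinant are then read off the factors cheaply.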
Example of Use. The following simple example solves a 3x3 linear system Ax=b and computes the norm of the residual.

    double[][] array = {{1.,2.,3},{4.,5.,6.},{7.,8.,10.}};
    Matrix A = new Matrix(array);
    Matrix b = Matrix.random(3,1);
    Matrix x = A.solve(b);
    Matrix Residual = A.times(x).minus(b);
    double rnorm = Residual.normInf();

Reference Implementation. The implementation of JAMA downloadable from this site is meant to be a reference implementation only. As such, it is pedagogical in nature. The algorithms employed are similar to those of the classic Wilkinson and Reinsch Handbook, i.e. the same algorithms used in EISPACK, LINPACK and MATLAB. Matrices are stored internally as native Java arrays (i.e., double[][]). The coding style is straightforward and readable. While the reference implementation itself should provide reasonable execution speed for small to moderate size applications, we fully expect software vendors and Java VMs to provide versions which are optimized for particular environments.

Not Covered. JAMA is by no means a complete linear algebra environment. For example, there are no provisions for matrices with particular structure (e.g., banded, sparse) or for more specialized decompositions (e.g. Schur, generalized eigenvalue). Complex matrices are not included. It is not our intention to ignore these important problems. We expect that some of these (e.g. complex) will be addressed in future versions. It is our intent that the design of JAMA not preclude extension to some of these additional areas.

Finally, JAMA is not a general-purpose array class. Instead, it focuses on the principal mathematical functionality required to do numerical linear algebra. As a result, there are no methods for array operations such as reshaping or applying elementary functions (e.g. sine, exp, log) elementwise. Such operations, while quite useful in many applications, are best collected into a separate array class.
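The computation described under Example of Use (solve Ax=b, then measure the residual in the infinity norm) can also be sketched without the JAMA library. The block below is a stand-alone illustration only: it uses textbook Gaussian elimination with partial pivoting on plain double[][] arrays, not JAMA's own code, and the class name SolveSketch, its method names and the fixed right-hand side are assumptions made for this example. The matrix is the 3x3 array from the text.

```java
public class SolveSketch {

    /** Solves A x = b by Gaussian elimination with partial pivoting (A must be square). */
    public static double[] solve(double[][] a, double[] b) {
        int n = b.length;
        // Work on an augmented copy [A | b] so the inputs are left untouched.
        double[][] m = new double[n][n + 1];
        for (int i = 0; i < n; i++) {
            System.arraycopy(a[i], 0, m[i], 0, n);
            m[i][n] = b[i];
        }
        for (int col = 0; col < n; col++) {
            // Pick the row with the largest entry in this column as pivot.
            int p = col;
            for (int r = col + 1; r < n; r++) {
                if (Math.abs(m[r][col]) > Math.abs(m[p][col])) {
                    p = r;
                }
            }
            if (m[p][col] == 0.0) {
                throw new ArithmeticException("Matrix is singular");
            }
            double[] tmp = m[col];
            m[col] = m[p];
            m[p] = tmp;
            // Eliminate the entries below the pivot.
            for (int r = col + 1; r < n; r++) {
                double f = m[r][col] / m[col][col];
                for (int c = col; c <= n; c++) {
                    m[r][c] -= f * m[col][c];
                }
            }
        }
        // Back substitution on the upper-triangular system.
        double[] x = new double[n];
        for (int i = n - 1; i >= 0; i--) {
            double s = m[i][n];
            for (int c = i + 1; c < n; c++) {
                s -= m[i][c] * x[c];
            }
            x[i] = s / m[i][i];
        }
        return x;
    }

    /** Infinity norm of the residual A x - b: the largest absolute component. */
    public static double residualNormInf(double[][] a, double[] x, double[] b) {
        double norm = 0.0;
        for (int i = 0; i < b.length; i++) {
            double r = -b[i];
            for (int j = 0; j < x.length; j++) {
                r += a[i][j] * x[j];
            }
            norm = Math.max(norm, Math.abs(r));
        }
        return norm;
    }

    public static void main(String[] args) {
        double[][] a = {{1., 2., 3.}, {4., 5., 6.}, {7., 8., 10.}};
        double[] b = {6., 15., 25.}; // assumed right-hand side; the text uses a random one
        double[] x = solve(a, b);
        System.out.println("x = " + java.util.Arrays.toString(x));
        System.out.println("residual norm = " + residualNormInf(a, x, b));
    }
}
```

A residual norm close to machine precision indicates that the computed x satisfies the system to within rounding error, which is the check the residual computation is meant to provide.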
The Package
Version 1.0.2
Previous version
Request for Comments
We plan to propose JAMA as the primary linear algebra package for Java. Such standardization will ensure wide availability, improving the portability and performance of Java applications with numeric components. Because of this we are interested in hearing any and all comments from potential users. While we are cognisant that JAMA will not be suitable for all users, we hope it will be useful for the majority of routine applications.

Discussion Group. A discussion group has been established for such comments. Comments and suggestions sent to jama@nist.gov will automatically be sent to the JAMA authors, as well as to all subscribers. To subscribe, send email to listproc@nist.gov containing the text subscribe jama your-name in the message body. A public archive of the discussion can be browsed. [Note: NIST will not use the email addresses provided for any purpose other than the maintenance of this discussion list. Participants may remove themselves at any time by sending an email message to listproc@nist.gov containing the text unsubscribe jama in the message body. See the NIST Privacy Policy.]
Authors
Joe Hicklin, Cleve Moler, Peter Webb ... from The MathWorks
Ronald F. Boisvert, Bruce Miller, Roldan Pozo, Karin Remington ... from NIST
Copyright Notice. This software is a cooperative product of The MathWorks and the National Institute of Standards and Technology (NIST) which has been released to the public domain. Neither The MathWorks nor NIST assumes any responsibility whatsoever for its use by other parties, and makes no guarantees, expressed or implied, about its quality, reliability, or any other characteristic.
Last change in this page : July 13, 2005. Comments welcome.
http://users.cs.cf.ac.uk/Ali.Shaikhali/faehim/index.htm
FAEHIM
The availability of Web Service standards (such as WSDL, SOAP), and their adoption by a number of communities, including the Grid community as part of the Web Services Resource Framework (WSRF) indicates that development of a data mining toolkit based on Web Services is likely to be useful to a significant user community. Providing data mining Web Services also enables these to be integrated with other third party services, allowing data mining algorithms to be embedded within existing applications.
The aim of the FAEHIM (Federated Analysis Environment for Heterogeneous Intelligent Mining) project is to present a data mining toolkit that makes use of Web Services composition, with the widely deployed Triana workflow environment.
Data is currently being collected and accumulated at a dramatic pace in a number of different scientific areas. This data accumulation can vary from the long-term archiving of the entire collection of raw data, to the persistent storage of summary statistics only. The type of data being analysed can also vary in content from text-based data streams to numeric data (and increasingly image-based data), managed in distributed file systems or structured databases. There is often a distinction made between machine learning algorithms/statistical analysis and data mining; the former is seen as the set of theories and computational methods needed to deal with a variety of different analysis problems, whereas the latter is seen as a means to encode such algorithms in a form that can be efficiently used in real world applications. Often data mining applications and toolkits contain a variety of machine learning algorithms that can be used alongside a number of other components, such as those needed to sample a data set, read input from and write output to data sources, and visualise the outcome of analysis algorithms in some meaningful way.
Visualisation is also often seen as a key component within many data mining applications, as the results of data mining applications/toolkits are often used by individuals not fully conversant with the details of the algorithm deployed for analysis. Further, users of results of data mining are generally domain experts (and not algorithm experts), and often some (albeit limited) support is needed to allow such a user to choose an algorithm. The basic problem addressed by the data mining process is one of mapping low-level data (which are typically too voluminous to understand) into other forms that might be more compact (for example, a short report), more abstract (for example, a descriptive approximation or model of the process that generated the data), or more useful (for example, a predictive model for estimating the value of future cases). At the core of the process is the application of specific data-mining methods for pattern discovery and extraction. This process is often structured as interactive and iterative stages within a discovery pipeline/workflow. At these different stages of the discovery pipeline, a user needs to access, integrate and analyse data from disparate sources, to use data patterns and models generated through intermediate stages, and feed those models to further stages in the pipeline. Consider, for instance, a breast-cancer data set acquired by a cancer research centre, where a physician carries out a series of experiments on breast cancer cases and records the results in a database. The data now needs to be analysed to discover knowledge of the possible causes (or trends) of breast cancer. One approach is to use a classification algorithm.
However, applying an appropriate classification algorithm requires some preliminary understanding of the approach used in that algorithm and, where the data set is large, requires the processing to be carried out on computational resources able to handle that volume of data.
The FAEHIM project presents a data mining toolkit that makes use of Web Services composition, with the widely deployed Triana workflow environment. Most of the Web Services are derived from the WEKA data mining library of algorithms, and contain approximately 75 different algorithms (primarily classifiers, clustering algorithms and association rules). Additional capability is provided to support attribute search and selection within a numeric data set, and 20 different approaches are provided to achieve this (such as a genetic search operator). Visualisation capability is provided by wrapping the GNUPlot software; additional capability is supported through the deployment of a Mathematica Web Service (developed using the MathLink software). Other visualisation routines include a decision tree and a cluster visualiser.
The following Web services are available for use:
Classification Web services
Clustering Web services
Plotting Web services
Data mining uncovers patterns in data using predictive techniques. These patterns play a critical role in decision making because they reveal areas for process improvement. Using data mining, organizations can increase the profitability of their interactions with customers, detect fraud, and improve risk management. The patterns uncovered using data mining help organizations make better and timelier decisions.
Most analysts separate data mining software into two groups: data mining tools and data mining applications. Data mining tools provide a number of techniques that can be applied to any business problem. Data mining applications, on the other hand, embed techniques inside an application customized to address a specific business problem. Regardless of whether we are aware of it, our daily lives are influenced by data mining applications. For example, almost every financial transaction is processed by a data mining application to detect fraud. Both data mining tools and data mining applications are valuable, however. Increasingly, organizations are using data mining tools and data mining applications together in an integrated environment for predictive analytics.
So what do data mining tools add? Data mining tools are used to ensure flexibility and the greatest accuracy possible. Essentially, data mining tools increase the effectiveness of data mining applications. Since no two organizations or data sets are alike, no single technique delivers the best results for everyone. Not only do data mining tools deliver in-depth techniques, but data mining tools also deliver flexibility to use combinations of techniques to improve predictive accuracy.
Because data mining tools are so flexible, a set of data mining guidelines and a data mining methodology have been developed to help guide the process. The Cross-Industry Standard Process for Data Mining (CRISP-DM) ensures your organization's results with data mining tools are timely and reliable. This methodology was created in conjunction with practitioners and vendors to supply data mining practitioners with checklists, guidelines, tasks, and objectives for every stage of the data mining process.