Cluster size vs. number of words per cluster
Performance metrics for document clustering
The main goal of an IR system is to extract the required information from a large text corpus according to the user's query. Many conventional IR systems proposed by researchers apply standard data mining algorithms and draw their conclusions from them, and several of these are covered in the existing literature. Very few researchers, however, have carried out their work on Big Data systems, which provide the ability to process data at large scale. Most of the work that has been done on Big Data systems proposes novel map and reduce algorithms but leaves partitioning to the default partitioner managed by the architecture itself. In our proposed work, along with novel map and reduce algorithms, we use a custom partitioner named Residue Modulo K, where K is the number of reducers that the design invokes. Performance metrics were estimated for every clustered document, and maximum likelihood values were achieved in terms of significance. In addition, the size and the number of words obtained for every cluster are tabulated and compared, which has not been reported in past research findings.
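To illustrate the idea behind a residue-modulo-K partitioner, the following is a minimal sketch and not the implementation used in this work. It assumes that each intermediate key is a string and that it is first mapped to an integer with CRC32; the source does not state how keys are converted to integers, so that choice, along with the function names `residue_modulo_k` and `partition`, is hypothetical.

```python
import zlib
from collections import defaultdict


def residue_modulo_k(key: str, k: int) -> int:
    """Return a reducer index in [0, k) for the given key."""
    if k <= 0:
        raise ValueError("k (number of reducers) must be positive")
    # Assumption: CRC32 is used as a stable key-to-integer mapping.
    # Python's built-in hash() is randomized per process for strings,
    # so it would not give repeatable partition assignments.
    return zlib.crc32(key.encode("utf-8")) % k


def partition(pairs, k):
    """Group (key, value) pairs by the reducer index of their key."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[residue_modulo_k(key, k)].append((key, value))
    return dict(buckets)
```

Because the index depends only on the key and K, every pair with the same key reaches the same reducer, which is the property any partitioner in a map and reduce design has to guarantee.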