Description
Our capstone project seeks to find groupings of similar patents using two data sets covering 5 million previously filed patents. The first data set records which earlier patents each patent cited, and the second contains the text of each patent. Our solution converts these two data sets into a “samples” and “features” format that can be used by a machine learning technique known as clustering, which takes data items in that format and groups similar items together. Clustering is widely used, and our literature review turned up many approaches, including spectral clustering, K-Means, and hierarchical clustering. We measured the accuracy of each approach by checking how consistent its groupings were with a third data set, a text file containing pairs of patents that blocked each other from being filed. The best approach we found was a modified version of standard K-Means. Because of issues with the citation data set, as well as time and server memory constraints, we achieved some success but did not reach the desired accuracy.
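The pipeline described above can be illustrated with a minimal sketch. It is not the project's code: it uses plain, unmodified K-Means rather than our modified version, the four toy patents and their IDs are invented for illustration, and the function names (`build_features`, `to_matrix`, `k_means`, `pair_consistency`) and the `cites:`/`word:` feature names are hypothetical. It assumes the citation data can be read as a mapping from patent ID to cited IDs, the text data as a mapping from patent ID to text, and the third data set as a list of ID pairs.

```python
from collections import Counter


def build_features(citations, texts):
    """Combine citations and text into one sparse feature row per patent.

    Hypothetical feature names: "cites:<id>" marks a citation and
    "word:<w>" counts a term in the patent text.
    """
    features = {}
    for pid in sorted(set(citations) | set(texts)):
        row = Counter()
        for cited in citations.get(pid, []):
            row["cites:" + cited] += 1
        for word in texts.get(pid, "").lower().split():
            row["word:" + word] += 1
        features[pid] = row
    return features


def to_matrix(features):
    """Lay the rows out as samples (patents) by features (fixed column order)."""
    columns = sorted({name for row in features.values() for name in row})
    ids = list(features)
    matrix = [[float(features[pid][col]) for col in columns] for pid in ids]
    return ids, matrix


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(matrix, k, iterations=20):
    """Standard Lloyd's K-Means; the first k rows seed the centroids (assumption)."""
    centroids = [row[:] for row in matrix[:k]]
    labels = [0] * len(matrix)
    for _ in range(iterations):
        # Assignment step: each sample joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in matrix
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [row for row, label in zip(matrix, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def pair_consistency(ids, labels, blocking_pairs):
    """Fraction of blocking pairs whose two patents share a cluster."""
    if not blocking_pairs:
        return 0.0
    cluster_of = dict(zip(ids, labels))
    same = sum(1 for a, b in blocking_pairs if cluster_of[a] == cluster_of[b])
    return same / len(blocking_pairs)


# Toy stand-ins for the three data sets (invented for illustration).
citations = {"P1": ["P0"], "P2": ["P0"], "P3": ["P9"], "P4": ["P9"]}
texts = {
    "P1": "engine piston valve",
    "P2": "engine piston fuel",
    "P3": "antenna signal radio",
    "P4": "antenna signal wireless",
}
blocking_pairs = [("P1", "P2"), ("P3", "P4")]

ids, matrix = to_matrix(build_features(citations, texts))
labels = k_means(matrix, k=2)
score = pair_consistency(ids, labels, blocking_pairs)
print(dict(zip(ids, labels)), score)
```

A dense matrix like this one would not fit in memory for 5 million patents, so a real run would presumably need a sparse representation and a library implementation of the clustering step.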