“Record linkage” is a popular term among statisticians and epidemiologists for “the problem of matching/joining records from one data source to another which describe the same entity”. The problem has attracted attention since data collection gained momentum in the 1960s, and it continues to do so as new collection methods, formats, and stacks of data are added to the existing ones. Other popular terms for the same problem are deduplication, data matching, entity/name resolution, and record matching. Please refer to the paper at https://homes.cs.washington.edu/~pedrod/papers/icdm06.pdf for one of the good works in this field. One can also look at the Google Trends graph below for the attention this field has received from 2014 to the present.
The purpose of this blog is to show why record linkage needs scalable computing power. To that end, I present my observations with a simple example, shown below:
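To make the scaling concern concrete, here is a minimal Python sketch of naive record linkage, where every record in one source is compared against every record in the other. The record names, the `similarity` and `naive_link` helpers, and the 0.85 threshold are assumptions chosen for illustration; they are not taken from the example referred to above, and a real system would likely use a more careful matching rule.

```python
from difflib import SequenceMatcher
from itertools import product


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (illustrative choice)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def naive_link(source_a, source_b, threshold=0.85):
    """Compare every record in source_a with every record in source_b.

    Returns the matched pairs and the number of comparisons performed,
    which is always len(source_a) * len(source_b).
    """
    matches = []
    comparisons = 0
    for a, b in product(source_a, source_b):
        comparisons += 1
        if similarity(a, b) >= threshold:
            matches.append((a, b))
    return matches, comparisons


def candidate_pairs(n: int, m: int) -> int:
    """Number of pairwise comparisons for sources of size n and m."""
    return n * m


# Hypothetical toy data: two small sources describing overlapping people.
source_a = ["Jon Smith", "Maria Garcia", "Wei Chen"]
source_b = ["John Smith", "Maria Garcia", "Priya Patel", "Wei Chen"]

matches, comparisons = naive_link(source_a, source_b)
print(matches)
print(comparisons)  # 3 * 4 = 12 comparisons

# The same approach on two sources of one million records each:
print(candidate_pairs(10**6, 10**6))  # 10**12 comparisons
```

Twelve comparisons are trivial, but the count grows with the product of the two source sizes, so a pair of million-record sources already implies on the order of a trillion string comparisons under this naive scheme. That quadratic growth is one reason record linkage tends to call for scalable computing, or for techniques that prune the candidate pairs before comparing them.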
The views expressed here are drawn from the author's industry experience. He can be reached at mavuluri.pradeep@gmail or firstname.lastname@example.org for more details.
Find out more about the author at http://in.linkedin.com/in/pradeepmavuluri