Recommendations in UDC (Unified Data Catalog) Powered by Apache Spark & Neo4J

By Deepak Chandramouli

Elevator Pitch

A unified view of an enterprise’s data assets is extremely powerful for solving data problems. But searching for and viewing relevant information can become an overwhelming experience. Recommending data assets by “connecting” [users, orgs, searches, queries, and more] makes the data catalog far more relevant.

Description

UDC (Unified Data Catalog), open sourced by PayPal, is an enterprise data catalog tool that brings datasets from all the various stores into one view. A dataset in UDC could be an RDBMS table, a Kafka topic, an Elasticsearch index, a MongoDB document, an HBase table, or even a REST API. UDC is for data engineers, analysts, scientists, security personnel, and also executives who want a summary of the entire data landscape. UDC provides a rich UI with advanced search capabilities over enterprise datasets, business and technical metadata, access controls, glossaries, etc.

In its mature state, UDC at PayPal holds millions of datasets from hundreds of stores of various kinds across the enterprise. This makes it a big challenge to provide relevant search results to users and to recommend relevant datasets to new users. This is where Neo4J comes in: connecting the various entities (nodes) such as datasets, users, organizations, locations, user queries, user searches, user accesses, and user-dataset ownership opens up a completely new view of the ecosystem that has not been seen before. This view of all the connected components, combined with the power of Apache Spark and graph algorithms to process the massive graph, helps draw useful recommendations for all the users in an organization. These recommendations in turn power the data catalog UI experience, and the scores computed for each dataset from its connectivity help assign weights, resulting in a more refined and relevant search experience.
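The idea of scoring datasets by connectivity and recommending through shared connections can be illustrated with a minimal sketch. This is not UDC's actual implementation: it is a pure-Python stand-in for what the talk does with Neo4J and Apache Spark, and every user name, dataset name, relationship type, and weight below is a hypothetical value chosen for illustration.

```python
from collections import defaultdict

# Hypothetical edges: (user, relationship, dataset). All names are illustrative.
EDGES = [
    ("alice", "QUERIED", "payments.txn"),
    ("alice", "OWNS", "risk.scores"),
    ("bob", "QUERIED", "payments.txn"),
    ("bob", "ACCESSED", "kafka.clicks"),
    ("carol", "QUERIED", "risk.scores"),
    ("carol", "ACCESSED", "kafka.clicks"),
    ("carol", "SEARCHED", "hbase.profiles"),
]

# Assumed weights per relationship type (not taken from UDC).
WEIGHTS = {"OWNS": 3.0, "QUERIED": 2.0, "ACCESSED": 1.5, "SEARCHED": 1.0}


def build_index(edges):
    """Map each user to {dataset: accumulated relationship weight}."""
    by_user = defaultdict(dict)
    for user, rel, dataset in edges:
        by_user[user][dataset] = by_user[user].get(dataset, 0.0) + WEIGHTS[rel]
    return by_user


def dataset_scores(by_user):
    """Score each dataset by the total weight of the edges pointing at it."""
    scores = defaultdict(float)
    for datasets in by_user.values():
        for dataset, weight in datasets.items():
            scores[dataset] += weight
    return dict(scores)


def recommend(by_user, user, top_n=3):
    """Recommend datasets used by users who share datasets with `user`."""
    mine = by_user.get(user, {})
    candidates = defaultdict(float)
    for other, theirs in by_user.items():
        if other == user:
            continue
        # Similarity: overlapping weight on the datasets both users touch.
        overlap = sum(min(mine[d], theirs[d]) for d in mine if d in theirs)
        if overlap == 0:
            continue
        for dataset, weight in theirs.items():
            if dataset not in mine:
                candidates[dataset] += overlap * weight
    if not candidates:
        # Cold start (e.g. a new user): fall back to the best-connected datasets.
        candidates = {
            d: s for d, s in dataset_scores(by_user).items() if d not in mine
        }
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


index = build_index(EDGES)
print(dataset_scores(index))
print(recommend(index, "alice"))
print(recommend(index, "dave"))  # "dave" has no edges yet
```

In this toy graph, a user with history gets datasets that similar users touch, while a user with no edges falls back to the globally best-connected datasets. At the scale described above, the same pattern would be expressed as graph queries and distributed jobs rather than in-memory dictionaries.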

Notes

The talk will have 4 parts:
- Part 1 introduces the architecture of the entire stack, including the online and offline models of data processing. [10 Minutes]
- Part 2 covers a demo of the product and how it works. [10 Minutes]
- Part 3 covers the code: Apache Spark, Neo4J Cypher queries, and the graph algorithms used in the application. [10 Minutes]
- Part 4 is for questions. [5 Minutes]

Note: Neo4J Community Edition is currently in the exploration phase at PayPal.

Speakers:
- Deepak Chandramouli is the PayPal Engineering Manager for two products: UDC (Unified Data Catalog) and Gimel (an Apache Spark based data abstraction layer). UDC is in the process of being spun off as a separate open source project/product. Currently, the code lives in Gimel’s open source repo (gimel.io). Deepak has been with UDC since its incubation as a config system (backed in) for UDC back in 2017. Deepak is a proponent of Neo4J as the right choice of graph database for UDC’s requirements, and was involved in a POC that brought all the data into Neo4J and connected the disparate details.
- Harsh Bhimani is one of the core software engineers of UDC. Harsh focuses mainly on building the store discovery and recommendation processes for UDC. He has been mentoring an intern at PayPal whose tenure focused on the MVP delivery of the graph (Neo4J) integration with UDC.