
Is data annotation, that "tough and thankless" work, quietly becoming a hot commodity? @OpenledgerHQ, which has raised over $11.2 million in a round led by Polychain, pairs a Proof of Attribution (PoA) mechanism with Infini-gram to attack a long-ignored pain point: how the value of data gets distributed. A quick technical explainer:

1) Honestly, the biggest "original sin" of today's AI industry is the unfair distribution of data value. What OpenLedger's PoA (Proof of Attribution) aims to build is a "copyright tracking system" for data contributions.

Concretely: contributors upload content to domain-specific DataNets, where each data point is permanently recorded along with contributor metadata and a content hash. When models trained on these datasets are used, attribution happens at inference time, the moment the model generates an output. PoA tracks which data points influenced that output by analyzing matching spans or influence scores, and determines each contributor's proportional influence. When the model earns revenue from inference, PoA distributes the proceeds according to that influence, creating a transparent, fair, on-chain reward mechanism (a toy sketch of this payout step follows point 2 below).

In other words, PoA addresses the fundamental contradiction of the data economy. The old logic was simple and brutal: AI companies obtain massive amounts of data for free, then earn huge profits by commercializing models, while data contributors get nothing. PoA achieves "data privatization" through technical means, so that each data point can generate a clear economic return. Once the incentive shifts from "free riding" to "paid by contribution" and runs smoothly, the logic of contributing data changes completely.

PoA also takes a layered approach to attribution across model sizes. For small models with millions of parameters, influence can be estimated with influence-function analysis, at a computational cost that is still barely manageable. For medium and large models, that method becomes computationally impractical. That is where Infini-gram comes in.

2) So what is Infini-gram? The problem it targets sounds extreme: precisely tracing the data source of every output token in a medium or large black-box model. Traditional attribution methods rely mainly on influence-function analysis, but they essentially break down at scale: as models grow, internal computations get more complex and analysis costs grow exponentially, which makes them a non-starter for commercial applications.

Infini-gram takes a completely different route: instead of peering into the model's internals, it matches directly against the original training data. It builds an index based on suffix arrays and uses dynamically selected longest matching suffixes rather than fixed-window n-grams. Put simply, when the model emits a sequence, Infini-gram finds, for each token's context, the longest exact match in the training data (sketched below). The reported performance is striking: on a 1.4-trillion-token corpus, queries take about 20 milliseconds and the index costs roughly 7 bytes per token. More importantly, it attributes precisely without inspecting model internals or running heavy computations.
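To make point 1 concrete, here is a minimal sketch of the payout step, assuming a simplified setup: a data point registered with contributor metadata and a content hash, and an inference fee split in proportion to per-point influence scores. The names (DataPoint, split_inference_fee) are my own illustrations, not OpenLedger's actual schema or API, and the influence scores are taken as given from whatever attribution step produced them.

```python
import hashlib
from dataclasses import dataclass

# Illustrative only: a data point registered in a DataNet with contributor metadata
# and a content hash, as described in point 1. Not OpenLedger's real schema.
@dataclass
class DataPoint:
    contributor: str  # contributor identity / on-chain address
    content: str      # the contributed content itself

    @property
    def content_hash(self) -> str:
        return hashlib.sha256(self.content.encode()).hexdigest()

def split_inference_fee(fee: float, influence: dict[str, float]) -> dict[str, float]:
    """Split one inference's revenue in proportion to per-data-point influence scores.

    `influence` maps content_hash -> non-negative score produced by the attribution
    step (influence functions for small models, Infini-gram matching for large ones).
    """
    total = sum(influence.values())
    if total == 0:
        return {}
    return {h: fee * score / total for h, score in influence.items()}

# Toy usage: two contributed data points influence one paid inference.
a = DataPoint("alice", "annotated example A")
b = DataPoint("bob", "annotated example B")
scores = {a.content_hash: 0.75, b.content_hash: 0.25}
print(split_inference_fee(fee=1.0, influence=scores))
# alice's data point receives 0.75 of the fee, bob's 0.25
```

In this framing, the hard engineering is producing the influence scores; the payout itself reduces to a proportional split that can be recorded on-chain.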
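And here is a toy illustration of the idea behind Infini-gram's lookup, assuming a small in-memory corpus: build a suffix array over the tokenized training data, then, for a generated token's context, back off to the longest suffix that still occurs verbatim in the corpus and return where it occurs. The function names are mine for illustration; the real system indexes trillions of tokens with a far more compact structure (the reported ~7 bytes per token and ~20 ms queries).

```python
from bisect import bisect_left, bisect_right  # key= requires Python 3.10+

def build_suffix_array(tokens: list[str]) -> list[int]:
    # Toy O(n^2 log n) construction; the real index stays around 7 bytes per token
    # even at trillion-token scale.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def find_positions(tokens: list[str], sa: list[int], query: list[str]) -> list[int]:
    """Corpus positions where `query` occurs as a contiguous token span, via binary search."""
    def key(i: int) -> list[str]:
        return tokens[i:i + len(query)]
    lo = bisect_left(sa, query, key=key)
    hi = bisect_right(sa, query, key=key)
    return sa[lo:hi]

def longest_matching_suffix(tokens: list[str], sa: list[int], context: list[str]):
    """Back off from the full context to the longest suffix that still appears
    verbatim in the training data, and return that suffix plus its match positions."""
    for start in range(len(context)):
        suffix = context[start:]
        hits = find_positions(tokens, sa, suffix)
        if hits:
            return suffix, hits
    return [], []

# Toy corpus standing in for a tokenized DataNet; match positions can be mapped back
# to the documents (and hence contributors) that cover them.
corpus = "the cat sat on the mat the cat ate fish".split()
sa = build_suffix_array(corpus)
print(longest_matching_suffix(corpus, sa, context="he saw the cat".split()))
# -> (['the', 'cat'], [6, 0]): "the cat" is the longest suffix of the context
#    found in the corpus, at token positions 6 and 0.
```

The point of the backoff is that, unlike fixed-window n-grams, the match length adapts to whatever the corpus actually contains, which is why queries stay cheap and never need to touch the model itself.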
For AI companies that treat their models as trade secrets, this is a tailor-made solution. Keep in mind that existing data attribution approaches are either inefficient, imprecise, or require access to model internals; Infini-gram strikes a balance across all three dimensions.

3) I also find OpenLedger's DataNets, its concept of on-chain datasets, particularly innovative. Unlike a traditional one-off data sale, DataNets let contributors keep earning a share of revenue every time the model runs inference. Data annotation used to be thankless work paid once, and poorly; now it becomes an asset with ongoing earnings, which is an entirely different incentive logic.

While most AI + Crypto projects chase more mature directions like compute leasing and model training, OpenLedger picked the hardest problem: data attribution. This technology stack could redefine the supply side of AI data. After all, in an era where data quality is king, whoever solves data value distribution will attract the highest-quality data.

In summary, OpenLedger's PoA + Infini-gram combination is not just a technical fix; more importantly, it offers the industry a new logic for distributing value. As the compute arms race cools and competition over data quality heats up, this kind of technical route will not stay unique for long. Expect multiple solutions to compete in parallel on this track: some optimizing attribution precision, others cost efficiency, others ease of use, each searching for the best answer to data value distribution. In the end, the winner will be whoever actually attracts enough data providers and developers.

From Twitter