Viki.com is a global online video service that lets millions of people watch their favorite TV shows, movies, sporting events, and more.
What really sets us apart is our international viewership: our content is available in over 150 languages, all translated by our user community.
The key to Viki’s rapid growth? Exhaustive data analytics. We collect all kinds of data and analyze it to guide our product decisions. As our service grew in scale, Treasure Data helped solve our growing pains.
Focus on Service Development, not Hadoop Management
We started out by rolling up our sleeves and building an in-house Hadoop cluster. Setting it up wasn’t too difficult: a month later, the cluster was churning through MapReduce jobs. But those happy days didn’t last long.
Six months later, the same cluster had become a source of headaches for our engineers. Our in-house cluster simply couldn’t keep up with the surging data volume, and precious engineering resources were being used to tame Hadoop. Our engineers were constantly being sidetracked by infrastructure problems. We had to do something about it.
Viki collects a lot of data. Last I checked, our service generated 100GB of gzip-compressed data per month and counting. In addition to solving our Hadoop maintenance issues, we needed a robust way to collect data.
Data volume was one problem, but data variety was an issue for us as well. Collecting data from both our webapp and mobile apps meant that we needed a data collection layer that could evolve in parallel with our frontend applications.
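To make the "data variety" point concrete, here is a minimal sketch of schema-flexible event records, assuming events are plain JSON objects stored as gzip-compressed JSON lines. The field names (`source`, `event`, `video_id`, `browser`, `os`, `app_version`) and the helper functions are hypothetical and are not taken from Viki's actual pipeline.

```python
import gzip
import json


def encode_events(events):
    """Serialise a list of dict events as gzip-compressed JSON lines."""
    lines = "\n".join(json.dumps(event, sort_keys=True) for event in events)
    return gzip.compress(lines.encode("utf-8"))


def decode_events(blob):
    """Inverse of encode_events: return the list of dict events."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line]


# Hypothetical events: the web and mobile clients report different fields,
# and neither needs a schema change in the collection layer.
web_event = {"source": "web", "event": "video_play",
             "video_id": "v1", "browser": "firefox"}
mobile_event = {"source": "mobile", "event": "video_play",
                "video_id": "v1", "os": "android", "app_version": "2.3"}

blob = encode_events([web_event, mobile_event])
events = decode_events(blob)
```

Because each record carries its own field names, a frontend can add or drop a field without a coordinated change on the collection side.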
Treasure Data was a “two birds with one stone” solution for us: it solved both our “Hadoop is taking up engineering resources” problem and our “we need to collect data in real time in a manageable way” problem.
No More Hadoop Maintenance
Treasure Data’s service let us forget about Hadoop altogether. Treasure Data uses Hadoop under the hood, but its complexity was now completely hidden from us. We could now just send data to them and query it later.
We no longer had to worry about our Hadoop cluster going down, or about tuning parameters to improve performance. Treasure Data’s dedicated engineers vigilantly monitor their systems around the clock, freeing our engineers from infrastructure maintenance.
td-agent, a Versatile Data Collector
Treasure Data’s data collector, td-agent (open-sourced as Fluentd), greatly simplified our real-time data collection. What we love the most about td-agent is its extensibility.
Its vibrant community has created more than 60 plugins (yes, we’ve contributed a couple of them ourselves! cf. fluent-plugin-http-enhanced, fluent-plugin-udp), making it simple to plug-and-play with numerous data sources and sinks. This versatility is what makes us confident that td-agent can keep up with our evolving data pipeline needs.
As a bonus, it has also allowed us to collect data from geographically distributed sources with ease.
Result Output into PostgreSQL: Simplifying Our System Even More!
Last but not least, one feature that we truly love is “Result Output”. Before Treasure Data, we had custom scripts that loaded our Hive query results into our PostgreSQL database, which our internal tools accessed for interactive analytics. But these scripts would fail occasionally, leaving the database with bad data.
Treasure Data, on the other hand, comes with a feature called Result Output, which lets users write query results directly into their own data stores. PostgreSQL is supported, so we started using it right away. It works reliably and has eliminated another problem from our data pipeline.
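One common way a hand-written load script can leave a table with bad data is by failing partway through a load. The sketch below shows how wrapping the load in a single transaction avoids that. It is an illustration only, not Treasure Data's implementation of Result Output and not the scripts described above. It uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL so that it runs without a server, and the table `daily_views` and the function `replace_table_atomically` are hypothetical names.

```python
import sqlite3


def replace_table_atomically(conn, rows):
    """Replace the contents of daily_views with rows in one transaction.

    Returns True on success. On any database error the transaction is
    rolled back, so the previous contents remain in place.
    """
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("DELETE FROM daily_views")
            conn.executemany(
                "INSERT INTO daily_views (day, views) VALUES (?, ?)", rows
            )
    except sqlite3.Error:
        return False
    return True


# In-memory database standing in for the analytics database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_views (day TEXT PRIMARY KEY, views INTEGER NOT NULL)"
)
conn.commit()

good_rows = [("2013-01-01", 12), ("2013-01-02", 7)]
bad_rows = [("2013-01-03", 5), ("2013-01-04", None)]  # None violates NOT NULL

ok = replace_table_atomically(conn, good_rows)
failed = replace_table_atomically(conn, bad_rows)

# The failed load was rolled back, so the table still holds good_rows.
current = conn.execute(
    "SELECT day, views FROM daily_views ORDER BY day"
).fetchall()
```

The second load fails on its last row, but because the delete and the inserts share one transaction, readers of the table never see a half-loaded result.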
I believe that there are two parts to any service: product and support. What’s even more amazing than Treasure Data’s product is their support. Every time we seek their support, we come away impressed with their deep technical expertise and dedication to ensuring their customers' success.
At one point, we noticed that data was being transferred from Treasure Data to our PostgreSQL database in batches of 1000 rows. As a result, the results of our largest queries were being imported rather slowly. Once we contacted Treasure Data’s support team, they investigated and promptly increased the batch size to 256KB, speeding up the import significantly. Talk about thoughtful customer support.
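To illustrate why batch size matters for import speed, here is a small sketch that compares a fixed 1000-row batch with a byte-based limit. It is not Treasure Data's transfer code. The 64-byte row size is an assumption chosen for the example, 256KB is taken to mean 256 * 1024 bytes, and `batch_by_rows` and `batch_by_bytes` are hypothetical helpers. With much larger rows, a byte limit could produce more batches rather than fewer.

```python
def batch_by_rows(rows, max_rows):
    """Split rows into consecutive batches of at most max_rows rows each."""
    return [rows[i:i + max_rows] for i in range(0, len(rows), max_rows)]


def batch_by_bytes(rows, max_bytes):
    """Group encoded rows into batches whose total size is within max_bytes.

    A single row larger than max_bytes is placed in a batch of its own.
    """
    batches, batch, size = [], [], 0
    for row in rows:
        row_size = len(row)
        if batch and size + row_size > max_bytes:
            batches.append(batch)
            batch, size = [], 0
        batch.append(row)
        size += row_size
    if batch:
        batches.append(batch)
    return batches


# Assumed workload: 50,000 encoded rows of 64 bytes each.
rows = [b"x" * 64 for _ in range(50_000)]

row_batches = batch_by_rows(rows, 1000)
byte_batches = batch_by_bytes(rows, 256 * 1024)

# Each batch corresponds to one round trip to the database.
print(len(row_batches), "round trips with 1000-row batches")
print(len(byte_batches), "round trips with 256KB batches")
```

Under these assumptions a 256KB batch holds 4096 rows, so the same result set needs far fewer round trips than it does with 1000-row batches.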
On another occasion, we were having some problems configuring our Fluentd data collector properly. The team at Treasure Data looked into our code and fixed the issue for us. When they made a new Fluentd release a couple of weeks later, they took the time to email us directly about the release and even sent us a git patch to apply to our code.
They’ve always given us thorough and timely support peppered with insightful tips to make the best use of their service.
Overall, we have been very happy with Treasure Data’s services (both the product and their support). We would gladly recommend Treasure Data to any teams out there facing problems similar to ours.