In my last blog post, I wrote about how using an appropriate regular expression in the Flume script gave us a significant improvement in Flume's performance.

In another project, we ran into a different problem and realized that, in addition to using an appropriate regex, there was another way to improve Flume's performance drastically.

Context

We were receiving data in the form of CSV files. Each file was very small, just a few KB in size, and we were passing the files to Flume one by one. Initially, when only a small number of files were available for testing, we did not notice a problem: the files loaded in a matter of seconds.

In the performance testing phase, we started receiving much larger volumes of data and needed to load thousands of files in a few seconds. That is when we noticed that Flume could not load the files as fast as we expected. We could load barely 60-70 files per minute, which was inadequate.

We knew that HDFS prefers dealing with a small number of large files rather than a large number of small files. After some analysis, we realized that the same concept might also apply to Flume.

Approach to the Problem

We then introduced a pre-processing step in which we combined multiple small files into a single large file before passing it to Flume. The results were as expected, and still astonishing: concatenating smaller files into bigger ones before passing them to Flume improved loading times significantly. In one instance, performance improved by more than 700%!
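
A pre-processing step of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script we used in the project: the function name concatenate_csv_batches, the batch_NNNNN.csv naming, the has_header flag, and the idea of writing to a separate staging directory (rather than the directory Flume reads from) are all hypothetical choices made for the example.

```python
from pathlib import Path


def concatenate_csv_batches(src_dir, out_dir, files_per_batch=2000, has_header=False):
    """Combine small CSV files from src_dir into larger files in out_dir.

    Hypothetical helper for illustration. Returns the combined files written.
    """
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Sort so that the batching order is stable and repeatable.
    sources = sorted(src_dir.glob("*.csv"))
    written = []
    for start in range(0, len(sources), files_per_batch):
        batch = sources[start:start + files_per_batch]
        target = out_dir / f"batch_{start // files_per_batch:05d}.csv"
        with target.open("w", encoding="utf-8", newline="") as out:
            for index, source in enumerate(batch):
                with source.open("r", encoding="utf-8", newline="") as src:
                    lines = src.readlines()
                if has_header and index > 0:
                    # Keep the header row only from the first file of the batch.
                    lines = lines[1:]
                for line in lines:
                    # End every record with a newline so that the last row of
                    # one file does not run into the first row of the next.
                    out.write(line if line.endswith("\n") else line + "\n")
        written.append(target)
    return written
```

Calling concatenate_csv_batches("incoming", "staging", files_per_batch=2000) would turn each group of up to 2000 small files into one larger file, which can then be handed to Flume. The files_per_batch value is the knob to experiment with when looking for the best level of concatenation.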

Here’s a summary of what we achieved by using different combinations.

Conclusion

  • When the files being loaded via Flume are small, concatenate them into bigger files before passing them to Flume. Loading times are reduced significantly, because the Java overhead of opening and closing each small file shrinks when fewer files are passed.
  • It is essential to test and see what level of concatenation gives optimum results. In our case, concatenating around 2000 files with an average size below 10 KB each gave good results. Concatenating any further is not expected to give additional benefits. In fact, beyond a certain threshold, it might decrease loading performance.

Author: Shubham Shirude
