Class ExternalSampleSorter

All Implemented Interfaces:
SampleConsumer, SampleProcessor, SampleProducer

public class ExternalSampleSorter extends AbstractSampleConsumer
R-way external sample sorter.

This SampleConsumer should be used to sort samples based on a SampleComparator.

Samples are sorted with the external sort algorithm. Thus, samples are not all stored in memory to be sorted. Instead, they are sorted by chunk in memory and then written to the disk before being merged at the end.
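The chunk-then-merge approach described above can be illustrated with a minimal, self-contained sketch. It is not the code of this class: it sorts plain lines of text instead of samples, and every name in it (ExternalSortSketch, Cursor, the sink parameter) is hypothetical. It assumes each item fits on one line and omits the cleanup of open readers on failure.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.Consumer;

// Illustrative only: sorts lines of text, not JMeter samples.
final class ExternalSortSketch {

    // An open chunk file plus the line currently at its front.
    private static final class Cursor {
        final BufferedReader reader;
        String head;

        Cursor(Path chunk) throws IOException {
            reader = Files.newBufferedReader(chunk);
            head = reader.readLine();
        }
    }

    static void sort(Iterator<String> input, int chunkSize,
                     Comparator<String> order, Consumer<String> sink) {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<Path> chunks = new ArrayList<>();
        try {
            // Phase 1: sort at most chunkSize lines in memory, then spill them to disk.
            List<String> buffer = new ArrayList<>();
            while (input.hasNext()) {
                buffer.add(input.next());
                if (buffer.size() == chunkSize || !input.hasNext()) {
                    buffer.sort(order);
                    Path chunk = Files.createTempFile("sort-chunk-", ".txt");
                    chunks.add(chunk);
                    Files.write(chunk, buffer);
                    buffer.clear();
                }
            }
            // Phase 2: merge all chunks, holding only one line per chunk in memory.
            merge(chunks, order, sink);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            for (Path chunk : chunks) {
                chunk.toFile().delete(); // best-effort removal of temporary files
            }
        }
    }

    private static void merge(List<Path> chunks, Comparator<String> order,
                              Consumer<String> sink) throws IOException {
        PriorityQueue<Cursor> queue =
                new PriorityQueue<>((a, b) -> order.compare(a.head, b.head));
        for (Path chunk : chunks) {
            queue.add(new Cursor(chunk)); // every chunk holds at least one line
        }
        while (!queue.isEmpty()) {
            Cursor smallest = queue.poll();
            sink.accept(smallest.head);
            smallest.head = smallest.reader.readLine();
            if (smallest.head != null) {
                queue.add(smallest);      // re-insert with its next line
            } else {
                smallest.reader.close();  // chunk exhausted
            }
        }
    }
}
```

Memory use in this sketch is bounded by chunkSize during the first phase and by the number of chunk files during the merge, which is the trade-off the description refers to.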

This sorter makes it possible to sort any number of samples with a fixed amount of memory. Disk is used instead of RAM, at the cost of performance.

When parallel mode is enabled and several CPUs are available to the JVM, this sorter uses multiple CPUs to reduce sort time.
The parallel mode can be disabled if some sort of concurrency issue is encountered.
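How this class distributes work across CPUs is not described here. As one possible illustration, the sketch below sorts independent chunks on a thread pool sized to the available processors, and falls back to a single thread when the flag is off. The class name ParallelChunkSortSketch and the method sortChunks are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: one way to use several CPUs, not the documented implementation.
final class ParallelChunkSortSketch {

    static <T> List<List<T>> sortChunks(List<List<T>> chunks,
                                        Comparator<? super T> order,
                                        boolean parallelize) {
        // One worker per CPU in parallel mode, a single worker otherwise.
        int threads = parallelize ? Runtime.getRuntime().availableProcessors() : 1;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<T>>> pending = new ArrayList<>();
            for (List<T> chunk : chunks) {
                pending.add(pool.submit(() -> {
                    List<T> copy = new ArrayList<>(chunk); // leave the input untouched
                    copy.sort(order);
                    return copy;
                }));
            }
            // Collect results in submission order so chunk order is preserved.
            List<List<T>> sorted = new ArrayList<>();
            for (Future<List<T>> future : pending) {
                sorted.add(future.get());
            }
            return sorted;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Because each chunk is sorted independently, the result is the same with or without parallelism; only the elapsed time differs.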

As a last note, this SampleConsumer can also be used as a normal class through the different sort() methods.

It is important to set the chunkSize property according to the available memory, because the algorithm does not manage memory allocation itself (sample sizes are not predictable).
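Since sample sizes are not predictable, any chunk size is an estimate. The sketch below shows one rough way to derive a value from a memory budget and a guessed per-sample size. The helper and its parameters are hypothetical; only the minimum of 5000 comes from the setChunkSize description.

```java
// Hypothetical helper, not part of the documented class.
final class ChunkSizeEstimateSketch {

    // Minimum stated in the setChunkSize description.
    static final long MIN_CHUNK_SIZE = 5000L;

    /**
     * @param memoryBudgetBytes    heap the caller is willing to spend on one chunk
     * @param estimatedSampleBytes caller's guess at the in-memory size of one sample
     * @return a chunk size that fits the budget, never below the minimum
     */
    static long estimateChunkSize(long memoryBudgetBytes, long estimatedSampleBytes) {
        if (memoryBudgetBytes <= 0 || estimatedSampleBytes <= 0) {
            throw new IllegalArgumentException("both arguments must be positive");
        }
        long fitting = memoryBudgetBytes / estimatedSampleBytes;
        // Mirror the documented floor: values under 5000 are raised to 5000.
        return Math.max(MIN_CHUNK_SIZE, fitting);
    }
}
```

Note that when the floor applies, the resulting chunk may need more memory than the budget given, so a very small budget is not honored by this estimate.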

It is equally important to set a SampleComparator to define the sample ordering.
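To illustrate what "defining the ordering" means, here is a minimal sketch using the standard java.util.Comparator rather than SampleComparator, whose API is not shown in this page. The Row type and its fields are hypothetical stand-ins for a sample.

```java
import java.util.Comparator;

// Illustrative only: a standard Comparator over a hypothetical sample-like type.
final class SampleOrderingSketch {

    // Hypothetical stand-in for a sample: a timestamp and a label.
    static final class Row {
        final long timestamp;
        final String label;

        Row(long timestamp, String label) {
            this.timestamp = timestamp;
            this.label = label;
        }
    }

    // Order by timestamp first, then by label so that ties are broken deterministically.
    static final Comparator<Row> BY_TIME_THEN_LABEL =
            Comparator.comparingLong((Row r) -> r.timestamp)
                      .thenComparing((Row r) -> r.label);
}
```

A total, deterministic ordering matters for an external sort because the same comparison is applied both when sorting chunks and when merging them.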

  • Constructor Details

    • ExternalSampleSorter

      public ExternalSampleSorter()
    • ExternalSampleSorter

      public ExternalSampleSorter(SampleComparator comparator)
  • Method Details

    • setChunkSize

      public void setChunkSize(long chunkSize)
      Set the number of samples that will be stored in memory. This also defines the number of samples written to each chunk file before the merging step.
      chunkSize - The number of samples sorted in memory before they are written to disk. The minimum is 5000, which is used if the given chunkSize is less than 5000.
    • setSampleComparator

      public final void setSampleComparator(SampleComparator sampleComparator)
      Set the sample comparator that will define sample ordering
      sampleComparator - comparator to define the ordering
    • setParallelize

      public void setParallelize(boolean parallelize)
      Enable or disable parallel mode
      parallelize - true to enable, false to disable
    • isParallelize

      public boolean isParallelize()
      true when parallel mode is enabled, false otherwise
    • sort

      public void sort(CsvFile inputFile, File outputFile, boolean writeHeader)
      Sort an input CSV file to a sorted output CSV file.

      The input CSV must have a header, otherwise sorting will give unpredictable results.

      inputFile - The CSV file to be sorted (must not be null)
      outputFile - The sorted destination CSV file (must not be null)
      writeHeader - Whether the CSV header should be written to the output CSV file
    • sort

      public void sort(SampleMetadata sampleMetadata, File inputFile, File outputFile, boolean writeHeader)
      Sort an input CSV file whose metadata structure is provided. Use this method when the input CSV file has no header: header information is then provided through the sampleMetadata parameter.
      sampleMetadata - The CSV metadata: header information + separator (must not be null)
      inputFile - The input file to be sorted (must not be null)
      outputFile - The output sorted file (must not be null)
      writeHeader - Whether the output CSV header should be written (based on the provided sample metadata)
    • startConsuming

      public void startConsuming()
      Description copied from interface: SampleConsumer
      Start the sample consuming. This step is used by consumers to initialize their process.
    • consume

      public void consume(Sample s, int channel)
      Description copied from interface: SampleConsumer
      Consumes the specified sample on the specified channel.
      s - The sample to be consumed
      channel - The channel on which the sample is consumed
    • stopConsuming

      public void stopConsuming()
      Description copied from interface: SampleConsumer
      Stops the consuming process. No sample will be processed after this service has been called.
    • sort

      public List<Sample> sort(List<Sample> samples)
    • mergeFiles

      public void mergeFiles(List<File> chunks, SampleMetadata metadata, SampleProducer producer)
    • isRevertedSort

      public final boolean isRevertedSort()
      flag indicating whether the order of the sort should be reverted
    • setRevertedSort

      public final void setRevertedSort(boolean revertedSort)
      revertedSort - flag indicating whether the order of the sort should be reverted. false uses the order of the configured SampleComparator
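
The effect of the revertedSort flag can be pictured with the standard java.util.Comparator: when the flag is false the configured ordering is used as-is, and when it is true the ordering is reversed. This is a sketch of the concept only; the helper name effectiveOrder is hypothetical, and how the class applies the flag internally is not documented here.

```java
import java.util.Comparator;

// Illustrative only: shows the meaning of a "reverted" flag on an ordering.
final class RevertedSortSketch {

    static <T> Comparator<T> effectiveOrder(Comparator<T> configured, boolean revertedSort) {
        // false keeps the configured ordering; true flips it.
        return revertedSort ? configured.reversed() : configured;
    }
}
```

With a natural-order comparator on integers, for instance, the reverted ordering sorts values from largest to smallest.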