Class: RedAmber::Group

Inherits:
Object
  • Object
Includes:
Enumerable, Helper
Defined in:
lib/red_amber/group.rb

Overview

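Groups the rows of a DataFrame by the values of one or more key columns and provides aggregation methods over those groups.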

Constructor Details

#initialize(dataframe, *group_keys) ⇒ Group

Creates a new Group object.

Examples:

Group.new(penguins, :species)

# =>
#<RedAmber::Group : 0x000000000000f410>
  species     count
  <string>  <uint8>
0 Adelie        152
1 Chinstrap      68
2 Gentoo        124

Parameters:

  • dataframe (DataFrame)

    dataframe to be grouped.

  • group_keys (Array<Symbol, String>)

    keys for grouping.

Raises:

  • (GroupArgumentError)

    if group_keys is empty or contains a key that is not in the dataframe.

# File 'lib/red_amber/group.rb', line 65

def initialize(dataframe, *group_keys)
  @dataframe = dataframe
  @group_keys = group_keys.flatten

  raise GroupArgumentError, 'group_keys are empty.' if @group_keys.empty?

  d = @group_keys - @dataframe.keys
  raise GroupArgumentError, "#{d} is not a key of\n #{@dataframe}." unless d.empty?

  @group = @dataframe.table.group(*@group_keys)
end
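
The argument checks in the constructor can be sketched in plain Ruby, without RedAmber. This is a minimal sketch only: `validate_group_keys` and `SketchGroupArgumentError` are hypothetical names standing in for the constructor body and `GroupArgumentError`, and `available_keys` stands in for `@dataframe.keys`.

```ruby
# Hypothetical stand-in for RedAmber's GroupArgumentError.
class SketchGroupArgumentError < ArgumentError; end

# Mirrors the constructor's checks: flatten the keys, reject an empty list,
# then reject any key that the source frame does not have.
def validate_group_keys(available_keys, *group_keys)
  keys = group_keys.flatten
  raise SketchGroupArgumentError, 'group_keys are empty.' if keys.empty?

  unknown = keys - available_keys
  raise SketchGroupArgumentError, "#{unknown} is not a key." unless unknown.empty?

  keys
end

# Nested arrays and plain symbols can be mixed because of the flatten step.
validated = validate_group_keys(%i[x y z], [:y], :z)
```

Because the keys are flattened first, passing `[:y], :z` and passing `:y, :z` are treated the same way.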

Instance Attribute Details

#dataframe ⇒ DataFrame (readonly)

Source DataFrame.

Returns:

  • (DataFrame)

    source DataFrame.

# File 'lib/red_amber/group.rb', line 16

def dataframe
  @dataframe
end

#group_keys ⇒ Array (readonly)

Keys for grouping by value.

Returns:

  • (Array)

    group keys.



# File 'lib/red_amber/group.rb', line 23

def group_keys
  @group_keys
end

Instance Method Details

#agg_sum(*summary_keys) ⇒ Object

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Aggregating summary.



# File 'lib/red_amber/group.rb', line 604

def agg_sum(*summary_keys)
  call_aggregating_function(:sum, summary_keys, _options = nil)
end

#all(*group_keys) ⇒ DataFrame

Returns aggregated DataFrame.

Examples:

For boolean columns by default.

dataframe

# =>
#<RedAmber::DataFrame : 6 x 3 Vectors, 0x00000000000230dc>
        x y        z
  <uint8> <string> <boolean>
0       1 A        false
1       2 A        true
2       3 B        false
3       4 B        (nil)
4       5 B        true
5       6 C        false

dataframe.group(:y).all

# =>
#<RedAmber::DataFrame : 3 x 2 Vectors, 0x000000000000fc08>
  y        all(z)
  <string> <boolean>
0 A        false
1 B        false
2 C        false

Parameters:

  • group_keys (Array<Symbol, String>)

    keys for grouping.

Returns:

  • (DataFrame)

    aggregated DataFrame.

# File 'lib/red_amber/group.rb', line 111

define_group_aggregation :all

#any(*group_keys) ⇒ DataFrame

Returns aggregated DataFrame.

Examples:

For boolean columns by default.

dataframe.group(:y).any

# =>
#<RedAmber::DataFrame : 3 x 2 Vectors, 0x00000000000117ec>
  y        any(z)
  <string> <boolean>
0 A        true
1 B        true
2 C        false

Parameters:

  • group_keys (Array<Symbol, String>)

    keys for grouping.

Returns:

  • (DataFrame)

    aggregated DataFrame.

# File 'lib/red_amber/group.rb', line 128

define_group_aggregation :any
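
The `all` and `any` results shown above can be reproduced with a plain-Ruby sketch over the `y` and `z` columns of the example DataFrame. It assumes that nil entries are skipped before aggregating, which is consistent with the documented output for group B but is an assumption about the underlying behavior, not a statement of how RedAmber implements it.

```ruby
# Columns y and z from the example DataFrame above.
ys = %w[A A B B B C]
zs = [false, true, false, nil, true, false]

# Collect the non-nil z values of each group, keyed by y.
# Assumption: nil entries are skipped before aggregating.
groups = ys.zip(zs).group_by(&:first).transform_values do |pairs|
  pairs.map(&:last).compact
end

all_by_group = groups.transform_values { |values| values.all? }
any_by_group = groups.transform_values { |values| values.any? }
```

Group B contains `false`, `nil` and `true`; with the nil skipped, `all` is false and `any` is true, matching the tables.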

#count(*group_keys) ⇒ Object



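Counts the values in each group. If every count column holds identical values, the result is reduced to the group key columns plus a single count column.
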
# File 'lib/red_amber/group.rb', line 162

def count(*group_keys)
  df = __count(group_keys)
  if df.pick(@group_keys.size..).to_h.values.uniq.size == 1
    df.pick(0..@group_keys.size).rename { [keys[-1], :count] }
  else
    df
  end
end
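
The branch in `#count` collapses the count columns into one when they all hold the same values. A minimal plain-Ruby sketch of that idea follows; `unify_counts`, `columns` and `n_keys` are hypothetical names, and a Hash of arrays stands in for the DataFrame.

```ruby
# Unify identical count columns, mirroring the branch in #count.
# `columns` is a Hash of column name => Array of values; `n_keys` is the
# number of leading group-key columns. Both names are hypothetical.
def unify_counts(columns, n_keys)
  count_columns = columns.to_a[n_keys..]
  return columns unless count_columns.map(&:last).uniq.size == 1

  # Keep the key columns and a single column named :count.
  unified = columns.first(n_keys).to_h
  unified[:count] = count_columns.first.last
  unified
end

same = unify_counts(
  { y: %w[A B C], 'count(x)': [2, 3, 1], 'count(w)': [2, 3, 1] }, 1
)
different = unify_counts(
  { y: %w[A B C], 'count(x)': [2, 3, 1], 'count(z)': [2, 2, 1] }, 1
)
```

In the first call both count columns agree, so only `y` and `count` remain. In the second call they differ, so the input is returned unchanged.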

#count_uniq(*group_keys) ⇒ DataFrame

Count the unique values in each group.

Returns aggregated DataFrame.

Examples:

Show counts for each group.

dataframe.group(:y).count_uniq

# =>
#<RedAmber::DataFrame : 3 x 2 Vectors, 0x000000000011ea04>
  y        count_uniq(x)
  <string>       <int64>
0 A                    2
1 B                    3
2 C                    1

Parameters:

  • group_keys (Array<Symbol, String>)

    keys for grouping.

Returns:

  • (DataFrame)

    aggregated DataFrame.

# File 'lib/red_amber/group.rb', line 208

define_group_aggregation :count_distinct
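
As a cross-check of the table above, the distinct counts can be computed in plain Ruby from the `x` and `y` columns of the example DataFrame. This is an illustrative sketch, not RedAmber's implementation.

```ruby
xs = [1, 2, 3, 4, 5, 6]
ys = %w[A A B B B C]

# Number of distinct x values per group, keyed by y.
distinct_counts = xs.zip(ys).group_by(&:last).transform_values do |pairs|
  pairs.map(&:first).uniq.size
end
```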

#each ⇒ Enumerator #each {|df| ... } ⇒ Integer

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Iterates over each record group as a DataFrame, or returns an Enumerator if no block is given.

Overloads:

  • #each ⇒ Enumerator

    Returns a new Enumerator if no block given.

    Returns:

    • (Enumerator)

      Enumerator of each group as a DataFrame.

  • #each {|df| ... } ⇒ Integer

    When a block is given, passes each record group as a DataFrame to the block.

    Yield Parameters:

    • df (DataFrame)

      each record group, passed to the block as a DataFrame.

    Yield Returns:

    • (Object)

      evaluated result value from the block.

    Returns:

    • (Integer)

      group size.



# File 'lib/red_amber/group.rb', line 431

def each
  return enum_for(:each) unless block_given?

  filters.each do |filter|
    yield @dataframe.filter(filter)
  end
  @filters.size
end
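
The two overloads follow the usual Ruby iteration protocol: return an Enumerator when no block is given, otherwise yield each group and return the number of groups. The following sketch shows that protocol in isolation. `MiniGroup` is a hypothetical class that groups an Array of Hashes, not part of RedAmber.

```ruby
# Hypothetical miniature of the iteration protocol used by Group#each.
class MiniGroup
  include Enumerable

  def initialize(rows, key)
    @rows = rows
    @key = key
  end

  # Without a block, return an Enumerator; with a block, yield the rows of
  # each group and return the number of groups.
  def each
    return enum_for(:each) unless block_given?

    partitions = @rows.group_by { |row| row[@key] }.values
    partitions.each { |rows| yield rows }
    partitions.size
  end
end

rows = [{ y: 'A', x: 1 }, { y: 'A', x: 2 }, { y: 'B', x: 3 }]
group = MiniGroup.new(rows, :y)

sizes = group.map(&:size)       # Enumerable methods build on #each
returned = group.each { |_| }   # the block form returns the group count
enumerator = group.each         # the blockless form returns an Enumerator
```

Including `Enumerable` is what makes methods such as `map` available once `each` is defined.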

#filters ⇒ Array

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Returns boolean filters, one per group, each selecting the records that belong to that group.

Returns:

  • (Array)

    an Array of boolean filter Vectors.



# File 'lib/red_amber/group.rb', line 385

def filters
  @filters ||= begin
    group_values = group_table[group_keys].each_record.map(&:to_a)

    Enumerator.new(group_table.n_rows) do |yielder|
      group_values.each do |values|
        booleans =
          values.map.with_index do |value, i|
            column = @dataframe[group_keys[i]].data
            if value.nil?
              Arrow::Function.find('is_null').execute([column])
            elsif value.is_a?(Float) && value.nan?
              Arrow::Function.find('is_nan').execute([column])
            else
              Arrow::Function.find('equal').execute([column, value])
            end
          end
        filter =
          booleans.reduce do |result, datum|
            Arrow::Function.find('and_kleene').execute([result, datum])
          end
        yielder << Vector.create(filter.value)
      end
    end
  end
end
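
The idea behind `filters` is to build one boolean mask per distinct combination of key values, with nil and NaN matched by dedicated tests because they do not compare equal with `==`. A plain-Ruby sketch of that idea follows. `matches?` and `group_masks` are hypothetical helpers, a Hash of arrays stands in for the DataFrame, and plain `true`/`false` values replace Arrow's Kleene logic, so null handling inside a mask may differ from the real method.

```ruby
# True when `cell` belongs to the group identified by `value`.
# nil and NaN need dedicated tests because == does not match them reliably.
def matches?(cell, value)
  if value.nil?
    cell.nil?
  elsif value.is_a?(Float) && value.nan?
    cell.is_a?(Float) && cell.nan?
  else
    cell == value
  end
end

# Build one boolean mask per distinct key combination.
# `columns` maps column name => Array of values; `keys` lists the grouping columns.
def group_masks(columns, keys)
  n_rows = columns.fetch(keys.first).size
  rows = (0...n_rows).map { |i| keys.map { |key| columns.fetch(key)[i] } }

  # Distinct key combinations, in order of first appearance.
  combos = rows.each_with_object([]) do |row, seen|
    known = seen.any? { |combo| combo.zip(row).all? { |value, cell| matches?(cell, value) } }
    seen << row unless known
  end

  # A row is selected only if every key column matches (logical AND).
  combos.map do |combo|
    rows.map { |row| combo.zip(row).all? { |value, cell| matches?(cell, value) } }
  end
end

columns = { y: ['A', 'A', nil, 'B', nil], x: [1, 2, 3, 4, 5] }
masks = group_masks(columns, [:y])
```

Here `masks` holds three masks, for the groups `'A'`, `nil` and `'B'`. The order of groups in this sketch is order of first appearance, which is not guaranteed to match the order RedAmber produces.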

#group_count ⇒ DataFrame Also known as: count_all

Returns each record group size as a DataFrame.

Examples:

penguins.group(:species).group_count

# =>
#<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000003a70>
  species   group_count
  <string>      <uint8>
0 Adelie            152
1 Chinstrap          68
2 Gentoo            124

Returns:

  • (DataFrame)

    DataFrame consists of:

    • Group key columns.

    • Result columns by group aggregation.



# File 'lib/red_amber/group.rb', line 188

def group_count
  DataFrame.create(group_table)
end

#grouped_frame ⇒ DataFrame Also known as: none

Returns a grouped DataFrame that contains only the group key columns.

Returns:

  • (DataFrame)

    grouped DataFrame projected only for group_keys.

Since:

  • 0.5.0



# File 'lib/red_amber/group.rb', line 595

def grouped_frame
  DataFrame.create(group_table[group_keys])
end

#inspect ⇒ String

String representation of self.

Examples:

puts penguins.group(:species).inspect

# =>
#<RedAmber::Group : 0x0000000000003a98>
  species   group_count
  <string>      <uint8>
0 Adelie            152
1 Chinstrap          68
2 Gentoo            124

Returns:

  • (String)

    information about self as a String.



# File 'lib/red_amber/group.rb', line 455

def inspect
  "#<#{self.class} : #{format('0x%016x', object_id)}>\n#{group_count}"
end
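
The first line of the output comes from formatting `object_id` as a zero-padded, 16-digit hexadecimal number. A minimal sketch of just that header, using a hypothetical class `Tagged`:

```ruby
# Hypothetical class that reproduces only the header line of Group#inspect.
class Tagged
  def inspect
    "#<#{self.class} : #{format('0x%016x', object_id)}>"
  end
end

header = Tagged.new.inspect
```

The hexadecimal value depends on the object and the Ruby process, so it differs between runs.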

#max(*group_keys) ⇒ DataFrame

Returns aggregated DataFrame.

Examples:

dataframe.group(:y).max

# =>
#<RedAmber::DataFrame : 3 x 2 Vectors, 0x000000000014ae74>
  y         max(x)
  <string> <uint8>
0 A              2
1 B              5
2 C              6

Parameters:

  • group_keys (Array<Symbol, String>)

    keys for grouping.

Returns:

  • (DataFrame)

    aggregated DataFrame.

# File 'lib/red_amber/group.rb', line 158

define_group_aggregation :count

#mean(*group_keys) ⇒ DataFrame

Returns aggregated DataFrame.

Examples:

dataframe.group(:y).mean

# =>
#<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000138a8>
  y         mean(x)
  <string> <double>
0 A             1.5
1 B             4.0
2 C             6.0

Parameters:

  • group_keys (Array<Symbol, String>)

    keys for grouping.

Returns:

  • (DataFrame)

    aggregated DataFrame.

# File 'lib/red_amber/group.rb', line 250

define_group_aggregation :mean

#median(*group_keys) ⇒ DataFrame

Returns aggregated DataFrame.

Examples:

dataframe.group(:y).median

# =>
#<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000138a8>
  y        median(x)
  <string>  <double>
0 A              1.5
1 B              4.0
2 C              6.0

Parameters:

  • group_keys (Array<Symbol, String>)

    keys for grouping.

Returns:

  • (DataFrame)

    aggregated DataFrame.

# File 'lib/red_amber/group.rb', line 267

define_group_aggregation :approximate_median

#min(*group_keys) ⇒ DataFrame

Returns aggregated DataFrame.

Examples:

dataframe.group(:y).min

# =>
#<RedAmber::DataFrame : 3 x 2 Vectors, 0x000000000018f38>
  y         min(x)
  <string> <uint8>
0 A              1
1 B              3
2 C              6

Parameters:

  • group_keys (Array<Symbol, String>)

    keys for grouping.

Returns:

  • (DataFrame)

    aggregated DataFrame.

# File 'lib/red_amber/group.rb', line 292

define_group_aggregation :min

#one(*group_keys) ⇒ DataFrame

Returns aggregated DataFrame.

Examples:

dataframe.group(:y).one

# =>
#<RedAmber::DataFrame : 3 x 2 Vectors, 0x000000000002885c>
  y         one(x)
  <string> <uint8>
0 A              1
1 B              3
2 C              6

Parameters:

  • group_keys (Array<Symbol, String>)

    keys for grouping.

Returns:

  • (DataFrame)

    aggregated DataFrame.

# File 'lib/red_amber/group.rb', line 309

define_group_aggregation :one

#product(*group_keys) ⇒ DataFrame

Returns aggregated DataFrame.

Examples:

dataframe.group(:y).product

# =>
#<RedAmber::DataFrame : 3 x 2 Vectors, 0x000000000021a84>
  y        product(x)
  <string>   <uint64>
0 A                 2
1 B                60
2 C                 6

Parameters:

  • group_keys (Array<Symbol, String>)

    keys for grouping.

Returns:

  • (DataFrame)

    aggregated DataFrame.

# File 'lib/red_amber/group.rb', line 326

define_group_aggregation :product

#stddev(*group_keys) ⇒ DataFrame

Returns aggregated DataFrame.

Examples:

dataframe.group(:y).stddev

# =>
#<RedAmber::DataFrame : 3 x 2 Vectors, 0x00000000002be6c>
  y        stddev(x)
  <string>  <double>
0 A              0.5
1 B            0.816
2 C              0.0

Parameters:

  • group_keys (Array<Symbol, String>)

    keys for grouping.

Returns:

  • (DataFrame)

    aggregated DataFrame.

# File 'lib/red_amber/group.rb', line 343

define_group_aggregation :stddev

#sum(*group_keys) ⇒ DataFrame

Returns aggregated DataFrame.

Examples:

dataframe.group(:y).sum

# =>
#<RedAmber::DataFrame : 3 x 2 Vectors, 0x000000000032a14>
  y          sum(x)
  <string> <uint64>
0 A               3
1 B              12
2 C               6

Parameters:

  • group_keys (Array<Symbol, String>)

    keys for grouping.

Returns:

  • (DataFrame)

    aggregated DataFrame.

# File 'lib/red_amber/group.rb', line 360

define_group_aggregation :sum
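
The numeric aggregations documented above (max, mean, min, product, sum) can be cross-checked in plain Ruby against the `x` and `y` columns of the example DataFrame. This sketch only reproduces the arithmetic; it does not reflect how RedAmber or Arrow compute the results.

```ruby
xs = [1, 2, 3, 4, 5, 6]
ys = %w[A A B B B C]

# Values of x collected per group, keyed by y.
groups = xs.zip(ys).group_by(&:last).transform_values do |pairs|
  pairs.map(&:first)
end

summary = groups.transform_values do |values|
  {
    min: values.min,
    max: values.max,
    sum: values.sum,
    product: values.reduce(1, :*),
    mean: values.sum.fdiv(values.size)
  }
end
```

For group B, which holds 3, 4 and 5, this gives a minimum of 3, a maximum of 5, a sum of 12, a product of 60 and a mean of 4.0, in line with the tables.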

#summarize {|group| ... } ⇒ DataFrame #summarize {|group| ... } ⇒ DataFrame #summarize {|group| ... } ⇒ DataFrame

Summarize Group by aggregation functions from the block.

Overloads:

  • #summarize {|group| ... } ⇒ DataFrame

    Summarize by a function.

    Examples:

    Single function and single variable

    group = penguins.group(:species)
    group
    
    # =>
    #<RedAmber::Group : 0x000000000000c314>
      species   group_count
      <string>      <uint8>
    0 Adelie            152
    1 Chinstrap          68
    2 Gentoo            124
    
    group.summarize { mean(:bill_length_mm) }
    
    # =>
    #<RedAmber::DataFrame : 3 x 2 Vectors, 0x000000000000c364>
      species   mean(bill_length_mm)
      <string>              <double>
    0 Adelie                   38.79
    1 Chinstrap                48.83
    2 Gentoo                    47.5

    Single function only

    group.summarize { mean }
    
    # =>
    #<RedAmber::DataFrame : 3 x 6 Vectors, 0x000000000000c350>
      species   mean(bill_length_mm) mean(bill_depth_mm) ... mean(year)
      <string>              <double>            <double> ...   <double>
    0 Adelie                   38.79               18.35 ...    2008.01
    1 Chinstrap                48.83               18.42 ...    2007.97
    2 Gentoo                    47.5               14.98 ...    2008.08

    Yield Parameters:

    • group (Group)

      passes group object self.

    Yield Returns:

    • (DataFrame)

      an aggregated DataFrame.

    Returns:

    • (DataFrame)

      summarized DataFrame.

  • #summarize {|group| ... } ⇒ DataFrame

    Summarize by multiple functions.

    Examples:

    Multiple functions

    group.summarize { [min(:bill_length_mm), max(:bill_length_mm)] }
    
    # =>
    #<RedAmber::DataFrame : 3 x 3 Vectors, 0x000000000000c378>
      species   min(bill_length_mm) max(bill_length_mm)
      <string>             <double>            <double>
    0 Adelie                   32.1                46.0
    1 Chinstrap                40.9                58.0
    2 Gentoo                   40.9                59.6

    Yield Parameters:

    • group (Group)

      passes group object self.

    Yield Returns:

    • (Array<DataFrame>)

      an aggregated DataFrame or an array of aggregated DataFrames.

    Returns:

    • (DataFrame)

      summarized DataFrame.

  • #summarize {|group| ... } ⇒ DataFrame

    Summarize by functions, renaming the result columns with a Hash.

    Examples:

    Rename column name by Hash

    group.summarize {
      {
        min_bill_length_mm: min(:bill_length_mm),
        max_bill_length_mm: max(:bill_length_mm),
      }
    }
    
    # =>
    #<RedAmber::DataFrame : 3 x 3 Vectors, 0x000000000000c378>
      species   min_bill_length_mm max_bill_length_mm
      <string>            <double>           <double>
    0 Adelie                  32.1               46.0
    1 Chinstrap               40.9               58.0
    2 Gentoo                  40.9               59.6

    Yield Parameters:

    • group (Group)

      passes group object self.

    Yield Returns:

    • (Hash{Symbol, String => DataFrame})

      a Hash that maps each new column name to an aggregated DataFrame. Each DataFrame must contain only one aggregated column.

    Returns:

    • (DataFrame)

      summarized DataFrame.

# File 'lib/red_amber/group.rb', line 549

def summarize(*args, &block)
  if block
    agg = instance_eval(&block)
    unless args.empty?
      agg = [agg] if agg.is_a?(DataFrame)
      agg = args.zip(agg).to_h
    end
  else
    agg = args
  end

  case agg
  when DataFrame
    agg
  when Array
    aggregations =
      agg.map do |df|
        v = df.vectors[-1]
        [v.key, v]
      end
    agg[0].assign(aggregations)
  when Hash
    aggregations =
      agg.map do |key, df|
        aggregated_keys = df.keys - @group_keys
        if aggregated_keys.size > 1
          message =
            "accept only one column from the Hash: #{aggregated_keys.join(', ')}"
          raise GroupArgumentError, message
        end

        v = df.vectors[-1]
        [key, v]
      end
    agg.values[-1].drop(-1).assign(aggregations)
  else
    raise GroupArgumentError, "Unknown argument: #{agg}"
  end
end
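
The `case` statement above merges the block's result according to its shape: a single frame, an Array of frames, or a Hash that renames the results. A simplified plain-Ruby sketch of that dispatch follows. `Frame` and `combine` are hypothetical stand-ins (a Struct wrapping an ordered Hash of columns), and the Hash branch assumes each frame holds the key columns plus exactly one aggregated column.

```ruby
# Hypothetical stand-in for an aggregated DataFrame: column name => values.
Frame = Struct.new(:columns)

def combine(agg, group_keys)
  case agg
  when Frame
    agg
  when Array
    # Keep the first frame and append the last column of every frame.
    merged = agg.first.columns.dup
    agg.each { |frame| merged.store(*frame.columns.to_a.last) }
    Frame.new(merged)
  when Hash
    # Keep the key columns, then add each frame's result under its new name.
    merged = agg.values.last.columns.slice(*group_keys)
    agg.each do |name, frame|
      extra = frame.columns.keys - group_keys
      raise ArgumentError, "accept only one column: #{extra.join(', ')}" if extra.size > 1

      merged[name] = frame.columns.values.last
    end
    Frame.new(merged)
  else
    raise ArgumentError, "Unknown argument: #{agg}"
  end
end

species = %w[Adelie Chinstrap Gentoo]
min_df = Frame.new({ species: species, 'min(bill_length_mm)': [32.1, 40.9, 40.9] })
max_df = Frame.new({ species: species, 'max(bill_length_mm)': [46.0, 58.0, 59.6] })

listed = combine([min_df, max_df], [:species])
renamed = combine({ min_bill_length_mm: min_df, max_bill_length_mm: max_df }, [:species])
```

The Array form keeps the generated column names, while the Hash form replaces them with the Hash keys.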

#variance(*group_keys) ⇒ DataFrame

Returns aggregated DataFrame.

Examples:

dataframe.group(:y).variance

# =>
#<RedAmber::DataFrame : 3 x 2 Vectors, 0x00000000003b1dc>
  y        variance(x)
  <string>    <double>
0 A               0.25
1 B              0.667
2 C                0.0

Parameters:

  • group_keys (Array<Symbol, String>)

    keys for grouping.

Returns:

  • (DataFrame)

    aggregated DataFrame.

# File 'lib/red_amber/group.rb', line 377

define_group_aggregation :variance
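
The variance and standard deviation of each group can be cross-checked in plain Ruby. This sketch assumes population statistics (divisor n rather than n - 1), which is what the values 0.25 and 0.5 for group A imply. `population_variance` is a hypothetical helper, not a RedAmber method.

```ruby
xs = [1, 2, 3, 4, 5, 6]
ys = %w[A A B B B C]

# Values of x collected per group, keyed by y.
groups = xs.zip(ys).group_by(&:last).transform_values do |pairs|
  pairs.map(&:first)
end

# Population variance: mean squared deviation from the mean (divisor n).
def population_variance(values)
  mean = values.sum.fdiv(values.size)
  values.sum { |v| (v - mean)**2 }.fdiv(values.size)
end

variances = groups.transform_values { |values| population_variance(values) }
stddevs = variances.transform_values { |v| Math.sqrt(v) }
# variances: A is 0.25, B is about 0.667, C is 0.0
# stddevs:   A is 0.5,  B is about 0.816, C is 0.0
```

With a sample divisor (n - 1) group A would instead give a variance of 0.5, so the choice of divisor matters when comparing results across libraries.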