Module: RedAmber::DataFrameCombinable

Included in:
DataFrame
Defined in:
lib/red_amber/data_frame_combinable.rb

Overview

Mix-in for the class DataFrame

Instance Method Details

#anti_join(other, suffix: '.1', force_order: true) ⇒ DataFrame #anti_join(other, join_keys, suffix: '.1', force_order: true) ⇒ DataFrame #anti_join(other, join_key_pairs, suffix: '.1', force_order: true) ⇒ DataFrame

Note:

The order of joined results is preserved by default. This is done by appending an index column and sorting by it after joining, which causes some performance degradation. If the order of the result does not matter, set the `force_order` option to `false`.

Return records of self that do not have a match in other.

  • Same as `#join` with `type: :left_anti`.

  • A kind of filtering join.

Overloads:

  • #anti_join(other, suffix: '.1', force_order: true) ⇒ DataFrame

    If `join_keys` is not specified, the common keys in self and other are used (natural keys). Returns joined dataframe.

    Examples:

    df = DataFrame.new(KEY: %w[A B C], X1: [1, 2, 3])
    
    # =>
      KEY           X1
      <string> <uint8>
    0 A              1
    1 B              2
    2 C              3
    
    other = DataFrame.new(KEY: %w[A B D], X2: [true, false, nil])
    
    # =>
      KEY      X2
      <string> <boolean>
    0 A        true
    1 B        false
    2 D        (nil)

    without key (use implicit common key)

    df.anti_join(other)
    
    # =>
      KEY           X1
      <string> <uint8>
    0 C              3

    Parameters:

    • other (DataFrame, Arrow::Table)

      A DataFrame or a Table to be joined with self.

    • force_order (Boolean) (defaults to: true)

      whether to force the output order to always be the same.

      • This option is used in `:full_outer` and `:right_outer`.

      • If this option is true (the default), an index is appended to the source and the result is sorted by it after joining. This causes some performance degradation.

    • suffix (#succ) (defaults to: '.1')

      a suffix used to rename keys when key names conflict as a result of the join. `suffix` must respond to `#succ`.

    Returns:

    • (DataFrame)

  • #anti_join(other, join_keys, suffix: '.1', force_order: true) ⇒ DataFrame

    Returns joined dataframe.

    Examples:

    df = DataFrame.new(KEY: %w[A B C], X1: [1, 2, 3])
    
    # =>
      KEY           X1
      <string> <uint8>
    0 A              1
    1 B              2
    2 C              3
    
    other = DataFrame.new(KEY: %w[A B D], X2: [true, false, nil])
    
    # =>
      KEY      X2
      <string> <boolean>
    0 A        true
    1 B        false
    2 D        (nil)

    with a key

    df.anti_join(other, :KEY)
    
    # =>
      KEY           X1
      <string> <uint8>
    0 C              3

    Parameters:

    • other (DataFrame, Arrow::Table)

      A DataFrame or a Table to be joined with self.

    • join_keys (String, Symbol, Array<String, Symbol>)

      a key or keys to match.

    • force_order (Boolean) (defaults to: true)

      whether to force the output order to always be the same.

      • This option is used in `:full_outer` and `:right_outer`.

      • If this option is true (the default), an index is appended to the source and the result is sorted by it after joining. This causes some performance degradation.

    • suffix (#succ) (defaults to: '.1')

      a suffix used to rename keys when key names conflict as a result of the join. `suffix` must respond to `#succ`.

    Returns:

    • (DataFrame)

  • #anti_join(other, join_key_pairs, suffix: '.1', force_order: true) ⇒ DataFrame

    Returns joined dataframe.

    Examples:

    df2 = DataFrame.new(KEY1: %w[A B C], X1: [1, 2, 3])
    
    # =>
      KEY1          X1
      <string> <uint8>
    0 A              1
    1 B              2
    2 C              3
    
    other2 = DataFrame.new(KEY2: %w[A B D], X2: [true, false, nil])
    
    # =>
      KEY2     X2
      <string> <boolean>
    0 A        true
    1 B        false
    2 D        (nil)

    with key pairs

    df2.anti_join(other2, { left: :KEY1, right: :KEY2 })
    
    # =>
      KEY1          X1
      <string> <uint8>
    0 C              3

    Parameters:

    • other (DataFrame, Arrow::Table)

      A DataFrame or a Table to be joined with self.

    • join_key_pairs (Hash)

      pairs of a key name or key names to match in left and right.

    • force_order (Boolean) (defaults to: true)

      whether to force the output order to always be the same.

      • This option is used in `:full_outer` and `:right_outer`.

      • If this option is true (the default), an index is appended to the source and the result is sorted by it after joining. This causes some performance degradation.

    • suffix (#succ) (defaults to: '.1')

      a suffix used to rename keys when key names conflict as a result of the join. `suffix` must respond to `#succ`.

    Options Hash (join_key_pairs):

    • :left (String, Symbol, Array<String, Symbol>)

      join keys in `self`.

    • :right (String, Symbol, Array<String, Symbol>)

      join keys in `other`.

    Returns:

    • (DataFrame)

Since:

  • 0.2.3



# File 'lib/red_amber/data_frame_combinable.rb', line 620

def anti_join(other, join_keys = nil, suffix: '.1', force_order: true)
  join(other, join_keys, type: :left_anti, suffix: suffix, force_order: force_order)
end
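
The filtering behaviour described above can be sketched in plain Ruby, with rows held as Hashes. This is only an illustration of what a left anti join keeps, not RedAmber's implementation; `anti_join_rows` and its `left_key:` / `right_key:` parameters are hypothetical names.

```ruby
# Plain-Ruby sketch of a left anti join (not RedAmber's implementation).
# Rows are Hashes; `left_key` and `right_key` are hypothetical parameters.
def anti_join_rows(left, right, left_key:, right_key: left_key)
  right_values = right.map { |row| row[right_key] }.uniq
  # Keep only the left rows whose key value never appears on the right.
  left.reject { |row| right_values.include?(row[left_key]) }
end

left  = [{ KEY1: 'A', X1: 1 }, { KEY1: 'B', X1: 2 }, { KEY1: 'C', X1: 3 }]
right = [{ KEY2: 'A', X2: true }, { KEY2: 'B', X2: false }, { KEY2: 'D', X2: nil }]

# Only the row with KEY1 == 'C' remains, and no column of `right` is added.
p anti_join_rows(left, right, left_key: :KEY1, right_key: :KEY2)
```

Because only rows of the left side are returned, the result has the same columns as self, which is what makes this a filtering join rather than a mutating one.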

#concatenate(*other) ⇒ DataFrame Also known as: concat, bind_rows

Note:

The `#types` must be the same as `other#types`.

Concatenate other dataframes or tables onto the bottom of self.

Examples:

df    = DataFrame.new(x: [1, 2], y: ['A', 'B'])
other = DataFrame.new(x: [3, 4], y: ['C', 'D'])
[df.types, other.types]

# =>
[[:uint8, :string], [:uint8, :string]]

df.concatenate(other)

# =>
        x y
  <uint8> <string>
0       1 A
1       2 B
2       3 C
3       4 D

Parameters:

  • other (DataFrame, Arrow::Table, Array<DataFrame, Arrow::Table>)

    DataFrames or Tables to concatenate.

Returns:

  • (DataFrame)

Since:

  • 0.2.3



# File 'lib/red_amber/data_frame_combinable.rb', line 36

def concatenate(*other)
  case other
  in [] | [nil] | [[]]
    return self
  in [Array => array]
    # Nop
  else
    array = other
  end

  table_array = array.map do |e|
    case e
    when Arrow::Table
      e
    when DataFrame
      e.table
    else
      raise DataFrameArgumentError, "#{e} is not a Table or a DataFrame"
    end
  end

  DataFrame.create(table.concatenate(table_array))
end
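
The `case ... in` at the top of the listing above normalizes the splatted arguments, so that `concatenate(a, b)` and `concatenate([a, b])` are handled the same way. A minimal sketch of just that step follows; it assumes Ruby 3.0 or later for pattern matching, and `normalize_others` is a hypothetical helper name, not part of RedAmber.

```ruby
# Sketch of the argument normalization only; `normalize_others` is a
# hypothetical helper name, not part of RedAmber.
def normalize_others(*other)
  case other
  in [] | [nil] | [[]]
    []       # nothing to concatenate
  in [Array => array]
    array    # a single Array argument is used as the list itself
  else
    other    # otherwise the splatted arguments are the list
  end
end

normalize_others()           # => []
normalize_others(nil)        # => []
normalize_others([])         # => []
normalize_others(%i[a b])    # => [:a, :b]
normalize_others(:a, :b)     # => [:a, :b]
```

In the method itself the first branch returns self instead of an empty list, so calling it with no arguments, `nil`, or an empty Array leaves the DataFrame unchanged.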

#difference(other) ⇒ DataFrame Also known as: setdiff

Select records appearing in self but not in other.

  • Same as `#join` with `type: :left_anti` when the keys in self are the same as the keys in other.

  • A kind of set operation.

Examples:

df3 = DataFrame.new(
  KEY1: %w[A B C],
  KEY2: [1, 2, 3]
)

# =>
  KEY1        KEY2
  <string> <uint8>
0 A              1
1 B              2
2 C              3

other3 = DataFrame.new(
  KEY1: %w[A B D],
  KEY2: [1, 4, 5]
)

# =>
  KEY1        KEY2
  <string> <uint8>
0 A              1
1 B              4
2 D              5

df3.difference(other3)

# =>
  KEY1        KEY2
  <string> <uint8>
0 B              2
1 C              3

other3.difference(df3)

# =>
  KEY1        KEY2
  <string> <uint8>
0 B              4
1 D              5

Parameters:

  • other (DataFrame, Arrow::Table)

    A DataFrame or a Table to be joined with self.

Returns:

  • (DataFrame)

Since:

  • 0.2.3



# File 'lib/red_amber/data_frame_combinable.rb', line 724

def difference(other)
  unless keys == other.keys.map(&:to_sym)
    raise DataFrameArgumentError, 'keys are not same with self and other'
  end

  join(other, keys, type: :left_anti)
end
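
As the listing above shows, the method first requires both sides to have identical keys and then runs an anti join on all of them. The same idea in plain Ruby, with rows as Hashes, might look like the sketch below. `difference_rows` is a hypothetical helper and raises a plain `ArgumentError` in place of `DataFrameArgumentError`.

```ruby
# Plain-Ruby sketch of the behaviour described above, with rows as Hashes.
# `difference_rows` is a hypothetical helper, not RedAmber's implementation.
def difference_rows(left, right)
  left_keys = left.flat_map(&:keys).uniq
  right_keys = right.flat_map(&:keys).uniq
  raise ArgumentError, 'keys are not same with self and other' unless left_keys == right_keys

  # With identical keys, an anti join on all keys keeps rows absent from `right`.
  left.reject { |row| right.include?(row) }
end

df3    = [{ KEY1: 'A', KEY2: 1 }, { KEY1: 'B', KEY2: 2 }, { KEY1: 'C', KEY2: 3 }]
other3 = [{ KEY1: 'A', KEY2: 1 }, { KEY1: 'B', KEY2: 4 }, { KEY1: 'D', KEY2: 5 }]

difference_rows(df3, other3)  # rows B/2 and C/3 remain
difference_rows(other3, df3)  # rows B/4 and D/5 remain
```

The operation is not symmetric: swapping the receiver and the argument gives a different result.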

#full_join(other, suffix: '.1', force_order: true) ⇒ DataFrame #full_join(other, join_keys, suffix: '.1', force_order: true) ⇒ DataFrame #full_join(other, join_key_pairs, suffix: '.1', force_order: true) ⇒ DataFrame Also known as: outer_join

Note:

The order of joined results is preserved by default. This is done by appending an index column and sorting by it after joining, which causes some performance degradation. If the order of the result does not matter, set the `force_order` option to `false`.

Join another DataFrame or Table, leaving all records.

  • Same as `#join` with `type: :full_outer`.

  • A kind of mutating join.

Overloads:

  • #full_join(other, suffix: '.1', force_order: true) ⇒ DataFrame

    If `join_keys` is not specified, the common keys in self and other are used (natural keys). Returns joined dataframe.

    Examples:

    df = DataFrame.new(KEY: %w[A B C], X1: [1, 2, 3])
    
    # =>
      KEY           X1
      <string> <uint8>
    0 A              1
    1 B              2
    2 C              3
    
    other = DataFrame.new(KEY: %w[A B D], X2: [true, false, nil])
    
    # =>
      KEY      X2
      <string> <boolean>
    0 A        true
    1 B        false
    2 D        (nil)

    without key (use implicit common key)

    df.full_join(other)
    
    # =>
      KEY           X1 X2
      <string> <uint8> <boolean>
    0 A              1 true
    1 B              2 false
    2 C              3 (nil)
    3 D          (nil) (nil)

    Parameters:

    • other (DataFrame, Arrow::Table)

      A DataFrame or a Table to be joined with self.

    • force_order (Boolean) (defaults to: true)

      whether to force the output order to always be the same.

      • This option is used in `:full_outer` and `:right_outer`.

      • If this option is true (the default), an index is appended to the source and the result is sorted by it after joining. This causes some performance degradation.

    • suffix (#succ) (defaults to: '.1')

      a suffix used to rename keys when key names conflict as a result of the join. `suffix` must respond to `#succ`.

    Returns:

    • (DataFrame)

  • #full_join(other, join_keys, suffix: '.1', force_order: true) ⇒ DataFrame

    Returns joined dataframe.

    Examples:

    df = DataFrame.new(KEY: %w[A B C], X1: [1, 2, 3])
    
    # =>
      KEY           X1
      <string> <uint8>
    0 A              1
    1 B              2
    2 C              3
    
    other = DataFrame.new(KEY: %w[A B D], X2: [true, false, nil])
    
    # =>
      KEY      X2
      <string> <boolean>
    0 A        true
    1 B        false
    2 D        (nil)

    with a key

    df.full_join(other, :KEY)
    
    # =>
      KEY           X1 X2
      <string> <uint8> <boolean>
    0 A              1 true
    1 B              2 false
    2 C              3 (nil)
    3 D          (nil) (nil)

    Parameters:

    • other (DataFrame, Arrow::Table)

      A DataFrame or a Table to be joined with self.

    • join_keys (String, Symbol, Array<String, Symbol>)

      a key or keys to match.

    • force_order (Boolean) (defaults to: true)

      whether to force the output order to always be the same.

      • This option is used in `:full_outer` and `:right_outer`.

      • If this option is true (the default), an index is appended to the source and the result is sorted by it after joining. This causes some performance degradation.

    • suffix (#succ) (defaults to: '.1')

      a suffix used to rename keys when key names conflict as a result of the join. `suffix` must respond to `#succ`.

    Returns:

    • (DataFrame)

  • #full_join(other, join_key_pairs, suffix: '.1', force_order: true) ⇒ DataFrame

    Returns joined dataframe.

    Examples:

    df2 = DataFrame.new(KEY1: %w[A B C], X1: [1, 2, 3])
    
    # =>
      KEY1          X1
      <string> <uint8>
    0 A              1
    1 B              2
    2 C              3
    
    other2 = DataFrame.new(KEY2: %w[A B D], X2: [true, false, nil])
    
    # =>
      KEY2     X2
      <string> <boolean>
    0 A        true
    1 B        false
    2 D        (nil)

    with key pairs

    df2.full_join(other2, { left: :KEY1, right: :KEY2 })
    
    # =>
      KEY1          X1 X2
      <string> <uint8> <boolean>
    0 A              1 true
    1 B              2 false
    2 C              3 (nil)
    3 D          (nil) (nil)

    Parameters:

    • other (DataFrame, Arrow::Table)

      A DataFrame or a Table to be joined with self.

    • join_key_pairs (Hash)

      pairs of a key name or key names to match in left and right.

    • force_order (Boolean) (defaults to: true)

      whether to force the output order to always be the same.

      • This option is used in `:full_outer` and `:right_outer`.

      • If this option is true (the default), an index is appended to the source and the result is sorted by it after joining. This causes some performance degradation.

    • suffix (#succ) (defaults to: '.1')

      a suffix used to rename keys when key names conflict as a result of the join. `suffix` must respond to `#succ`.

    Options Hash (join_key_pairs):

    • :left (String, Symbol, Array<String, Symbol>)

      join keys in `self`.

    • :right (String, Symbol, Array<String, Symbol>)

      join keys in `other`.

    Returns:

    • (DataFrame)

Since:

  • 0.2.3



# File 'lib/red_amber/data_frame_combinable.rb', line 350

def full_join(other, join_keys = nil, suffix: '.1', force_order: true)
  join(other, join_keys,
       type: :full_outer, suffix: suffix, force_order: force_order)
end

#inner_join(other, suffix: '.1', force_order: true) ⇒ DataFrame #inner_join(other, join_keys, suffix: '.1', force_order: true) ⇒ DataFrame #inner_join(other, join_key_pairs, suffix: '.1', force_order: true) ⇒ DataFrame

Note:

The order of joined results is preserved by default. This is done by appending an index column and sorting by it after joining, which causes some performance degradation. If the order of the result does not matter, set the `force_order` option to `false`.

Join another DataFrame or Table, leaving only the matching records.

  • Same as `#join` with `type: :inner`.

  • A kind of mutating join.

Overloads:

  • #inner_join(other, suffix: '.1', force_order: true) ⇒ DataFrame

    If `join_keys` is not specified, the common keys in self and other are used (natural keys). Returns joined dataframe.

    Examples:

    df = DataFrame.new(KEY: %w[A B C], X1: [1, 2, 3])
    
    # =>
      KEY           X1
      <string> <uint8>
    0 A              1
    1 B              2
    2 C              3
    
    other = DataFrame.new(KEY: %w[A B D], X2: [true, false, nil])
    
    # =>
      KEY      X2
      <string> <boolean>
    0 A        true
    1 B        false
    2 D        (nil)

    without key (use implicit common key)

    df.inner_join(other)
    
    # =>
      KEY           X1 X2
      <string> <uint8> <boolean>
    0 A              1 true
    1 B              2 false

    Parameters:

    • other (DataFrame, Arrow::Table)

      A DataFrame or a Table to be joined with self.

    • force_order (Boolean) (defaults to: true)

      whether to force the output order to always be the same.

      • This option is used in `:full_outer` and `:right_outer`.

      • If this option is true (the default), an index is appended to the source and the result is sorted by it after joining. This causes some performance degradation.

    • suffix (#succ) (defaults to: '.1')

      a suffix used to rename keys when key names conflict as a result of the join. `suffix` must respond to `#succ`.

    Returns:

    • (DataFrame)

  • #inner_join(other, join_keys, suffix: '.1', force_order: true) ⇒ DataFrame

    Returns joined dataframe.

    Examples:

    df = DataFrame.new(KEY: %w[A B C], X1: [1, 2, 3])
    
    # =>
      KEY           X1
      <string> <uint8>
    0 A              1
    1 B              2
    2 C              3
    
    other = DataFrame.new(KEY: %w[A B D], X2: [true, false, nil])
    
    # =>
      KEY      X2
      <string> <boolean>
    0 A        true
    1 B        false
    2 D        (nil)

    with a key

    df.inner_join(other, :KEY)
    
    # =>
      KEY           X1 X2
      <string> <uint8> <boolean>
    0 A              1 true
    1 B              2 false

    Parameters:

    • other (DataFrame, Arrow::Table)

      A DataFrame or a Table to be joined with self.

    • join_keys (String, Symbol, Array<String, Symbol>)

      a key or keys to match.

    • force_order (Boolean) (defaults to: true)

      whether to force the output order to always be the same.

      • This option is used in `:full_outer` and `:right_outer`.

      • If this option is true (the default), an index is appended to the source and the result is sorted by it after joining. This causes some performance degradation.

    • suffix (#succ) (defaults to: '.1')

      a suffix used to rename keys when key names conflict as a result of the join. `suffix` must respond to `#succ`.

    Returns:

    • (DataFrame)

  • #inner_join(other, join_key_pairs, suffix: '.1', force_order: true) ⇒ DataFrame

    Returns joined dataframe.

    Examples:

    df2 = DataFrame.new(KEY1: %w[A B C], X1: [1, 2, 3])
    
    # =>
      KEY1          X1
      <string> <uint8>
    0 A              1
    1 B              2
    2 C              3
    
    other2 = DataFrame.new(KEY2: %w[A B D], X2: [true, false, nil])
    
    # =>
      KEY2     X2
      <string> <boolean>
    0 A        true
    1 B        false
    2 D        (nil)

    with key pairs

    df2.inner_join(other2, { left: :KEY1, right: :KEY2 })
    
    # =>
      KEY1          X1 X2
      <string> <uint8> <boolean>
    0 A              1 true
    1 B              2 false

    Parameters:

    • other (DataFrame, Arrow::Table)

      A DataFrame or a Table to be joined with self.

    • join_key_pairs (Hash)

      pairs of a key name or key names to match in left and right.

    • force_order (Boolean) (defaults to: true)

      whether to force the output order to always be the same.

      • This option is used in `:full_outer` and `:right_outer`.

      • If this option is true (the default), an index is appended to the source and the result is sorted by it after joining. This causes some performance degradation.

    • suffix (#succ) (defaults to: '.1')

      a suffix used to rename keys when key names conflict as a result of the join. `suffix` must respond to `#succ`.

    Options Hash (join_key_pairs):

    • :left (String, Symbol, Array<String, Symbol>)

      join keys in `self`.

    • :right (String, Symbol, Array<String, Symbol>)

      join keys in `other`.

    Returns:

    • (DataFrame)

Since:

  • 0.2.3



# File 'lib/red_amber/data_frame_combinable.rb', line 280

def inner_join(other, join_keys = nil, suffix: '.1', force_order: true)
  join(other, join_keys, type: :inner, suffix: suffix, force_order: force_order)
end
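
The `suffix` parameter above is required to respond to `#succ`. One plausible reason, stated here as an assumption rather than as RedAmber's documented behaviour, is that a further candidate suffix can be generated when the first renamed key still collides with an existing one. The sketch below illustrates that idea with a hypothetical helper, `unique_name`; it relies only on `String#succ` from the Ruby core library.

```ruby
# Hypothetical helper (an assumption, not RedAmber's code): find a column
# name that is not taken yet by appending `suffix` and advancing it with
# #succ on each collision.
def unique_name(name, taken, suffix: '.1')
  candidate = "#{name}#{suffix}"
  while taken.include?(candidate)
    suffix = suffix.succ             # '.1' -> '.2', '.a' -> '.b'
    candidate = "#{name}#{suffix}"
  end
  candidate
end

unique_name('KEY2', %w[KEY1 KEY2])                # => "KEY2.1"
unique_name('KEY2', %w[KEY1 KEY2 KEY2.1])         # => "KEY2.2"
unique_name('KEY2', %w[KEY1 KEY2], suffix: '.a')  # => "KEY2.a"
```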

#intersect(other) ⇒ DataFrame

Select records appearing in both self and other.

  • Same as `#join` with `type: :inner` when the keys in self are the same as the keys in other.

  • A kind of set operation.

Examples:

df3 = DataFrame.new(
  KEY1: %w[A B C],
  KEY2: [1, 2, 3]
)

# =>
  KEY1        KEY2
  <string> <uint8>
0 A              1
1 B              2
2 C              3

other3 = DataFrame.new(
  KEY1: %w[A B D],
  KEY2: [1, 4, 5]
)

# =>
  KEY1        KEY2
  <string> <uint8>
0 A              1
1 B              4
2 D              5

df3.intersect(other3)

# =>
  KEY1        KEY2
  <string> <uint8>
0 A              1

Parameters:

  • other (DataFrame, Arrow::Table)

    A DataFrame or a Table to be joined with self.

Returns:

  • (DataFrame)

Since:

  • 0.2.3



# File 'lib/red_amber/data_frame_combinable.rb', line 659

def intersect(other)
  unless keys == other.keys.map(&:to_sym)
    raise DataFrameArgumentError, 'keys are not same with self and other'
  end

  join(other, keys, type: :inner)
end

#join(other, type: :inner, suffix: '.1', force_order: false) ⇒ DataFrame #join(other, join_keys, type: :inner, suffix: '.1', force_order: false) ⇒ DataFrame #join(other, join_key_pairs, type: :inner, suffix: '.1', force_order: false) ⇒ DataFrame

Note:

The order of joined results may not be preserved by default. If you prefer to preserve the order of the result, set the `force_order` option to `true`. This is done by appending an index column and sorting by it after joining, so it causes some performance degradation.

Overloads:

  • #join(other, type: :inner, suffix: '.1', force_order: false) ⇒ DataFrame

    If `join_keys` is not specified, the common keys in self and other are used (natural keys). Returns joined dataframe.

    Examples:

    df = DataFrame.new(KEY: %w[A B C], X1: [1, 2, 3])
    
    # =>
      KEY           X1
      <string> <uint8>
    0 A              1
    1 B              2
    2 C              3
    
    other = DataFrame.new(KEY: %w[A B D], X2: [true, false, nil])
    
    # =>
      KEY      X2
      <string> <boolean>
    0 A        true
    1 B        false
    2 D        (nil)

    df.join(other)
    
    # =>
      KEY           X1 X2
      <string> <uint8> <boolean>
    0 A              1 true
    1 B              2 false
    
    df.join(other, type: :full_outer)
    
    # =>
      KEY           X1 X2
      <string> <uint8> <boolean>
    0 A              1 true
    1 B              2 false
    2 C              3 (nil)
    3 D          (nil) (nil)

    Parameters:

    • other (DataFrame, Arrow::Table)

      A DataFrame or a Table to be joined with self.

    • type (:left_semi, :right_semi, :left_anti, :right_anti, :inner, :left_outer, :right_outer, :full_outer) (defaults to: :inner)

      type of join.

    • force_order (Boolean) (defaults to: false)

      whether to force the output order to always be the same.

      • This option is used in `:full_outer` and `:right_outer`.

      • If this option is true, an index is appended to the source and the result is sorted by it after joining. This causes some performance degradation.

    • suffix (#succ) (defaults to: '.1')

      a suffix used to rename keys when key names conflict as a result of the join. `suffix` must respond to `#succ`.

    Returns:

    • (DataFrame)

  • #join(other, join_keys, type: :inner, suffix: '.1', force_order: false) ⇒ DataFrame

    Returns joined dataframe.

    Examples:

    df3 = DataFrame.new(
      KEY1: %w[A B C],
      KEY2: [1, 2, 3]
    )
    
    # =>
      KEY1        KEY2
      <string> <uint8>
    0 A              1
    1 B              2
    2 C              3
    
    other3 = DataFrame.new(
      KEY1: %w[A B D],
      KEY2: [1, 4, 5]
    )
    
    # =>
      KEY1        KEY2
      <string> <uint8>
    0 A              1
    1 B              4
    2 D              5

    join keys in an Array

    df3.join(other3, [:KEY1, :KEY2])
    
    # =>
      KEY1        KEY2
      <string> <uint8>
    0 A              1

    partial join key and suffix

    df3.join(other3, :KEY1, suffix: '.a')
    
    # =>
      KEY1        KEY2  KEY2.a
      <string> <uint8> <uint8>
    0 A              1       1
    1 B              2       4

    Parameters:

    • other (DataFrame, Arrow::Table)

      A DataFrame or a Table to be joined with self.

    • join_keys (String, Symbol, Array<String, Symbol>)

      a key or keys to match.

    • type (:left_semi, :right_semi, :left_anti, :right_anti, :inner, :left_outer, :right_outer, :full_outer) (defaults to: :inner)

      type of join.

    • force_order (Boolean) (defaults to: false)

      whether to force the output order to always be the same.

      • This option is used in `:full_outer` and `:right_outer`.

      • If this option is true, an index is appended to the source and the result is sorted by it after joining. This causes some performance degradation.

    • suffix (#succ) (defaults to: '.1')

      a suffix used to rename keys when key names conflict as a result of the join. `suffix` must respond to `#succ`.

    Returns:

    • (DataFrame)

  • #join(other, join_key_pairs, type: :inner, suffix: '.1', force_order: false) ⇒ DataFrame

    Returns joined dataframe.

    Examples:

    df4 = DataFrame.new(
      X1: %w[A B C],
      Y: %w[D E F]
    )
    
    # =>
      X1       Y
      <string> <string>
    0 A        D
    1 B        E
    2 C        F
    
    other4 = DataFrame.new(
      X2: %w[A B D],
      Y:  %w[e E E]
    )
    
    # =>
      X2       Y
      <string> <string>
    0 A        e
    1 B        E
    2 D        E

    without options

    df4.join(other4)
    
    # =>
      X1       Y        X2
      <string> <string> <string>
    0 B        E        D
    1 B        E        B

    join by key pairs

    df4.join(other4, { left: [:X1, :Y], right: [:X2, :Y] })
    
    # =>
      X1       Y
      <string> <string>
    0 B        E

    join by key pairs, using renaming by suffix

    df4.join(other4, { left: :X1, right: :X2 })
    
    # =>
      X1       Y        Y.1
      <string> <string> <string>
    0 A        D        e
    1 B        E        E

    Parameters:

    • other (DataFrame, Arrow::Table)

      A DataFrame or a Table to be joined with self.

    • join_key_pairs (Hash)

      pairs of a key name or key names to match in left and right.

    • type (:left_semi, :right_semi, :left_anti, :right_anti, :inner, :left_outer, :right_outer, :full_outer)

      type of join.

    • force_order (Boolean)

      whether to force the output order to always be the same.

      • This option is used in `:full_outer` and `:right_outer`.

      • If this option is true, an index is appended to the source and the result is sorted by it after joining. This causes some performance degradation.

    • suffix (#succ)

      a suffix used to rename keys when key names conflict as a result of the join. `suffix` must respond to `#succ`.

    Returns:

    • (DataFrame)

Since:

  • 0.2.3



# File 'lib/red_amber/data_frame_combinable.rb', line 862

def join(other, join_keys = nil, type: :inner, suffix: '.1', force_order: false)
  left_table = table
  right_table =
    case other
    when DataFrame
      other.table
    when Arrow::Table
      other
    else
      raise DataFrameArgumentError, 'other must be a DataFrame or an Arrow::Table'
    end

  if force_order
    left_index = :__LEFT_INDEX__
    right_index = :__RIGHT_INDEX__
    left_table = assign(left_index) { indices }.table
    other = DataFrame.create(other) if other.is_a?(Arrow::Table)
    right_table = other.assign(right_index) { indices }.table
  end

  left_table_keys = ensure_keys(left_table.keys)
  right_table_keys = ensure_keys(right_table.keys)
  # natural keys (implicit common keys)
  join_keys ||= left_table_keys.intersection(right_table_keys)

  type = Arrow::JoinType.try_convert(type) || type
  type_nick = type.nick

  plan = Arrow::ExecutePlan.new
  left_node = plan.build_source_node(left_table)
  right_node = plan.build_source_node(right_table)

  if join_keys.is_a?(Hash)
    left_keys = ensure_keys(join_keys[:left])
    right_keys = ensure_keys(join_keys[:right])
  else
    left_keys = ensure_keys(join_keys)
    right_keys = left_keys
  end

  context =
    [type_nick, left_table_keys, right_table_keys, left_keys, right_keys, suffix]

  hash_join_node_options = Arrow::HashJoinNodeOptions.new(type, left_keys, right_keys)
  case type_nick
  when 'inner', 'left-outer'
    hash_join_node_options.left_outputs = left_table_keys
    hash_join_node_options.right_outputs = right_table_keys - right_keys
  when 'right-outer'
    hash_join_node_options.left_outputs = left_table_keys - left_keys
    hash_join_node_options.right_outputs = right_table_keys
  end

  hash_join_node =
    plan.build_hash_join_node(left_node, right_node, hash_join_node_options)
  merge_node = merge_keys(plan, hash_join_node, context)
  rename_node = rename_keys(plan, merge_node, context)
  joined_table = sink_and_start_plan(plan, rename_node)

  df = DataFrame.create(joined_table)
  if force_order
    sorter =
      case type_nick
      when 'right-semi', 'right-anti'
        [right_index]
      when 'left-semi', 'left-anti'
        [left_index]
      else
        [left_index, right_index]
      end
    df.sort(sorter)
      .drop(sorter)
  else
    df
  end
end
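
The `force_order` branch in the listing above follows an "append index, join, sort, drop" pattern. The sketch below reproduces that pattern in plain Ruby on Arrays of Hashes. It is not RedAmber's implementation: the hash join is simulated by a hypothetical `unordered_inner_join`, which deliberately returns rows in reverse order to stand in for an engine that gives no ordering guarantee.

```ruby
# Stand-in for a join engine without an ordering guarantee (hypothetical).
def unordered_inner_join(left, right, key)
  left.flat_map { |l|
    right.select { |r| r[key] == l[key] }.map { |r| l.merge(r) }
  }.reverse
end

def ordered_inner_join(left, right, key)
  index_columns = %i[__LEFT_INDEX__ __RIGHT_INDEX__]
  # 1. Append an index column to each side (names mirror the listing above).
  left_indexed  = left.each_with_index.map  { |row, i| row.merge(__LEFT_INDEX__: i) }
  right_indexed = right.each_with_index.map { |row, i| row.merge(__RIGHT_INDEX__: i) }
  # 2. Join, 3. sort by the indices, 4. drop the helper columns.
  unordered_inner_join(left_indexed, right_indexed, key)
    .sort_by { |row| row.values_at(*index_columns) }
    .map { |row| row.reject { |k, _| index_columns.include?(k) } }
end

left  = [{ KEY: 'A', X1: 1 }, { KEY: 'B', X1: 2 }]
right = [{ KEY: 'B', X2: false }, { KEY: 'A', X2: true }]

unordered_inner_join(left, right, :KEY).map { |row| row[:KEY] }  # => ["B", "A"]
ordered_inner_join(left, right, :KEY).map { |row| row[:KEY] }    # => ["A", "B"]
```

The extra column, the sort, and the final drop are the source of the performance cost mentioned in the note for this method.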

#left_join(other, suffix: '.1', force_order: true) ⇒ DataFrame #left_join(other, join_keys, suffix: '.1', force_order: true) ⇒ DataFrame #left_join(other, join_key_pairs, suffix: '.1', force_order: true) ⇒ DataFrame

Note:

The order of joined results is preserved by default. This is done by appending an index column and sorting by it after joining, which causes some performance degradation. If the order of the result does not matter, set the `force_order` option to `false`.

Join matching values to self from other.

  • Same as `#join` with `type: :left_outer`.

  • A kind of mutating join.

Overloads:

  • #left_join(other, suffix: '.1', force_order: true) ⇒ DataFrame

    If `join_keys` is not specified, the common keys in self and other are used (natural keys). Returns joined dataframe.

    Examples:

    df = DataFrame.new(KEY: %w[A B C], X1: [1, 2, 3])
    
    # =>
      KEY           X1
      <string> <uint8>
    0 A              1
    1 B              2
    2 C              3
    
    other = DataFrame.new(KEY: %w[A B D], X2: [true, false, nil])
    
    # =>
      KEY      X2
      <string> <boolean>
    0 A        true
    1 B        false
    2 D        (nil)

    without key (use implicit common key)

    df.left_join(other)
    
    # =>
      KEY           X1 X2
      <string> <uint8> <boolean>
    0 A              1 true
    1 B              2 false
    2 C              3 (nil)

    Parameters:

    • other (DataFrame, Arrow::Table)

      A DataFrame or a Table to be joined with self.

    • force_order (Boolean) (defaults to: true)

      whether to force the output order to always be the same.

      • This option is used in `:full_outer` and `:right_outer`.

      • If this option is true (the default), an index is appended to the source and the result is sorted by it after joining. This causes some performance degradation.

    • suffix (#succ) (defaults to: '.1')

      a suffix used to rename keys when key names conflict as a result of the join. `suffix` must respond to `#succ`.

    Returns:

    • (DataFrame)

  • #left_join(other, join_keys, suffix: '.1', force_order: true) ⇒ DataFrame

    Returns joined dataframe.

    Examples:

    df = DataFrame.new(KEY: %w[A B C], X1: [1, 2, 3])
    
    # =>
      KEY           X1
      <string> <uint8>
    0 A              1
    1 B              2
    2 C              3
    
    other = DataFrame.new(KEY: %w[A B D], X2: [true, false, nil])
    
    # =>
      KEY      X2
      <string> <boolean>
    0 A        true
    1 B        false
    2 D        (nil)

    with a key

    df.left_join(other, :KEY)
    
    # =>
      KEY           X1 X2
      <string> <uint8> <boolean>
    0 A              1 true
    1 B              2 false
    2 C              3 (nil)

    Parameters:

    • other (DataFrame, Arrow::Table)

      A DataFrame or a Table to be joined with self.

    • join_keys (String, Symbol, Array<String, Symbol>)

      a key or keys to match.

    • force_order (Boolean) (defaults to: true)

      whether to force the output order to always be the same.

      • This option is used in `:full_outer` and `:right_outer`.

      • If this option is true (the default), an index is appended to the source and the result is sorted by it after joining. This causes some performance degradation.

    • suffix (#succ) (defaults to: '.1')

      a suffix used to rename keys when key names conflict as a result of the join. `suffix` must respond to `#succ`.

    Returns:

    • (DataFrame)

  • #left_join(other, join_key_pairs, suffix: '.1', force_order: true) ⇒ DataFrame

    Returns joined dataframe.

    Examples:

    df2 = DataFrame.new(KEY1: %w[A B C], X1: [1, 2, 3])
    
    # =>
      KEY1          X1
      <string> <uint8>
    0 A              1
    1 B              2
    2 C              3
    
    other2 = DataFrame.new(KEY2: %w[A B D], X2: [true, false, nil])
    
    # =>
      KEY2     X2
      <string> <boolean>
    0 A        true
    1 B        false
    2 D        (nil)

    with key pairs

    df2.left_join(other2, { left: :KEY1, right: :KEY2 })
    
    # =>
      KEY1          X1 X2
      <string> <uint8> <boolean>
    0 A              1 true
    1 B              2 false
    2 C              3 (nil)

    Parameters:

    • other (DataFrame, Arrow::Table)

      A DataFrame or a Table to be joined with self.

    • join_key_pairs (Hash)

      pairs of a key name or key names to match in left and right.

    • force_order (Boolean) (defaults to: true)

      whether to force the output order to always be the same.

      • This option is used in `:full_outer` and `:right_outer`.

      • If this option is true (the default), an index is appended to the source and the result is sorted after joining. This causes some performance degradation.

    • suffix (#succ) (defaults to: '.1')

      a suffix to rename keys when key names conflict as a result of the join. `suffix` must respond to `#succ`.

    Options Hash (join_key_pairs):

    • :left (String, Symbol, Array<String, Symbol>)

      join keys in `self`.

    • :right (String, Symbol, Array<String, Symbol>)

      join keys in `other`.

    Returns:

    • (DataFrame)

      Joined dataframe.

Since:

  • 0.2.3



# File 'lib/red_amber/data_frame_combinable.rb', line 420

def left_join(other, join_keys = nil, suffix: '.1', force_order: true)
  join(other, join_keys, type: :left_outer, suffix: suffix, force_order: force_order)
end
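
As a rough illustration of the left outer join semantics shown in the examples above, the following plain-Ruby sketch models rows as hashes. `left_outer_rows` is a hypothetical helper written for this page; it is not RedAmber's implementation, which delegates to `#join` as the source above shows.

```ruby
# Plain-Ruby model of a left outer join on arrays of row hashes.
# `left_outer_rows` is a hypothetical helper, not part of RedAmber.
def left_outer_rows(left, right, key)
  # Columns contributed by the right side, excluding the join key.
  right_columns = right.flat_map(&:keys).uniq - [key]
  # Group right rows by key so each left row can look up its matches.
  lookup = right.group_by { |row| row[key] }

  left.flat_map do |row|
    matches = lookup.fetch(row[key], [])
    if matches.empty?
      # No match: keep the left row and fill the right columns with nil.
      [row.merge(right_columns.to_h { |column| [column, nil] })]
    else
      matches.map { |match| row.merge(match) }
    end
  end
end

df_rows    = [{ KEY: 'A', X1: 1 }, { KEY: 'B', X1: 2 }, { KEY: 'C', X1: 3 }]
other_rows = [{ KEY: 'A', X2: true }, { KEY: 'B', X2: false }, { KEY: 'D', X2: nil }]

joined = left_outer_rows(df_rows, other_rows, :KEY)
# Every row of df_rows survives; the 'C' row gets X2: nil and the 'D' row is dropped.
```

Every row of the left side is kept, and rows that exist only on the right side never appear, which mirrors the `(nil)` cell in the example output.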

#merge(*other) ⇒ DataFrame Also known as: bind_cols

Note:

`#size` must be the same as `other#size`.

Note:

self and other must not share the same key.

Merge other DataFrames or Tables.

Examples:

df    = DataFrame.new(x: [1, 2], y: [3, 4])
other = DataFrame.new(a: ['A', 'B'], b: ['C', 'D'])
df.merge(other)

# =>
        x       y a        b
  <uint8> <uint8> <string> <string>
0       1       3 A        C
1       2       4 B        D

Parameters:

  • other (DataFrame, Arrow::Table, Array<DataFrame, Arrow::Table>)

    DataFrames or Tables to merge.

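Returns:

  • (DataFrame)

Raises:

  • (DataFrameArgumentError)

    if an argument is not a DataFrame or a Table, if its size differs from self, or if it shares a key with self.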

Since:

  • 0.2.3



# File 'lib/red_amber/data_frame_combinable.rb', line 86

def merge(*other)
  case other
  in [] | [nil] | [[]]
    return self
  in [Array => array]
    # Nop
  else
    array = other
  end

  hash = array.each_with_object({}) do |e, h|
    df =
      case e
      when Arrow::Table
        DataFrame.create(e)
      when DataFrame
        e
      else
        raise DataFrameArgumentError, "#{e} is not a Table or a DataFrame"
      end

    if size != df.size
      raise DataFrameArgumentError, "#{e} do not have same size as self"
    end

    k = keys.intersection(df.keys).any?
    raise DataFrameArgumentError, "There are some shared keys: #{k}" if k

    h.merge!(df.to_h)
  end

  assign(hash)
end
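
The argument handling and the two precondition checks in `#merge` can be tried out without Arrow. The sketch below is a simplified model under stated assumptions: a "dataframe" is a plain Hash of column name to Array, `merge_columns` is a hypothetical name, and `ArgumentError` stands in for `DataFrameArgumentError`. Unlike the source above, it checks shared keys against the accumulated result rather than against self only.

```ruby
# Hypothetical stand-in for DataFrame#merge: columns are a Hash of name => Array.
def merge_columns(base, *others)
  # Accept both merge_columns(base, a, b) and merge_columns(base, [a, b]).
  list =
    case others
    in [] | [nil] | [[]] then return base
    in [Array => array] then array
    else others
    end

  # Assumes base has at least one column.
  size = base.values.first.size
  list.reduce(base) do |acc, other|
    # Every merged column must have as many rows as the base.
    unless other.values.all? { |values| values.size == size }
      raise ArgumentError, 'column sizes differ'
    end

    # Column names must not collide.
    shared = acc.keys & other.keys
    raise ArgumentError, "shared keys: #{shared}" unless shared.empty?

    acc.merge(other)
  end
end

merged = merge_columns({ x: [1, 2], y: [3, 4] }, { a: %w[A B], b: %w[C D] })
# merged now has the four columns x, y, a and b, in that order.
```

The `case ... in` pattern match is what allows the same method to take either a splat of tables or a single array of tables.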

#right_join(other, suffix: '.1', force_order: true) ⇒ DataFrame #right_join(other, join_keys, suffix: '.1', force_order: true) ⇒ DataFrame #right_join(other, join_key_pairs, suffix: '.1', force_order: true) ⇒ DataFrame

Note:

The order of the joined results is preserved by default. This is done by appending an index column and sorting after joining, which causes some performance degradation. If the order of the result does not matter, set the `force_order` option to `false`.

Join matching values from self to other.

  • Same as `#join` with `type: :right_outer`

  • A kind of mutating join.

Overloads:

  • #right_join(other, suffix: '.1', force_order: true) ⇒ DataFrame

    If `join_keys` is not specified, common keys in self and other are used (natural keys). Returns joined dataframe.

    Examples:

    df = DataFrame.new(KEY: %w[A B C], X1: [1, 2, 3])
    
    # =>
      KEY           X1
      <string> <uint8>
    0 A              1
    1 B              2
    2 C              3
    
    other = DataFrame.new(KEY: %w[A B D], X2: [true, false, nil])
    
    # =>
      KEY      X2
      <string> <boolean>
    0 A        true
    1 B        false
    2 D        (nil)

    without key (use implicit common key)

    df.right_join(other)
    
    # =>
           X1 KEY      X2
      <uint8> <string> <boolean>
    0       1 A        true
    1       2 B        false
    2   (nil) D        (nil)

    Parameters:

    • other (DataFrame, Arrow::Table)

      A DataFrame or a Table to be joined with self.

    • force_order (Boolean) (defaults to: true)

      whether to force the output order to always be the same.

      • This option is used in `:full_outer` and `:right_outer`.

      • If this option is true (the default), an index is appended to the source and the result is sorted after joining. This causes some performance degradation.

    • suffix (#succ) (defaults to: '.1')

      a suffix to rename keys when key names conflict as a result of the join. `suffix` must respond to `#succ`.

    Returns:

    • (DataFrame)

      Joined dataframe.

  • #right_join(other, join_keys, suffix: '.1', force_order: true) ⇒ DataFrame

    Returns joined dataframe.

    Examples:

    df = DataFrame.new(KEY: %w[A B C], X1: [1, 2, 3])
    
    # =>
      KEY           X1
      <string> <uint8>
    0 A              1
    1 B              2
    2 C              3
    
    other = DataFrame.new(KEY: %w[A B D], X2: [true, false, nil])
    
    # =>
      KEY      X2
      <string> <boolean>
    0 A        true
    1 B        false
    2 D        (nil)

    with a key

    df.right_join(other, :KEY)
    
    # =>
           X1 KEY      X2
      <uint8> <string> <boolean>
    0       1 A        true
    1       2 B        false
    2   (nil) D        (nil)

    Parameters:

    • other (DataFrame, Arrow::Table)

      A DataFrame or a Table to be joined with self.

    • join_keys (String, Symbol, Array<String, Symbol>)

      a key or keys to match.

    • force_order (Boolean) (defaults to: true)

      whether to force the output order to always be the same.

      • This option is used in `:full_outer` and `:right_outer`.

      • If this option is true (the default), an index is appended to the source and the result is sorted after joining. This causes some performance degradation.

    • suffix (#succ) (defaults to: '.1')

      a suffix to rename keys when key names conflict as a result of the join. `suffix` must respond to `#succ`.

    Returns:

    • (DataFrame)

      Joined dataframe.

  • #right_join(other, join_key_pairs, suffix: '.1', force_order: true) ⇒ DataFrame

    Returns joined dataframe.

    Examples:

    df2 = DataFrame.new(KEY1: %w[A B C], X1: [1, 2, 3])
    
    # =>
      KEY1          X1
      <string> <uint8>
    0 A              1
    1 B              2
    2 C              3
    
    other2 = DataFrame.new(KEY2: %w[A B D], X2: [true, false, nil])
    
    # =>
      KEY2     X2
      <string> <boolean>
    0 A        true
    1 B        false
    2 D        (nil)

    with key pairs

    df2.right_join(other2, { left: :KEY1, right: :KEY2 })
    
    # =>
           X1 KEY2     X2
      <uint8> <string> <boolean>
    0       1 A        true
    1       2 B        false
    2   (nil) D        (nil)

    Parameters:

    • other (DataFrame, Arrow::Table)

      A DataFrame or a Table to be joined with self.

    • join_key_pairs (Hash)

      pairs of a key name or key names to match in left and right.

    • force_order (Boolean) (defaults to: true)

      whether to force the output order to always be the same.

      • This option is used in `:full_outer` and `:right_outer`.

      • If this option is true (the default), an index is appended to the source and the result is sorted after joining. This causes some performance degradation.

    • suffix (#succ) (defaults to: '.1')

      a suffix to rename keys when key names conflict as a result of the join. `suffix` must respond to `#succ`.

    Options Hash (join_key_pairs):

    • :left (String, Symbol, Array<String, Symbol>)

      join keys in `self`.

    • :right (String, Symbol, Array<String, Symbol>)

      join keys in `other`.

    Returns:

    • (DataFrame)

      Joined dataframe.

Since:

  • 0.2.3



# File 'lib/red_amber/data_frame_combinable.rb', line 487

def right_join(other, join_keys = nil, suffix: '.1', force_order: true)
  join(
    other,
    join_keys,
    type: :right_outer,
    suffix: suffix,
    force_order: force_order
  )
end
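
The `force_order` note for this method describes a general technique: tag each row with its position, run an operation that may reorder rows, sort by the tag, then drop it. The sketch below shows that idea in plain Ruby. The helper name `with_forced_order` and the column name `:__index__` are assumptions made for illustration; the index column RedAmber actually appends and how it sorts are not shown on this page.

```ruby
# Assumed name for the temporary index column; not taken from RedAmber.
INDEX_KEY = :__index__

# Hypothetical helper: restores the original row order after a block
# that may return the rows in any order (for example, a hash join).
def with_forced_order(rows)
  # 1. Append a running index to every row.
  indexed = rows.each_with_index.map { |row, i| row.merge(INDEX_KEY => i) }
  # 2. Run the order-destroying operation.
  result = yield(indexed)
  # 3. Sort by the index, then remove the helper column again.
  result
    .sort_by { |row| row[INDEX_KEY] }
    .map { |row| row.reject { |key, _| key == INDEX_KEY } }
end

rows = [{ KEY: 'A' }, { KEY: 'B' }, { KEY: 'D' }]
# `reverse` stands in for a join that scrambles the rows.
restored = with_forced_order(rows) { |indexed| indexed.reverse }
restored == rows # => true
```

The extra column and the sort are the source of the performance cost mentioned in the note, which is why `force_order: false` is offered when the order does not matter.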

#semi_join(other, suffix: '.1', force_order: true) ⇒ DataFrame #semi_join(other, join_keys, suffix: '.1', force_order: true) ⇒ DataFrame #semi_join(other, join_key_pairs, suffix: '.1', force_order: true) ⇒ DataFrame

Note:

The order of the joined results is preserved by default. This is done by appending an index column and sorting after joining, which causes some performance degradation. If the order of the result does not matter, set the `force_order` option to `false`.

Return records of self that have a match in other.

  • Same as `#join` with `type: :left_semi`

  • A kind of filtering join.

Overloads:

  • #semi_join(other, suffix: '.1', force_order: true) ⇒ DataFrame

    If `join_keys` is not specified, common keys in self and other are used (natural keys). Returns joined dataframe.

    Examples:

    df = DataFrame.new(KEY: %w[A B C], X1: [1, 2, 3])
    
    # =>
      KEY           X1
      <string> <uint8>
    0 A              1
    1 B              2
    2 C              3
    
    other = DataFrame.new(KEY: %w[A B D], X2: [true, false, nil])
    
    # =>
      KEY      X2
      <string> <boolean>
    0 A        true
    1 B        false
    2 D        (nil)

    without key (use implicit common key)

    df.semi_join(other)
    
    # =>
      KEY           X1
      <string> <uint8>
    0 A              1
    1 B              2

    Parameters:

    • other (DataFrame, Arrow::Table)

      A DataFrame or a Table to be joined with self.

    • force_order (Boolean) (defaults to: true)

      whether to force the output order to always be the same.

      • This option is used in `:full_outer` and `:right_outer`.

      • If this option is true (the default), an index is appended to the source and the result is sorted after joining. This causes some performance degradation.

    • suffix (#succ) (defaults to: '.1')

      a suffix to rename keys when key names conflict as a result of the join. `suffix` must respond to `#succ`.

    Returns:

    • (DataFrame)

      Joined dataframe.

  • #semi_join(other, join_keys, suffix: '.1', force_order: true) ⇒ DataFrame

    Returns joined dataframe.

    Examples:

    df = DataFrame.new(KEY: %w[A B C], X1: [1, 2, 3])
    
    # =>
      KEY           X1
      <string> <uint8>
    0 A              1
    1 B              2
    2 C              3
    
    other = DataFrame.new(KEY: %w[A B D], X2: [true, false, nil])
    
    # =>
      KEY      X2
      <string> <boolean>
    0 A        true
    1 B        false
    2 D        (nil)

    with a key

    df.semi_join(other, :KEY)
    
    # =>
      KEY           X1
      <string> <uint8>
    0 A              1
    1 B              2

    Parameters:

    • other (DataFrame, Arrow::Table)

      A DataFrame or a Table to be joined with self.

    • join_keys (String, Symbol, Array<String, Symbol>)

      a key or keys to match.

    • force_order (Boolean) (defaults to: true)

      whether to force the output order to always be the same.

      • This option is used in `:full_outer` and `:right_outer`.

      • If this option is true (the default), an index is appended to the source and the result is sorted after joining. This causes some performance degradation.

    • suffix (#succ) (defaults to: '.1')

      a suffix to rename keys when key names conflict as a result of the join. `suffix` must respond to `#succ`.

    Returns:

    • (DataFrame)

      Joined dataframe.

  • #semi_join(other, join_key_pairs, suffix: '.1', force_order: true) ⇒ DataFrame

    Returns joined dataframe.

    Examples:

    df2 = DataFrame.new(KEY1: %w[A B C], X1: [1, 2, 3])
    
    # =>
      KEY1          X1
      <string> <uint8>
    0 A              1
    1 B              2
    2 C              3
    
    other2 = DataFrame.new(KEY2: %w[A B D], X2: [true, false, nil])
    
    # =>
      KEY2     X2
      <string> <boolean>
    0 A        true
    1 B        false
    2 D        (nil)

    with key pairs

    df2.semi_join(other2, { left: :KEY1, right: :KEY2 })
    
    # =>
      KEY1          X1
      <string> <uint8>
    0 A              1
    1 B              2

    Parameters:

    • other (DataFrame, Arrow::Table)

      A DataFrame or a Table to be joined with self.

    • join_key_pairs (Hash)

      pairs of a key name or key names to match in left and right.

    • force_order (Boolean) (defaults to: true)

      whether to force the output order to always be the same.

      • This option is used in `:full_outer` and `:right_outer`.

      • If this option is true (the default), an index is appended to the source and the result is sorted after joining. This causes some performance degradation.

    • suffix (#succ) (defaults to: '.1')

      a suffix to rename keys when key names conflict as a result of the join. `suffix` must respond to `#succ`.

    Options Hash (join_key_pairs):

    • :left (String, Symbol, Array<String, Symbol>)

      join keys in `self`.

    • :right (String, Symbol, Array<String, Symbol>)

      join keys in `other`.

    Returns:

    • (DataFrame)

      Joined dataframe.

Since:

  • 0.2.3



# File 'lib/red_amber/data_frame_combinable.rb', line 559

def semi_join(other, join_keys = nil, suffix: '.1', force_order: true)
  join(other, join_keys, type: :left_semi, suffix: suffix, force_order: force_order)
end
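
A filtering join such as `#semi_join` or `#anti_join` only decides which rows of self to keep; it never adds columns from other. The plain-Ruby sketch below illustrates that difference from a mutating join. `semi_rows` and `anti_rows` are hypothetical helpers operating on arrays of row hashes, not RedAmber methods.

```ruby
require 'set'

# Keep the left rows whose key value appears on the right side.
def semi_rows(left, right, left_key, right_key = left_key)
  present = right.map { |row| row[right_key] }.to_set
  left.select { |row| present.include?(row[left_key]) }
end

# Keep the left rows whose key value does NOT appear on the right side.
def anti_rows(left, right, left_key, right_key = left_key)
  present = right.map { |row| row[right_key] }.to_set
  left.reject { |row| present.include?(row[left_key]) }
end

df2_rows    = [{ KEY1: 'A', X1: 1 }, { KEY1: 'B', X1: 2 }, { KEY1: 'C', X1: 3 }]
other2_rows = [{ KEY2: 'A', X2: true }, { KEY2: 'B', X2: false }, { KEY2: 'D', X2: nil }]

kept    = semi_rows(df2_rows, other2_rows, :KEY1, :KEY2) # rows with KEY1 'A' and 'B'
dropped = anti_rows(df2_rows, other2_rows, :KEY1, :KEY2) # the row with KEY1 'C'
```

The two results partition the left rows, and neither contains `X2`, matching the example outputs for the key-pair overload.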

#set_operable?(other) ⇒ Boolean

Check if set operation with self and other is possible.

Examples:

df3 = DataFrame.new(
  KEY1: %w[A B C],
  KEY2: [1, 2, 3]
)

# =>
  KEY1        KEY2
  <string> <uint8>
0 A              1
1 B              2
2 C              3

other3 = DataFrame.new(
  KEY1: %w[A B D],
  KEY2: [1, 4, 5]
)

# =>
  KEY1        KEY2
  <string> <uint8>
0 A              1
1 B              4
2 D              5

df3.set_operable?(other3) # => true

Parameters:

  • other (DataFrame, Arrow::Table)

    A DataFrame or a Table to be joined with self.

Returns:

  • (Boolean)

    true if set operation is possible.

Since:

  • 0.2.3



# File 'lib/red_amber/data_frame_combinable.rb', line 637

def set_operable?(other) # rubocop:disable Naming/AccessorMethodName
  keys == other.keys.map(&:to_sym)
end
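
The check in the source above is an `Array#==` comparison after converting the other side's keys to symbols, so both the key names and their order must match. A minimal stand-alone sketch follows; `same_keys?` is a hypothetical helper that takes the key lists directly, assuming self reports its keys as symbols.

```ruby
# Mirrors the comparison in set_operable?, with the key lists passed in directly.
def same_keys?(own_keys, other_keys)
  own_keys == other_keys.map(&:to_sym)
end

same_keys?(%i[KEY1 KEY2], %w[KEY1 KEY2]) # => true  (strings are converted to symbols)
same_keys?(%i[KEY1 KEY2], %i[KEY2 KEY1]) # => false (Array#== is order-sensitive)
same_keys?(%i[KEY1 KEY2], %i[KEY1])      # => false (different number of keys)
```

Column types are not compared by this check, only key names and order.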

#union(other) ⇒ DataFrame

Select records appearing in self or other.

  • Same as `#join` with `type: :full_outer` when the keys in self are the same as those in other.

  • A kind of set operations.

Examples:

df3 = DataFrame.new(
  KEY1: %w[A B C],
  KEY2: [1, 2, 3]
)

# =>
  KEY1        KEY2
  <string> <uint8>
0 A              1
1 B              2
2 C              3

other3 = DataFrame.new(
  KEY1: %w[A B D],
  KEY2: [1, 4, 5]
)

# =>
  KEY1        KEY2
  <string> <uint8>
0 A              1
1 B              4
2 D              5

df3.union(other3)

# =>
  KEY1        KEY2
  <string> <uint8>
0 A              1
1 B              2
2 C              3
3 B              4
4 D              5

Parameters:

  • other (DataFrame, Arrow::Table)

    A DataFrame or a Table to be joined with self.

Returns:

  • (DataFrame)

Since:

  • 0.2.3



# File 'lib/red_amber/data_frame_combinable.rb', line 689

def union(other)
  unless keys == other.keys.map(&:to_sym)
    raise DataFrameArgumentError, 'keys are not same with self and other'
  end

  join(other, keys, type: :full_outer, force_order: true)
end
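
Because every column is used as a join key, the full outer join behaves like a set union of whole records. The plain-Ruby sketch below reproduces the example result on arrays of row hashes. `union_rows` is a hypothetical helper; it uses `Array#uniq`, so it also removes duplicates inside one side, which may differ from what the join does for such input.

```ruby
# Hypothetical model of a record-level union on arrays of row hashes.
def union_rows(left, right)
  # Same precondition as in the source above: identical keys in identical order.
  unless left.first&.keys == right.first&.keys
    raise ArgumentError, 'keys are not same with self and other'
  end

  # Rows of self first, then the rows of other that are not already present.
  (left + right).uniq
end

df3_rows    = [{ KEY1: 'A', KEY2: 1 }, { KEY1: 'B', KEY2: 2 }, { KEY1: 'C', KEY2: 3 }]
other3_rows = [{ KEY1: 'A', KEY2: 1 }, { KEY1: 'B', KEY2: 4 }, { KEY1: 'D', KEY2: 5 }]

union_rows(df3_rows, other3_rows).size # => 5
```

The record `A, 1` occurs on both sides and appears once in the result, while `B, 2` and `B, 4` are kept as distinct records.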