public class SparkStorageUtils extends Object
Modifier and Type | Field and Description |
---|---|
static int | DEFAULT_MAP_FILE_INTERVAL: By default, a map file's index stores only a fraction of the keys. |
static String | MAP_FILE_INDEX_INTERVAL_KEY: Configuration key for the map file interval. |
Modifier and Type | Method and Description |
---|---|
static org.apache.spark.api.java.JavaPairRDD<Long,List<Writable>> | restoreMapFile(String path, org.apache.spark.api.java.JavaSparkContext sc): Restore a JavaPairRDD<Long,List<Writable>> previously saved with saveMapFile(String, JavaRDD). Note that if the keys are not required, simply use restoreMapFile(...).values(). |
static org.apache.spark.api.java.JavaPairRDD<Long,List<List<Writable>>> | restoreMapFileSequences(String path, org.apache.spark.api.java.JavaSparkContext sc): Restore a JavaPairRDD<Long,List<List<Writable>>> previously saved with saveMapFileSequences(String, JavaRDD). Note that if the keys are not required, simply use restoreMapFileSequences(...).values(). |
static org.apache.spark.api.java.JavaRDD<List<Writable>> | restoreSequenceFile(String path, org.apache.spark.api.java.JavaSparkContext sc): Restore a JavaRDD<List<Writable>> previously saved with saveSequenceFile(String, JavaRDD). |
static org.apache.spark.api.java.JavaRDD<List<List<Writable>>> | restoreSequenceFileSequences(String path, org.apache.spark.api.java.JavaSparkContext sc): Restore a JavaRDD<List<List<Writable>>> previously saved with saveSequenceFileSequences(String, JavaRDD). |
static void | saveMapFile(String path, org.apache.spark.api.java.JavaRDD<List<Writable>> rdd): Save a JavaRDD<List<Writable>> to a Hadoop MapFile. |
static void | saveMapFile(String path, org.apache.spark.api.java.JavaRDD<List<Writable>> rdd, org.apache.hadoop.conf.Configuration c, Integer maxOutputFiles): Save a JavaRDD<List<Writable>> to a Hadoop MapFile. |
static void | saveMapFile(String path, org.apache.spark.api.java.JavaRDD<List<Writable>> rdd, int interval, Integer maxOutputFiles): Save a JavaRDD<List<Writable>> to a Hadoop MapFile. |
static void | saveMapFileSequences(String path, org.apache.spark.api.java.JavaRDD<List<List<Writable>>> rdd): Save a JavaRDD<List<List<Writable>>> to a Hadoop MapFile. |
static void | saveMapFileSequences(String path, org.apache.spark.api.java.JavaRDD<List<List<Writable>>> rdd, org.apache.hadoop.conf.Configuration c, Integer maxOutputFiles): Save a JavaRDD<List<List<Writable>>> to a Hadoop MapFile. |
static void | saveMapFileSequences(String path, org.apache.spark.api.java.JavaRDD<List<List<Writable>>> rdd, int interval, Integer maxOutputFiles): Save a JavaRDD<List<List<Writable>>> to a Hadoop MapFile. |
static void | saveSequenceFile(String path, org.apache.spark.api.java.JavaRDD<List<Writable>> rdd): Save a JavaRDD<List<Writable>> to a Hadoop SequenceFile. |
static void | saveSequenceFile(String path, org.apache.spark.api.java.JavaRDD<List<Writable>> rdd, Integer maxOutputFiles): Save a JavaRDD<List<Writable>> to a Hadoop SequenceFile. |
static void | saveSequenceFileSequences(String path, org.apache.spark.api.java.JavaRDD<List<List<Writable>>> rdd): Save a JavaRDD<List<List<Writable>>> to a Hadoop SequenceFile. |
static void | saveSequenceFileSequences(String path, org.apache.spark.api.java.JavaRDD<List<List<Writable>>> rdd, Integer maxOutputFiles): Save a JavaRDD<List<List<Writable>>> to a Hadoop SequenceFile. |
public static final String MAP_FILE_INDEX_INTERVAL_KEY

Configuration key for the map file interval.

public static final int DEFAULT_MAP_FILE_INTERVAL

By default, a map file's index stores only a fraction of the keys. This default interval is usually suitable for readers such as MapFileRecordReader and MapFileSequenceRecordReader.
public static void saveSequenceFile(String path, org.apache.spark.api.java.JavaRDD<List<Writable>> rdd)

Save a JavaRDD<List<Writable>> to a Hadoop SequenceFile. Each record is given a unique (but noncontiguous) LongWritable key, and values are stored as RecordWritable instances.

Use restoreSequenceFile(String, JavaSparkContext) to restore values saved with this method.

Parameters:
path - Path to save the sequence file
rdd - RDD to save
See Also:
saveSequenceFileSequences(String, JavaRDD), saveMapFile(String, JavaRDD)
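A minimal usage sketch (the local-mode context, the toy records, and the /tmp output path are illustrative assumptions; SparkStorageUtils and the writable classes are shown imported from DataVec's org.datavec packages):

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.datavec.api.writable.DoubleWritable;
import org.datavec.api.writable.IntWritable;
import org.datavec.api.writable.Writable;
import org.datavec.spark.storage.SparkStorageUtils;

public class SaveSequenceFileExample {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "saveSequenceFileExample");

        // Each List<Writable> is one record (one row of Writable values)
        List<List<Writable>> records = Arrays.asList(
                Arrays.<Writable>asList(new IntWritable(0), new DoubleWritable(1.5)),
                Arrays.<Writable>asList(new IntWritable(1), new DoubleWritable(2.5)));
        JavaRDD<List<Writable>> rdd = sc.parallelize(records);

        // Write the RDD as a Hadoop SequenceFile; LongWritable keys are assigned automatically
        SparkStorageUtils.saveSequenceFile("/tmp/datavec/seqfile", rdd);

        sc.stop();
    }
}
```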
public static void saveSequenceFile(String path, org.apache.spark.api.java.JavaRDD<List<Writable>> rdd, Integer maxOutputFiles)

Save a JavaRDD<List<Writable>> to a Hadoop SequenceFile. Each record is given a unique (but noncontiguous) LongWritable key, and values are stored as RecordWritable instances.

Use restoreSequenceFile(String, JavaSparkContext) to restore values saved with this method.

Parameters:
path - Path to save the sequence file
rdd - RDD to save
maxOutputFiles - Nullable. If non-null: first coalesce the RDD to the specified size (number of partitions) to limit the maximum number of output sequence files
See Also:
saveSequenceFileSequences(String, JavaRDD), saveMapFile(String, JavaRDD)
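The only difference from the two-argument form is the trailing argument; a short fragment, reusing the sc and rdd from the sketch above (paths are illustrative):

```java
// Coalesce to at most 4 partitions before writing, so at most 4
// output sequence files are produced
SparkStorageUtils.saveSequenceFile("/tmp/datavec/seqfile-capped", rdd, 4);

// A null maxOutputFiles leaves the partitioning (and file count) unchanged
SparkStorageUtils.saveSequenceFile("/tmp/datavec/seqfile-uncapped", rdd, null);
```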
public static org.apache.spark.api.java.JavaRDD<List<Writable>> restoreSequenceFile(String path, org.apache.spark.api.java.JavaSparkContext sc)

Restore a JavaRDD<List<Writable>> previously saved with saveSequenceFile(String, JavaRDD).
Parameters:
path - Path of the sequence file
sc - Spark context
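A round-trip fragment, assuming the save sketch above has already run in the same application:

```java
// Read the records back; the restored RDD carries no ordering guarantees
JavaRDD<List<Writable>> restored =
        SparkStorageUtils.restoreSequenceFile("/tmp/datavec/seqfile", sc);
System.out.println("Restored " + restored.count() + " records");
```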
public static void saveSequenceFileSequences(String path, org.apache.spark.api.java.JavaRDD<List<List<Writable>>> rdd)
Save a JavaRDD<List<List<Writable>>> to a Hadoop SequenceFile. Each record is given a unique (but noncontiguous) LongWritable key, and values are stored as SequenceRecordWritable instances.

Use restoreSequenceFileSequences(String, JavaSparkContext) to restore values saved with this method.

Parameters:
path - Path to save the sequence file
rdd - RDD to save
See Also:
saveSequenceFile(String, JavaRDD), saveMapFileSequences(String, JavaRDD)
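A sketch for sequence data, where each RDD element is one whole sequence (an ordered list of time steps, each itself a List<Writable>); the data is illustrative:

```java
// One sequence = an ordered list of time steps
List<List<Writable>> sequence = Arrays.asList(
        Arrays.<Writable>asList(new IntWritable(0), new DoubleWritable(0.1)),
        Arrays.<Writable>asList(new IntWritable(1), new DoubleWritable(0.2)));

JavaRDD<List<List<Writable>>> seqRdd = sc.parallelize(Arrays.asList(sequence, sequence));
SparkStorageUtils.saveSequenceFileSequences("/tmp/datavec/seq-sequences", seqRdd);
```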
public static void saveSequenceFileSequences(String path, org.apache.spark.api.java.JavaRDD<List<List<Writable>>> rdd, Integer maxOutputFiles)

Save a JavaRDD<List<List<Writable>>> to a Hadoop SequenceFile. Each record is given a unique (but noncontiguous) LongWritable key, and values are stored as SequenceRecordWritable instances.

Use restoreSequenceFileSequences(String, JavaSparkContext) to restore values saved with this method.

Parameters:
path - Path to save the sequence file
rdd - RDD to save
maxOutputFiles - Nullable. If non-null: first coalesce the RDD to the specified size (number of partitions) to limit the maximum number of output sequence files
See Also:
saveSequenceFile(String, JavaRDD), saveMapFileSequences(String, JavaRDD)
public static org.apache.spark.api.java.JavaRDD<List<List<Writable>>> restoreSequenceFileSequences(String path, org.apache.spark.api.java.JavaSparkContext sc)

Restore a JavaRDD<List<List<Writable>>> previously saved with saveSequenceFileSequences(String, JavaRDD).
Parameters:
path - Path of the sequence file
sc - Spark context
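And the matching restore, again assuming the sequences saved above:

```java
JavaRDD<List<List<Writable>>> restoredSeqs =
        SparkStorageUtils.restoreSequenceFileSequences("/tmp/datavec/seq-sequences", sc);

// Each element is one full sequence (a list of time steps)
restoredSeqs.foreach(seq -> System.out.println("Sequence length: " + seq.size()));
```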
public static void saveMapFile(String path, org.apache.spark.api.java.JavaRDD<List<Writable>> rdd)
Save a JavaRDD<List<Writable>> to a Hadoop MapFile. Each record is given a unique and contiguous LongWritable key, and values are stored as RecordWritable instances (as consumed by MapFileRecordReader).

This uses a map file index interval of DEFAULT_MAP_FILE_INTERVAL, which is usually suitable for use cases such as MapFileRecordReader. Use saveMapFile(String, JavaRDD, int, Integer) or saveMapFile(String, JavaRDD, Configuration, Integer) to customize this.

Use restoreMapFile(String, JavaSparkContext) to restore values saved with this method.

Parameters:
path - Path to save the MapFile
rdd - RDD to save
See Also:
saveMapFileSequences(String, JavaRDD), saveSequenceFile(String, JavaRDD)
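A sketch (toy data and path are assumptions; Text is assumed from org.datavec.api.writable). Because keys must be contiguous, this write does more work than saveSequenceFile:

```java
JavaRDD<List<Writable>> rdd = sc.parallelize(Arrays.asList(
        Arrays.<Writable>asList(new IntWritable(0), new Text("first")),
        Arrays.<Writable>asList(new IntWritable(1), new Text("second"))));

// Writes a MapFile with contiguous LongWritable keys 0..n-1,
// using the default index interval (DEFAULT_MAP_FILE_INTERVAL)
SparkStorageUtils.saveMapFile("/tmp/datavec/mapfile", rdd);
```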
public static void saveMapFile(String path, org.apache.spark.api.java.JavaRDD<List<Writable>> rdd, int interval, Integer maxOutputFiles)
Save a JavaRDD<List<Writable>> to a Hadoop MapFile. Each record is given a unique and contiguous LongWritable key, and values are stored as RecordWritable instances (as consumed by MapFileRecordReader).

Use restoreMapFile(String, JavaSparkContext) to restore values saved with this method.

Parameters:
path - Path to save the MapFile
rdd - RDD to save
interval - The map file index interval to use. Smaller values may result in faster lookups, at the expense of more memory/disk use. However, the increase is usually relatively minor, since keys are stored as LongWritable objects
maxOutputFiles - Nullable. If non-null: first coalesce the RDD to the specified size (number of partitions) to limit the maximum number of output map files
See Also:
saveMapFileSequences(String, JavaRDD), saveSequenceFile(String, JavaRDD)
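A fragment showing the trade-off the interval parameter controls (values and paths are illustrative):

```java
// interval = 1 indexes every key: fastest random lookups, largest index
SparkStorageUtils.saveMapFile("/tmp/datavec/mapfile-dense", rdd, 1, null);

// interval = 128 indexes every 128th key, and maxOutputFiles = 1
// coalesces to a single output map file
SparkStorageUtils.saveMapFile("/tmp/datavec/mapfile-sparse", rdd, 128, 1);
```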
public static void saveMapFile(String path, org.apache.spark.api.java.JavaRDD<List<Writable>> rdd, org.apache.hadoop.conf.Configuration c, Integer maxOutputFiles)
Save a JavaRDD<List<Writable>> to a Hadoop MapFile. Each record is given a unique and contiguous LongWritable key, and values are stored as RecordWritable instances (as consumed by MapFileRecordReader).

Use restoreMapFile(String, JavaSparkContext) to restore values saved with this method.

Parameters:
path - Path to save the MapFile
rdd - RDD to save
c - Configuration object, used to customise options for the map file
maxOutputFiles - Nullable. If non-null: first coalesce the RDD to the specified size (number of partitions) to limit the maximum number of output map files
See Also:
saveMapFileSequences(String, JavaRDD), saveSequenceFile(String, JavaRDD)
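A hedged fragment for the Configuration-based overload. The assumption here, based on the field documentation above, is that the writer reads the index interval from MAP_FILE_INDEX_INTERVAL_KEY; other MapFile options can be set on the same Configuration:

```java
// Requires org.apache.hadoop.conf.Configuration
Configuration conf = new Configuration();
// Assumed equivalent in effect to passing interval = 32 to the int overload
conf.setInt(SparkStorageUtils.MAP_FILE_INDEX_INTERVAL_KEY, 32);

SparkStorageUtils.saveMapFile("/tmp/datavec/mapfile-conf", rdd, conf, null);
```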
public static org.apache.spark.api.java.JavaPairRDD<Long,List<Writable>> restoreMapFile(String path, org.apache.spark.api.java.JavaSparkContext sc)
Restore a JavaPairRDD<Long,List<Writable>> previously saved with saveMapFile(String, JavaRDD). Note that if the keys are not required, simply use restoreMapFile(...).values().

Parameters:
path - Path of the MapFile
sc - Spark context
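A fragment illustrating the restore, including the values() shortcut mentioned above:

```java
JavaPairRDD<Long, List<Writable>> pairs =
        SparkStorageUtils.restoreMapFile("/tmp/datavec/mapfile", sc);

// Drop the keys when they are not needed
JavaRDD<List<Writable>> values = pairs.values();
```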
public static void saveMapFileSequences(String path, org.apache.spark.api.java.JavaRDD<List<List<Writable>>> rdd)
Save a JavaRDD<List<List<Writable>>> to a Hadoop MapFile. Each record is given a unique and contiguous LongWritable key, and values are stored as SequenceRecordWritable instances (as consumed by MapFileSequenceRecordReader).

This uses a map file index interval of DEFAULT_MAP_FILE_INTERVAL, which is usually suitable for use cases such as MapFileSequenceRecordReader. Use saveMapFileSequences(String, JavaRDD, int, Integer) or saveMapFileSequences(String, JavaRDD, Configuration, Integer) to customize this.

Use restoreMapFileSequences(String, JavaSparkContext) to restore values saved with this method.

Parameters:
path - Path to save the MapFile
rdd - RDD to save
See Also:
saveMapFile(String, JavaRDD), saveSequenceFileSequences(String, JavaRDD)
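A sketch mirroring the saveSequenceFileSequences example, but producing a MapFile (data and path are illustrative):

```java
List<List<Writable>> sequence = Arrays.asList(
        Arrays.<Writable>asList(new DoubleWritable(0.1)),
        Arrays.<Writable>asList(new DoubleWritable(0.2)));
JavaRDD<List<List<Writable>>> seqRdd = sc.parallelize(Arrays.asList(sequence));

// Contiguous keys and the default index interval, as used by MapFileSequenceRecordReader
SparkStorageUtils.saveMapFileSequences("/tmp/datavec/mapfile-sequences", seqRdd);
```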
public static void saveMapFileSequences(String path, org.apache.spark.api.java.JavaRDD<List<List<Writable>>> rdd, int interval, Integer maxOutputFiles)
Save a JavaRDD<List<List<Writable>>> to a Hadoop MapFile. Each record is given a unique and contiguous LongWritable key, and values are stored as SequenceRecordWritable instances (as consumed by MapFileSequenceRecordReader).

Use restoreMapFileSequences(String, JavaSparkContext) to restore values saved with this method.

Parameters:
path - Path to save the MapFile
rdd - RDD to save
interval - The map file index interval to use. Smaller values may result in faster lookups, at the expense of more memory/disk use. However, the increase is usually relatively minor, since keys are stored as LongWritable objects
maxOutputFiles - Nullable. If non-null: first coalesce the RDD to the specified size (number of partitions) to limit the maximum number of output map files
See Also:
saveMapFileSequences(String, JavaRDD), saveSequenceFile(String, JavaRDD)
public static void saveMapFileSequences(String path, org.apache.spark.api.java.JavaRDD<List<List<Writable>>> rdd, org.apache.hadoop.conf.Configuration c, Integer maxOutputFiles)
Save a JavaRDD<List<List<Writable>>> to a Hadoop MapFile. Each record is given a unique and contiguous LongWritable key, and values are stored as SequenceRecordWritable instances (as consumed by MapFileSequenceRecordReader).

Use restoreMapFileSequences(String, JavaSparkContext) to restore values saved with this method.

Parameters:
path - Path to save the MapFile
rdd - RDD to save
c - Configuration object, used to customise options for the map file
maxOutputFiles - Nullable. If non-null: first coalesce the RDD to the specified size (number of partitions) to limit the maximum number of output map files
See Also:
saveMapFileSequences(String, JavaRDD), saveSequenceFile(String, JavaRDD)
public static org.apache.spark.api.java.JavaPairRDD<Long,List<List<Writable>>> restoreMapFileSequences(String path, org.apache.spark.api.java.JavaSparkContext sc)
Restore a JavaPairRDD<Long,List<List<Writable>>> previously saved with saveMapFileSequences(String, JavaRDD). Note that if the keys are not required, simply use restoreMapFileSequences(...).values().

Parameters:
path - Path of the MapFile
sc - Spark context
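Finally, a round-trip fragment for the map file sequences saved above:

```java
JavaPairRDD<Long, List<List<Writable>>> seqPairs =
        SparkStorageUtils.restoreMapFileSequences("/tmp/datavec/mapfile-sequences", sc);

// Keys are the contiguous indices assigned at save time; drop them if unneeded
JavaRDD<List<List<Writable>>> sequences = seqPairs.values();
System.out.println("Restored " + sequences.count() + " sequences");
```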