TensorFlow-based data import mechanism

A look at TensorFlow's data import mechanism

Today we are going to talk about the data import mechanism in TensorFlow. The traditional approach is to build the TF graph first, then open a session and feed the data into the graph while it runs. The drawback is that this data I/O takes a lot of time, so the approach is not recommended for training on very large datasets. TensorFlow's replacement for it is the tf.data.Dataset module, which is our focus today.

tf.data is a very powerful API for building complex data import pipelines. For example, if you are working with images, tf.data can help you gather files spread across different locations, add small random noise to each image, and randomly select a subset of the images to train on as a batch. If you are processing text, tf.data can help parse symbols from the raw text, convert them into an embedding matrix, and then batch together sequences of different lengths.

We can use tf.data.Dataset to build a dataset, and the data can come from several kinds of sources. If your dataset has already been written to disk in TFRecord format, you can build it with tf.data.TFRecordDataset; if your dataset is a tensor held in memory, you can build it with tf.data.Dataset.from_tensors() or tf.data.Dataset.from_tensor_slices(). Below I will demonstrate them through code.

First, let's look at datasets built in memory. As shown in the following code, we first build a dataset containing the integers 0 to 9 and then construct an iterator, which extracts one element from the dataset each time it is run:

    import tensorflow as tf

    dataset = tf.data.Dataset.range(10)
    iterator = dataset.make_one_shot_iterator()
    next_element = iterator.get_next()

    with tf.Session() as sess:
        for _ in range(10):
            print(sess.run(next_element))

As shown in the code above, range() is a static method of the tf.data.Dataset class that generates a sequence. Note that all elements of a dataset must have the same data type and internal structure. In addition, since range(10) represents ten numbers from 0 to 9, the iterator here can only be run 10 times; any further run raises a tf.errors.OutOfRangeError exception. If you do not want the exception, you can call dataset.repeat(count), which makes the dataset repeat itself count times (or indefinitely when count is omitted).
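To see both behaviours side by side, here is a minimal sketch that is not part of the original project. It assumes a TensorFlow release that ships tensorflow.compat.v1 (late 1.x or 2.x), which is why it imports that module instead of plain tensorflow, and drain is a hypothetical helper name. It reads a dataset until tf.errors.OutOfRangeError is raised, first without and then with repeat(2):

```python
import tensorflow.compat.v1 as tf  # TF1-style graph API, also shipped with TF 2.x

tf.disable_eager_execution()


def drain(dataset):
    """Pull elements from a dataset until its iterator is exhausted."""
    iterator = tf.data.make_one_shot_iterator(dataset)
    next_element = iterator.get_next()
    values = []
    with tf.Session() as sess:
        while True:
            try:
                values.append(int(sess.run(next_element)))
            except tf.errors.OutOfRangeError:
                break  # the dataset has no more elements
    return values


one_pass = drain(tf.data.Dataset.range(3))
two_passes = drain(tf.data.Dataset.range(3).repeat(2))
print(one_pass)    # [0, 1, 2]
print(two_passes)  # [0, 1, 2, 0, 1, 2]
```

Catching the exception is handy when you do not know the dataset size in advance, while repeat(count) is the simpler choice when you just want several passes over the same data.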

The upper bound of the range can also be determined at run time by defining max_range as a placeholder. In that case we need to call the dataset's make_initializable_iterator method to construct the iterator, and the iterator's initializer operation must be run before iterating. The code is as follows:

    import tensorflow as tf

    max_range = tf.placeholder(tf.int64, shape=[])
    dataset = tf.data.Dataset.range(max_range)
    iterator = dataset.make_initializable_iterator()
    next_element = iterator.get_next()

    with tf.Session() as sess:
        # The initializer must run before the first element is requested.
        sess.run(iterator.initializer, feed_dict={max_range: 10})
        for _ in range(10):
            print(sess.run(next_element))

You can also use the same iterator for different datasets. For the iterator to be reusable, the element types and shapes of the datasets must be consistent. For example, the following code demonstrates how to use one iterator for both a training set and a validation set. As you can see, before iterating over the training set we first run training_init_op so that the iterator starts loading training data. Likewise, before validating we first run validation_init_op.

    import tensorflow as tf

    training_data = tf.data.Dataset.range(100).map(
        lambda x: x + tf.random_uniform([], -10, 10, tf.int64))
    validation_data = tf.data.Dataset.range(50)

    # A reinitializable iterator is defined by element structure, not by a dataset.
    iterator = tf.data.Iterator.from_structure(training_data.output_types,
                                               training_data.output_shapes)
    next_element = iterator.get_next()

    training_init_op = iterator.make_initializer(training_data)
    validation_init_op = iterator.make_initializer(validation_data)

    with tf.Session() as sess:
        for epoch in range(10):
            # Point the iterator at the training data and consume one pass.
            sess.run(training_init_op)
            for _ in range(100):
                sess.run(next_element)

            # Point the same iterator at the validation data.
            sess.run(validation_init_op)
            for _ in range(50):
                sess.run(next_element)

You can also build a tf.data.Dataset from a tensor, as shown in the following code. Note that the tensor here has shape 4×10, so the iterator can be run 4 times, and each run produces a vector of length 10.

    import tensorflow as tf

    dataset = tf.data.Dataset.from_tensor_slices(tf.random_uniform([4, 10]))
    iterator = dataset.make_initializable_iterator()
    next_element = iterator.get_next()

    with tf.Session() as sess:
        sess.run(iterator.initializer)
        for i in range(4):
            value = sess.run(next_element)
            print(value)
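The other in-memory constructor mentioned earlier, from_tensors(), treats the whole tensor as a single element instead of slicing it along the first dimension. The sketch below compares the two on a 4×10 tensor. It is an illustration only: it assumes a TensorFlow release that ships tensorflow.compat.v1, it uses tf.zeros in place of random data, and element_shapes is a hypothetical helper.

```python
import tensorflow.compat.v1 as tf  # TF1-style graph API, also shipped with TF 2.x

tf.disable_eager_execution()


def element_shapes(dataset):
    """Return the shape of every element the dataset yields, in order."""
    iterator = tf.data.make_one_shot_iterator(dataset)
    next_element = iterator.get_next()
    shapes = []
    with tf.Session() as sess:
        while True:
            try:
                shapes.append(sess.run(next_element).shape)
            except tf.errors.OutOfRangeError:
                break  # iterator exhausted
    return shapes


data = tf.zeros([4, 10])
sliced_shapes = element_shapes(tf.data.Dataset.from_tensor_slices(data))
whole_shapes = element_shapes(tf.data.Dataset.from_tensors(data))
print(sliced_shapes)  # four elements, each of shape (10,)
print(whole_shapes)   # one element of shape (4, 10)
```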

Finally, a more common way to read data is to read it from TFRecord files. Below is the TFRecord reading and writing code that I previously used in a speech recognition project.

The first step is to write the audio features into a TFRecord file. In speech recognition, the two most commonly used features are MFCC and LogFBank. Besides these two features, the file also stores the text label and the length of the feature sequence, sequence_length. Of these four variables only sequence_length is an integer scalar; the other three are lists. The lists are therefore saved as bytes, and the scalar is saved as an integer.

    import numpy as np
    import tensorflow as tf


    def _bytes_feature(value):
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


    def _int64_feature(value):
        return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


    class RecordWriter(object):
        def __init__(self):
            pass

        def write(self, content, tfrecords_filename):
            # content is ordered as [mfcc, logfbank, label, sequence_length].
            writer = tf.python_io.TFRecordWriter(tfrecords_filename)
            if isinstance(content, list):
                feature_dict = {}
                for i in range(len(content)):
                    feature = content[i]
                    # The dtypes below must match the ones passed to tf.decode_raw
                    # when parsing; tobytes() replaces the deprecated tostring().
                    if i == 0:
                        feature_raw = np.array(feature, dtype=np.float32).tobytes()
                        feature_dict['mfccFeat'] = _bytes_feature(feature_raw)
                    elif i == 1:
                        feature_raw = np.array(feature, dtype=np.float32).tobytes()
                        feature_dict['logfbankFeat'] = _bytes_feature(feature_raw)
                    elif i == 2:
                        feature_raw = np.array(feature, dtype=np.int64).tobytes()
                        feature_dict['label'] = _bytes_feature(feature_raw)
                    else:
                        feature_dict['sequence_length'] = _int64_feature(feature)
                features_to_write = tf.train.Example(
                    features=tf.train.Features(feature=feature_dict))
                writer.write(features_to_write.SerializeToString())
            writer.close()
            print('Record has been written: ' + tfrecords_filename)

Once the TFRecord file has been written, each record has to be parsed when it is read back. The parsing function is as follows:

    def parse(self, serialized):
        feature_dict = {}
        feature_dict['mfccFeat'] = tf.FixedLenFeature([], tf.string)
        feature_dict['logfbankFeat'] = tf.FixedLenFeature([], tf.string)
        feature_dict['label'] = tf.FixedLenFeature([], tf.string)
        feature_dict['sequence_length'] = tf.FixedLenFeature([1], tf.int64)
        features = tf.parse_single_example(serialized, features=feature_dict)
        # decode_raw returns a flat vector, so reshape back to [frames, feature_num].
        mfcc = tf.reshape(tf.decode_raw(features['mfccFeat'], tf.float32),
                          [-1, self.feature_num])
        logfbank = tf.reshape(tf.decode_raw(features['logfbankFeat'], tf.float32),
                              [-1, self.feature_num])
        label = tf.decode_raw(features['label'], tf.int64)
        return mfcc, logfbank, label, features['sequence_length']
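One detail worth spelling out: the writer stores each feature as raw bytes, and raw bytes carry neither a dtype nor a shape. The small NumPy-only sketch below shows why the dtype used when writing has to match the one used when decoding, and why the parse function has to reshape. The feature values are made up, and np.frombuffer only stands in for tf.decode_raw here as an analogy.

```python
import numpy as np

# Hypothetical feature: 2 frames with 2 coefficients each.
feature = [[0.5, 1.5], [2.5, 3.5]]

raw64 = np.array(feature).tobytes()                    # NumPy's default dtype is float64
raw32 = np.array(feature, dtype=np.float32).tobytes()  # matches a float32 decode

print(len(raw64), len(raw32))  # 32 16

# Decoding with the matching dtype gives a flat vector; reshape restores the rows.
restored = np.frombuffer(raw32, dtype=np.float32).reshape(-1, 2)
print(restored.tolist())  # [[0.5, 1.5], [2.5, 3.5]]

# Decoding float64 bytes as float32 yields twice as many, meaningless, values.
misread = np.frombuffer(raw64, dtype=np.float32)
print(misread.size)  # 8
```

So if the features are float64 when they are serialized, decoding them as tf.float32 would silently produce garbage of the wrong length instead of an error.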

Then we can load the list of TFRecord files directly by calling tf.data.TFRecordDataset and apply the parse function to each record. Since the feature matrices of different records differ in length, they need to be padded to a common length within each batch. Finally we construct the iterator. The code is as follows:

    self.fileNameList = tf.placeholder(tf.string, [None, ])
    # -1 marks a dimension that is padded to the longest element in the batch.
    padded_shapes = ([-1, feature_num], [-1, feature_num], [-1], [1])
    padded_values = (0.0, 0.0, np.int64(-1), np.int64(0))
    dataset = (tf.data.TFRecordDataset(self.fileNameList, buffer_size=self.buffer_size)
               .map(self.parse, num_parallel_call)
               .padded_batch(batch_size, padded_shapes, padded_values))
    self.iterator = tf.data.Iterator.from_structure(
        (tf.float32, tf.float32, tf.int64, tf.int64),
        (tf.TensorShape([None, None, 60]),
         tf.TensorShape([None, None, 60]),
         tf.TensorShape([None, None]),
         tf.TensorShape([None, None])))
    self.initializer = self.iterator.make_initializer(dataset)
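To make the padding step concrete, here is a plain-Python sketch of what padding one batch amounts to. It is only an illustration of the idea, not how TensorFlow implements padded_batch; pad_batch and the sample labels are made up. It pads with -1, the same padding value used for the labels above.

```python
def pad_batch(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (longest - len(seq)) for seq in sequences]


# Hypothetical label sequences of different lengths.
labels = [[3, 1, 4], [1, 5], [9]]
batch = pad_batch(labels, pad_value=-1)
print(batch)  # [[3, 1, 4], [1, 5, -1], [9, -1, -1]]
```

Choosing a padding value that never occurs as a real label, such as -1 here, makes it easy to tell padded positions apart from real ones later on.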

This completes the reading and writing of TFRecord files, and with it our introduction to the data import mechanism in TensorFlow.
