### Own - Conda venv --- dc_info_venv
# Source --- https://medium.com/mostly-ai/tensorflow-records-what-they-are-and-how-to-use-them-c46bc4bbb564
### main Source --- https://www.tensorflow.org/guide/
#
import tensorflow as tf
#from tf.keras import layers ### Fails -- 'tf' is an import alias, not a package; would need 'from tensorflow.keras import layers' (we are on TF 1.5.0)
import math
import numpy as np
import h5py
import matplotlib.pyplot as plt
from tensorflow.python.framework import ops
#from tf_utils import load_dataset, random_mini_batches, convert_to_one_hot, predict
%matplotlib inline
np.random.seed(1)
#
print(tf.VERSION)
print(tf.keras.__version__)
import keras
print('Keras: {}'.format(keras.__version__))
In [2]:
# Source --- https://medium.com/mostly-ai/tensorflow-records-what-they-are-and-how-to-use-them-c46bc4bbb564
"""
A TFRecord file stores your data as a sequence of binary strings.
This means you need to specify the structure of your data before you write it to the file.
Tensorflow provides two components for this purpose:
tf.train.Example and
tf.train.SequenceExample.
You have to store each sample of your data in one of these structures,
then ----serialize-------- it and use a tf.python_io.TFRecordWriter to write it to disk.
"""
## DHANKAR --- FATT --- Some other sources mention getting IMAGES in as NUMPY ARRAYS ?
## SOURCE ---- https://www.tensorflow.org/api_docs/python/tf/data
"""
The tf.data API enables you to build complex input pipelines from simple, reusable pieces.
For example, the pipeline for an image model might aggregate data from files in a ---- distributed file system,
apply random perturbations to each image, and ------- merge randomly selected images ---- into a batch for training.
The pipeline for a text model might involve extracting symbols from raw text data, converting
them to embedding identifiers with a ----lookup table-----, and -----batching together sequences----
of different lengths.
The tf.data API makes it easy to deal with large amounts of data,
different data formats, and complicated transformations.
"""
### tensor_1 == image_data
### tensor_2 == image_label
"""
A tf.data.Dataset represents a sequence of elements, in which each element contains one or more ---Tensor-- objects.
For example, in an--- image pipeline, an element might be a ----single training example---, with a pair of tensors
representing the image data and a label.
"""
### Dataset.from_tensor_slices()
### Dataset.batch()
"""
Creating a source (e.g. Dataset.from_tensor_slices()) constructs a dataset from one or more tf.Tensor objects.
Applying a transformation (e.g. Dataset.batch()) constructs a dataset from one or more tf.data.Dataset objects.
"""
### tf.data.Iterator
"""
A tf.data.Iterator provides the main way to extract elements from a dataset.
The operation returned by Iterator.get_next() yields the next element of a Dataset when executed,
and typically acts as the interface between input pipeline code and your model.
The simplest iterator is a "one-shot iterator", which is associated with a particular Dataset and
iterates through it once.
For more sophisticated uses, the Iterator.initializer operation enables you to reinitialize
and parameterize an iterator with different datasets,
so that you can, for example,
iterate over training and validation data multiple times in the same program.
"""
### Dataset structure
# --- dataset >> elements >> tf.Tensor -- components >> tf.TensorShape
"""
Dataset structure
A dataset comprises ---elements--- that each have the same structure.
An element contains one or more ----tf.Tensor objects---, called ----components---.
----- Each component has a tf.DType representing the type of elements in the tensor
----- and a tf.TensorShape representing the (possibly partially specified) static shape of each element.
"""
### PROPERTIES ===>> Dataset.output_types and Dataset.output_shapes
"""
The Dataset.output_types and Dataset.output_shapes properties
----allow you to inspect the inferred types
----and shapes of each component of a dataset element.
The nested structure of these properties maps to the structure of an element,
--- which may be a single tensor,
--- a tuple of tensors,
--- or a nested tuple of tensors.
"""
In [4]:
dataset1 = tf.data.Dataset.from_tensor_slices(tf.random_uniform([4, 1000]))
print(dataset1.output_types) # ==> "tf.float32"
print(dataset1.output_shapes) # ==> "(1000,)"
#
print(dataset1)
In [6]:
dataset2 = tf.data.Dataset.from_tensor_slices(
(tf.random_uniform([4]),
tf.random_uniform([4, 100], maxval=100, dtype=tf.int32)))
print(dataset2.output_types) # ==> "(tf.float32, tf.int32)"
print(dataset2.output_shapes) # ==> "((), (100,))"
#
print(dataset2)
In [5]:
dataset3 = tf.data.Dataset.zip((dataset1, dataset2))
print(dataset3.output_types) # ==> (tf.float32, (tf.float32, tf.int32))
print(dataset3.output_shapes) # ==> "((1000,), ((), (100,)))"
In [ ]:
"""
It is often convenient to give names to each component of an element,
for example if they represent different features of a training example.
In addition to tuples, you can use collections.namedtuple or a dictionary mapping strings to tensors
to represent a single element of a Dataset.
"""
In [9]:
### Official
dataset = tf.data.Dataset.from_tensor_slices(
{"a": tf.random_uniform([4]),
"b": tf.random_uniform([4, 100], maxval=100, dtype=tf.int32)})
print(dataset.output_types) # ==> "{'a': tf.float32, 'b': tf.int32}"
print(dataset.output_shapes) # ==> "{'a': (), 'b': (100,)}"
In [12]:
## DHANKAR ---
dataset_11 = tf.data.Dataset.from_tensor_slices(
{
"a": tf.random_uniform([4, 500], maxval=1000, dtype=tf.int32),
"b": tf.random_uniform([4, 100], maxval=100, dtype=tf.int32)
}
)
print(dataset_11.output_types) # ==> "{'a': tf.int32, 'b': tf.int32}"
print(dataset_11.output_shapes) # ==> "{'a': (500,), 'b': (100,)}"
In [13]:
### CSV loading -- ERRORS here --- FATT
# Later TF versions ship a CSV dataset class.
## Documentation for version 1.12:
# https://www.tensorflow.org/api_docs/python/tf/contrib/data/CsvDataset
# v1.6.0 has it as experimental --
##- tensorflow/tensorflow/python/data/experimental/benchmarks/csv_dataset_benchmark.py
# Right now we are on v1.5.0, which does not have it.
# /a6_18/OwnFork_TensorFlow/tensorflow/tensorflow/contrib/data/python/ops/readers.py
# Creates a dataset that reads all of the records from the CSV file below,
# expected to have eight float columns.
filenames = ["/media/dhankar/Dhankar_1/a6_18/Tensors_et_al/date_fmts.csv"]
record_defaults = [tf.float32] * 8 # Eight required float columns
dataset = tf.contrib.data.CsvDataset(filenames, record_defaults)
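### A v1.5-compatible fallback sketch (my own, untested against this file):
### read lines with tf.data.TextLineDataset and parse them with tf.decode_csv.
### Assumes the file really has eight float columns and no header row.
fallback_defaults = [[0.0]] * 8  # eight float columns, defaulting to 0.0
csv_lines = tf.data.TextLineDataset(filenames)
csv_dataset = csv_lines.map(
    lambda line: tf.decode_csv(line, record_defaults=fallback_defaults))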
In [ ]:
##FATT --- CSV OnHold
# Source --- https://medium.com/mostly-ai/tensorflow-records-what-they-are-and-how-to-use-them-c46bc4bbb564
## https://github.com/tensorflow/tensorflow/blob/r1.5/tensorflow/core/example/example.proto
"""
If your dataset consists of features, where each feature is a list of values of the same type,
tf.train.Example is the right component to use.
We have a number of features,
each being a list where every entry has the same data type.
In order to store these features in a TFRecord,
we first need to create the lists that constitute the features:
tf.train.BytesList
tf.train.FloatList
tf.train.Int64List
are at the core of a tf.train.Feature.
All three have a single attribute, value, which expects a list of the respective type:
--- bytes,
--- float,
--- int.
"""
### tf.train.Feature
"""
tf.train.Feature --- wraps a list of data of a specific type so Tensorflow can understand it.
It has a single attribute, which is a ---union of ----bytes_list/float_list/int64_list.
Being a union, the stored list can be of type
--- tf.train.BytesList (attribute name bytes_list),
--- tf.train.FloatList (attribute name float_list),
--- tf.train.Int64List (attribute name int64_list).
tf.train.Features ----PLURAL----Features---- is a collection of named features.
It has a single attribute feature that expects a dictionary where the --- key ----is the name of the features
---- and the value a tf.train.Feature.
"""
"""
In our example, each TFRecord represents the movie ratings and corresponding suggestions
of a single user (a single sample).
Writing recommendations for all users in the dataset follows the same process.
It is important that the type of a feature (e.g. float for the movie rating) is the same across all samples
in the dataset.
This conformance criterion and others are defined in the protocol buffer definition of tf.train.Example.
"""
In [5]:
# Create example data
# Source --- https://medium.com/mostly-ai/tensorflow-records-what-they-are-and-how-to-use-them-c46bc4bbb564
data = {
'Age': 29,
'Movie': ['The Shawshank Redemption', 'Fight Club'],
'Movie Ratings': [9.0, 9.7],
'Suggestion': 'Inception',
'Suggestion Purchased': 1.0,
'Purchase Price': 9.99
}
print(data)
In [5]:
# Create the Example
# Source --- https://medium.com/mostly-ai/tensorflow-records-what-they-are-and-how-to-use-them-c46bc4bbb564
example = tf.train.Example(features=tf.train.Features(feature={
'Age': tf.train.Feature(
int64_list=tf.train.Int64List(value=[data['Age']])),
'Movie': tf.train.Feature(
bytes_list=tf.train.BytesList(
value=[m.encode('utf-8') for m in data['Movie']])),
'Movie Ratings': tf.train.Feature(
float_list=tf.train.FloatList(value=data['Movie Ratings'])),
'Suggestion': tf.train.Feature(
bytes_list=tf.train.BytesList(
value=[data['Suggestion'].encode('utf-8')])),
'Suggestion Purchased': tf.train.Feature(
float_list=tf.train.FloatList(
value=[data['Suggestion Purchased']])),
'Purchase Price': tf.train.Feature(
float_list=tf.train.FloatList(value=[data['Purchase Price']]))
}))
print(example)
In [6]:
# Write TFRecord file
with tf.python_io.TFRecordWriter('customer_1.tfrecord') as writer:
#
writer.write(example.SerializeToString())
In [7]:
# Read and print data:
sess = tf.InteractiveSession()
# Read TFRecord file
reader = tf.TFRecordReader()
filename_queue = tf.train.string_input_producer(['customer_1.tfrecord'])
_, serialized_example = reader.read(filename_queue)
# Define features
read_features = {
'Age': tf.FixedLenFeature([], dtype=tf.int64),
'Movie': tf.VarLenFeature(dtype=tf.string),
'Movie Ratings': tf.VarLenFeature(dtype=tf.float32),
'Suggestion': tf.FixedLenFeature([], dtype=tf.string),
'Suggestion Purchased': tf.FixedLenFeature([], dtype=tf.float32),
'Purchase Price': tf.FixedLenFeature([], dtype=tf.float32)}
# Extract features from serialized data
read_data = tf.parse_single_example(serialized=serialized_example,
features=read_features)
# Many tf.train functions use tf.train.QueueRunner,
# so we need to start it before we read
tf.train.start_queue_runners(sess)
# Print features
for name, tensor in read_data.items():
print('{}: {}'.format(name, tensor.eval()))
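### The same record can also be read without queues, via the tf.data API from
### the first half of these notes -- a sketch (assumes TF 1.4+, where
### tf.data.TFRecordDataset is in core). Reuses read_features from above.
record_ds = tf.data.TFRecordDataset(['customer_1.tfrecord'])
parsed_ds = record_ds.map(
    lambda s: tf.parse_single_example(s, features=read_features))
parsed = parsed_ds.make_one_shot_iterator().get_next()
with tf.Session() as s2:
    for name, value in s2.run(parsed).items():
        print('{}: {}'.format(name, value))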
In [7]:
# Create example data
data1 = {
# Context
'Locale': 'pt_BR',
'Age': 19,
'Favorites': ['Majesty Rose', 'Savannah Outen', 'One Direction'],
# Data
'Data': [
{ # Movie 1
'Movie Name': 'The Shawshank Redemption',
'Movie Rating': 9.0,
'Actors': ['Tim Robbins', 'Morgan Freeman']
},
{ # Movie 2
'Movie Name': 'Fight Club',
'Movie Rating': 9.7,
'Actors': ['Brad Pitt', 'Edward Norton', 'Helena Bonham Carter']
}
]
}
print(data1)
In [10]:
# Create the context features (short form)
customer = tf.train.Features(feature={
'Locale': tf.train.Feature(bytes_list=tf.train.BytesList(
value=[data1['Locale'].encode('utf-8')])),
'Age': tf.train.Feature(int64_list=tf.train.Int64List(
value=[data1['Age']])),
'Favorites': tf.train.Feature(bytes_list=tf.train.BytesList(
value=[m.encode('utf-8') for m in data1['Favorites']]))
})
# Create sequence data
names_features = []
ratings_features = []
actors_features = []
for movie in data1['Data']:
# Create each of the features, then add it to the
# corresponding feature list
movie_name_feature = tf.train.Feature(
bytes_list=tf.train.BytesList(
value=[movie['Movie Name'].encode('utf-8')]))
names_features.append(movie_name_feature)
movie_rating_feature = tf.train.Feature(
float_list=tf.train.FloatList(value=[movie['Movie Rating']]))
ratings_features.append(movie_rating_feature)
movie_actors_feature = tf.train.Feature(
bytes_list=tf.train.BytesList(
value=[m.encode('utf-8') for m in movie['Actors']]))
actors_features.append(movie_actors_feature)
movie_names = tf.train.FeatureList(feature=names_features)
movie_ratings = tf.train.FeatureList(feature=ratings_features)
movie_actors = tf.train.FeatureList(feature=actors_features)
movies = tf.train.FeatureLists(feature_list={
'Movie Names': movie_names,
'Movie Ratings': movie_ratings,
'Movie Actors': movie_actors
})
# Create the SequenceExample
example = tf.train.SequenceExample(context=customer, feature_lists=movies)
print(example)
In [11]:
# Write TFRecord file
with tf.python_io.TFRecordWriter('customer_2.tfrecord') as writer:
writer.write(example.SerializeToString())
In [12]:
# Read and print data:
sess = tf.InteractiveSession()
# Read TFRecord file
reader = tf.TFRecordReader()
# NOTE: customer_1.tfrecord holds a plain Example, not a SequenceExample, so
# the sequence parse below fails; the next cell re-runs the same code
# against customer_2.tfrecord.
filename_queue = tf.train.string_input_producer(['customer_1.tfrecord'])
_, serialized_example = reader.read(filename_queue)
# Define features
context_features = {
'Locale': tf.FixedLenFeature([], dtype=tf.string),
'Age': tf.FixedLenFeature([], dtype=tf.int64),
'Favorites': tf.VarLenFeature(dtype=tf.string)
}
sequence_features = {
'Movie Names': tf.FixedLenSequenceFeature([], dtype=tf.string),
'Movie Ratings': tf.FixedLenSequenceFeature([], dtype=tf.float32),
'Movie Actors': tf.VarLenFeature(dtype=tf.string)
}
# Extract features from serialized data
context_data, sequence_data = tf.parse_single_sequence_example(
serialized=serialized_example,
context_features=context_features,
sequence_features=sequence_features)
# Many tf.train functions use tf.train.QueueRunner,
# so we need to start it before we read
tf.train.start_queue_runners(sess)
# Print features
print('Context:')
for name, tensor in context_data.items():
print('{}: {}'.format(name, tensor.eval()))
print('\nData')
for name, tensor in sequence_data.items():
print('{}: {}'.format(name, tensor.eval()))
In [13]:
#### DHANKAR ---- customer_2.tfrecord
#
# Read and print data:
sess = tf.InteractiveSession()
# Read TFRecord file
reader = tf.TFRecordReader()
filename_queue = tf.train.string_input_producer(['customer_2.tfrecord'])
_, serialized_example = reader.read(filename_queue)
# Define features
context_features = {
'Locale': tf.FixedLenFeature([], dtype=tf.string),
'Age': tf.FixedLenFeature([], dtype=tf.int64),
'Favorites': tf.VarLenFeature(dtype=tf.string)
}
sequence_features = {
'Movie Names': tf.FixedLenSequenceFeature([], dtype=tf.string),
'Movie Ratings': tf.FixedLenSequenceFeature([], dtype=tf.float32),
'Movie Actors': tf.VarLenFeature(dtype=tf.string)
}
# Extract features from serialized data
context_data, sequence_data = tf.parse_single_sequence_example(
serialized=serialized_example,
context_features=context_features,
sequence_features=sequence_features)
# Many tf.train functions use tf.train.QueueRunner,
# so we need to start it before we read
tf.train.start_queue_runners(sess)
# Print features
print('Context:')
for name, tensor in context_data.items():
print('{}: {}'.format(name, tensor.eval()))
print('\nData')
for name, tensor in sequence_data.items():
print('{}: {}'.format(name, tensor.eval()))