Wednesday 10 February 2016

Sentiment Analysis - #iPython #Dato #GraphLab Create - Rotten Tomatoes - Public Kaggle Data

In [1]:
import graphlab
import os 
os.getcwd()
Out[1]:
'C:\\Anaconda2\\envs\\dato-env'
In [2]:
df_1=graphlab.SFrame("train_1.tsv")
[INFO] Start server at: ipc:///tmp/graphlab_server-7380 - Server binary: C:\Anaconda2\envs\dato-env\lib\site-packages\graphlab\unity_server.exe - Server log: C:\Users\Rohit\AppData\Local\Temp\graphlab_server_1455190805.log.0
[INFO] GraphLab Server Version: 1.8.1
PROGRESS: Finished parsing file C:\Anaconda2\envs\dato-env\train_1.tsv
PROGRESS: Parsing completed. Parsed 100 lines in 0.483028 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[long,long,str,long]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Finished parsing file C:\Anaconda2\envs\dato-env\train_1.tsv
PROGRESS: Parsing completed. Parsed 156060 lines in 0.227013 secs.
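The log above reports that the column types [long,long,str,long] were inferred from the first line of the file and can be overridden through column_type_hints. As a rough illustration of what typed parsing of a tab-separated file involves, here is a plain-Python sketch that uses only the standard library and not GraphLab. The two sample rows are taken from the output shown later in this post; the names COLUMN_TYPES and parse_tsv are made up for this example.

```python
import csv
import io

# Hypothetical column types, mirroring the inferred [long, long, str, long].
COLUMN_TYPES = [
    ("PhraseId", int),
    ("SentenceId", int),
    ("Phrase", str),
    ("Sentiment", int),
]


def parse_tsv(text):
    """Parse tab-separated text with a header row into a list of typed dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for raw in reader:
        # Cast every field with the type declared for its column.
        rows.append({name: cast(raw[name]) for name, cast in COLUMN_TYPES})
    return rows


sample = (
    "PhraseId\tSentenceId\tPhrase\tSentiment\n"
    "103\t3\twould\t2\n"
    "107\t3\ta hard time\t1\n"
)
rows = parse_tsv(sample)
print(rows[1])
```

If a column held a value that does not fit its declared type, the cast would raise a ValueError, which is the kind of failure the log message has in mind when it suggests correcting the type list.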
In [3]:
# Map a Sentiment value to a binary Target: 1 if the value is above 3, else 0.
convert_func = lambda x: 1 if x > 3 else 0
In [4]:
df_1[100:110]
Out[4]:
PhraseId | SentenceId | Phrase                                              | Sentiment
101      | 3          | would have a hard time sitting through this one ... | 1
102      | 3          | would have a hard time sitting through this one ... | 0
103      | 3          | would                                               | 2
104      | 3          | have a hard time sitting through this one ...       | 0
105      | 3          | have                                                | 2
106      | 3          | a hard time sitting through this one ...            | 1
107      | 3          | a hard time                                         | 1
108      | 3          | hard time                                           | 1
109      | 3          | hard                                                | 2
110      | 3          | time                                                | 2
[10 rows x 4 columns]
In [5]:
df_1[1000:1010]
# Within about 1000 rows of the DF, the SentenceId has moved from 3 to 36 and the PhraseId from 110 to 1001.
Out[5]:
PhraseId | SentenceId | Phrase                                                 | Sentiment
1001     | 36         | to avoid                                               | 1
1002     | 36         | avoid                                                  | 0
1003     | 37         | It almost feels as if the movie is more interested ... | 1
1004     | 37         | almost feels as if the movie is more interested ...    | 0
1005     | 37         | feels as if the movie is more interested in ...        | 1
1006     | 37         | feels as if the movie is more interested in ...        | 1
1007     | 37         | feels                                                  | 2
1008     | 37         | as if the movie is more interested in ...              | 1
1009     | 37         | if the movie is more interested in ...                 | 1
1010     | 37         | if                                                     | 2
[10 rows x 4 columns]
In [6]:
df_1['Target'] = df_1['Sentiment'].apply(convert_func)
# Adds a new column named Target at the end of the DF.
# Each Sentiment value is converted as defined in convert_func.
df_1[100:110]
Out[6]:
PhraseId | SentenceId | Phrase                                              | Sentiment | Target
101      | 3          | would have a hard time sitting through this one ... | 1         | 0
102      | 3          | would have a hard time sitting through this one ... | 0         | 0
103      | 3          | would                                               | 2         | 0
104      | 3          | have a hard time sitting through this one ...       | 0         | 0
105      | 3          | have                                                | 2         | 0
106      | 3          | a hard time sitting through this one ...            | 1         | 0
107      | 3          | a hard time                                         | 1         | 0
108      | 3          | hard time                                           | 1         | 0
109      | 3          | hard                                                | 2         | 0
110      | 3          | time                                                | 2         | 0
[10 rows x 5 columns]
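To make the effect of convert_func concrete, the sketch below applies it to plain Python lists instead of an SFrame column. It assumes the Sentiment labels run from 0 to 4, as in the Kaggle Rotten Tomatoes data; under that assumption only label 4 satisfies x > 3, so labels 0 to 3 all map to 0, which matches the all-zero Target values in the ten rows above.

```python
# Same rule as the notebook's convert_func.
convert_func = lambda x: 1 if x > 3 else 0

# Assumed label range for this dataset: 0 (most negative) to 4 (most positive).
labels = [0, 1, 2, 3, 4]
targets = [convert_func(label) for label in labels]
print(list(zip(labels, targets)))

# Sentiment values of rows 100:110, copied from the output above.
sentiment_rows = [1, 0, 2, 0, 2, 1, 1, 1, 2, 2]
row_targets = [convert_func(s) for s in sentiment_rows]
print(row_targets)
```

With this threshold, phrases labelled 3 count as 0 in Target, so the binary target separates only the most positive phrases from everything else.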
In [8]:
df_1['word_count'] = graphlab.text_analytics.count_words(df_1['Phrase'])
# Adds a word_count column holding, for each Phrase, a dict of word frequencies.
In [12]:
df_1['Phrase'][1:2]
# Below we get 1 row of dtype str, printed from the Phrase column of our DF.
# Every word in it occurs once, except 'the', which occurs twice.
Out[12]:
dtype: str
Rows: 1
['A series of escapades demonstrating the adage that what is good for the goose']
In [9]:
df_1['word_count'][1:2]
# Below, one row of dtype dict is printed inside the curly braces.
# The frequency of occurrence is 1L for every word except 'the', which is 2L.
Out[9]:
dtype: dict
Rows: 1
[{'a': 1L, 'what': 1L, 'good': 1L, 'escapades': 1L, 'for': 1L, 'that': 1L, 'series': 1L, 'is': 1L, 'goose': 1L, 'adage': 1L, 'demonstrating': 1L, 'of': 1L, 'the': 2L}]
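For readers without GraphLab installed, the word counts above can be approximated in plain Python. This is only a sketch of the idea and not GraphLab's implementation: it assumes lowercasing and splitting on whitespace, which is enough to reproduce the counts for this particular phrase, and the helper name count_words_sketch is invented here.

```python
from collections import Counter


def count_words_sketch(phrase):
    """Return a dict mapping each lowercased, whitespace-separated word to its count."""
    return dict(Counter(phrase.lower().split()))


phrase = "A series of escapades demonstrating the adage that what is good for the goose"
counts = count_words_sketch(phrase)

# Count for 'the', count for 'a', and the number of distinct words.
print(counts["the"], counts["a"], len(counts))
```

Under Python 3 the counts print as plain integers such as 2; the L suffix in the notebook output is how Python 2 displays long integers.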
In [13]:
df_1['Phrase'][2:4]
# Two rows are printed: the first has two words, 'A series',
# and the second has one word, 'A'.
Out[13]:
dtype: str
Rows: 2
['A series', 'A']
In [10]:
df_1['word_count'][2:4]
# Below, two rows of dtype dict are printed, one row per pair of curly braces.
# The square brackets surround the complete output we requested.
Out[10]:
dtype: dict
Rows: 2
[{'a': 1L, 'series': 1L}, {'a': 1L}]
In [11]:
df_1['word_count'][2:10]
Out[11]:
dtype: dict
Rows: 8
[{'a': 1L, 'series': 1L}, {'a': 1L}, {'series': 1L}, {'what': 1L, 'good': 1L, 'for': 1L, 'escapades': 1L, 'that': 1L, 'of': 1L, 'is': 1L, 'goose': 1L, 'adage': 1L, 'demonstrating': 1L, 'the': 2L}, {'of': 1L}, {'what': 1L, 'good': 1L, 'escapades': 1L, 'for': 1L, 'that': 1L, 'is': 1L, 'goose': 1L, 'adage': 1L, 'demonstrating': 1L, 'the': 2L}, {'escapades': 1L}, {'what': 1L, 'good': 1L, 'for': 1L, 'that': 1L, 'is': 1L, 'goose': 1L, 'adage': 1L, 'demonstrating': 1L, 'the': 2L}]




