In [1]:

import graphlab
import os 
os.getcwd()


Out[1]:

'C:\\Anaconda2\\envs\\dato-env'

In [2]:

# Load the tab-separated training data into an SFrame
df_1 = graphlab.SFrame("train_1.tsv")



[INFO] Start server at: ipc:///tmp/graphlab_server-7380 - Server binary: C:\Anaconda2\envs\dato-env\lib\site-packages\graphlab\unity_server.exe - Server log: C:\Users\Rohit\AppData\Local\Temp\graphlab_server_1455190805.log.0
[INFO] GraphLab Server Version: 1.8.1



PROGRESS: Finished parsing file C:\Anaconda2\envs\dato-env\train_1.tsv
PROGRESS: Parsing completed. Parsed 100 lines in 0.483028 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[long,long,str,long]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Finished parsing file C:\Anaconda2\envs\dato-env\train_1.tsv
PROGRESS: Parsing completed. Parsed 156060 lines in 0.227013 secs.


In [3]:

# Binarize the sentiment label: 1 if the score is above 3, otherwise 0
convert_func = lambda x: 1 if x > 3 else 0
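
To see what this threshold does, here is a minimal sketch in plain Python. It assumes the Sentiment labels are integers from 0 to 4 (an assumption; the full label range is not shown in this notebook), and `to_target` is a hypothetical name that mirrors the lambda above.

```python
# Hypothetical helper that mirrors the lambda above.
# The 0-4 label range is an assumption, not something this notebook shows.
def to_target(sentiment):
    """Return 1 for a sentiment score above 3, otherwise 0."""
    return 1 if sentiment > 3 else 0

labels = [0, 1, 2, 3, 4]
targets = [to_target(s) for s in labels]
print(list(zip(labels, targets)))  # [(0, 0), (1, 0), (2, 0), (3, 0), (4, 1)]
```

Under that assumption, only the highest score maps to 1; a score of exactly 3 maps to 0 because the comparison is strict.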


In [4]:

df_1[100:110]


Out[4]:

PhraseId	SentenceId	Phrase	Sentiment
101	3	would have a hard time sitting through this one ...	1
102	3	would have a hard time sitting through this one ...	0
103	3	would	2
104	3	have a hard time sitting through this one ...	0
105	3	have	2
106	3	a hard time sitting through this one ...	1
107	3	a hard time	1
108	3	hard time	1
109	3	hard	2
110	3	time	2

[10 rows x 4 columns]

In [5]:

df_1[1000:1010]
# Within the first 1000 rows of the SFrame, SentenceId has moved from 3 to 36 and PhraseId from 110 to 1001


Out[5]:

PhraseId	SentenceId	Phrase	Sentiment
1001	36	to avoid	1
1002	36	avoid	0
1003	37	It almost feels as if the movie is more interested ...	1
1004	37	almost feels as if the movie is more interested ...	0
1005	37	feels as if the movie is more interested in ...	1
1006	37	feels as if the movie is more interested in ...	1
1007	37	feels	2
1008	37	as if the movie is more interested in ...	1
1009	37	if the movie is more interested in ...	1
1010	37	if	2

[10 rows x 4 columns]
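
The rows above suggest that each sentence is split into many phrases, which would explain why PhraseId grows much faster than SentenceId. As a small sketch in plain Python, using only the ten rows displayed here and not the full SFrame, the phrases per sentence can be counted like this:

```python
from collections import Counter

# (PhraseId, SentenceId) pairs copied from the ten rows displayed above
rows = [(1001, 36), (1002, 36), (1003, 37), (1004, 37), (1005, 37),
        (1006, 37), (1007, 37), (1008, 37), (1009, 37), (1010, 37)]

# Count how many of the displayed phrases belong to each sentence
phrases_per_sentence = Counter(sentence_id for _, sentence_id in rows)
print(dict(phrases_per_sentence))
```

The counts cover only this ten-row slice, so they are a lower bound on the number of phrases each sentence really has.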

In [6]:

# Add a new column named Target at the end of the SFrame.
# Each Sentiment value is converted by convert_func: 1 if Sentiment > 3, otherwise 0.
df_1['Target'] = df_1['Sentiment'].apply(convert_func)
df_1[100:110]


Out[6]:

PhraseId	SentenceId	Phrase	Sentiment	Target
101	3	would have a hard time sitting through this one ...	1	0
102	3	would have a hard time sitting through this one ...	0	0
103	3	would	2	0
104	3	have a hard time sitting through this one ...	0	0
105	3	have	2	0
106	3	a hard time sitting through this one ...	1	0
107	3	a hard time	1	0
108	3	hard time	1	0
109	3	hard	2	0
110	3	time	2	0

[10 rows x 5 columns]
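
Every Target value in this slice is 0 because none of the displayed Sentiment values exceeds 3. A quick check in plain Python, using a list in place of the SFrame column (so this is only an illustration of the elementwise conversion, not GraphLab code):

```python
# Same rule as convert_func in cell In [3]
convert_func = lambda x: 1 if x > 3 else 0

# Sentiment values of rows 100-109, copied from the output above
sentiments = [1, 0, 2, 0, 2, 1, 1, 1, 2, 2]

# Elementwise conversion, analogous to calling .apply(convert_func) on the column
targets = [convert_func(s) for s in sentiments]
print(targets)  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```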

In [8]:

# Add a word_count column holding, for each Phrase, a dictionary of word -> number of occurrences
df_1['word_count'] = graphlab.text_analytics.count_words(df_1['Phrase'])


In [12]:

df_1['Phrase'][1:2]
# One row of dtype str is printed from the Phrase column of the SFrame.
# Every word in it occurs once, except 'the', which occurs twice.


Out[12]:

dtype: str
Rows: 1
['A series of escapades demonstrating the adage that what is good for the goose']

In [9]:

df_1['word_count'][1:2]
# One row of dtype dict is printed, enclosed in curly braces.
# Every word has a count of 1L except 'the', which has 2L (the L suffix marks a Python 2 long integer).
# The words are lowercased, so 'A' appears as the key 'a'.


Out[9]:

dtype: dict
Rows: 1
[{'a': 1L, 'what': 1L, 'good': 1L, 'escapades': 1L, 'for': 1L, 'that': 1L, 'series': 1L, 'is': 1L, 'goose': 1L, 'adage': 1L, 'demonstrating': 1L, 'of': 1L, 'the': 2L}]
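
The dictionary above can be approximated in plain Python. This is a rough sketch, not GraphLab's implementation: `count_words_sketch` is a hypothetical helper that assumes lowercasing and splitting on whitespace are enough, which holds for this phrase but may differ from `count_words` on text with punctuation.

```python
from collections import Counter

def count_words_sketch(phrase):
    """Hypothetical stand-in: lowercase, split on whitespace, count each token."""
    return dict(Counter(phrase.lower().split()))

# The phrase printed in Out[12]
phrase = 'A series of escapades demonstrating the adage that what is good for the goose'
word_count = count_words_sketch(phrase)

# 'the' occurs twice, 'a' once, and there are 13 distinct words
print(word_count['the'], word_count['a'], len(word_count))  # 2 1 13
```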

In [13]:

df_1['Phrase'][2:4]
# Two rows are printed: the first has two words ('A series'), the second has one word ('A').


Out[13]:

dtype: str
Rows: 2
['A series', 'A']

In [10]:

df_1['word_count'][2:4]
# Two rows of dtype dict are printed, each dictionary within its own curly braces.
# The square brackets enclose the complete output for the requested slice.


Out[10]:

dtype: dict
Rows: 2
[{'a': 1L, 'series': 1L}, {'a': 1L}]

In [11]:

df_1['word_count'][2:10]


Out[11]:
dtype: dict
Rows: 8
[{'a': 1L, 'series': 1L}, {'a': 1L}, {'series': 1L}, {'what': 1L, 'good': 1L, 'for': 1L, 'escapades': 1L, 'that': 1L, 'of': 1L, 'is': 1L, 'goose': 1L, 'adage': 1L, 'demonstrating': 1L, 'the': 2L}, {'of': 1L}, {'what': 1L, 'good': 1L, 'escapades': 1L, 'for': 1L, 'that': 1L, 'is': 1L, 'goose': 1L, 'adage': 1L, 'demonstrating': 1L, 'the': 2L}, {'escapades': 1L}, {'what': 1L, 'good': 1L, 'for': 1L, 'that': 1L, 'is': 1L, 'goose': 1L, 'adage': 1L, 'demonstrating': 1L, 'the': 2L}]

Data Science with R and Python

Wednesday 10 February 2016

Sentiment Analysis - #iPython #Dato #GraphLab Create - Rotten Tomatoes - Public Kaggle Data

