lazy-programmer-courses / nlp
Commit 675c29bb authored 6 years ago by Paktalin
Edited sms spam detector
parent 050c8495
branch master
Showing 1 changed file with 55 additions and 0 deletions
sms_spam_detector_17.py
0 → 100644
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from wordcloud import WordCloud
def train_test_split(X, Y, test_size):
    # Convert the fraction into a row count, then hold out the last rows (no shuffling)
    test_size = int(test_size * X.shape[0])
    Xtrain = X[:-test_size]
    Xtest = X[-test_size:]
    Ytrain = Y[:-test_size]
    Ytest = Y[-test_size:]
    return Xtrain, Xtest, Ytrain, Ytest
def visualize(label):
    # Join every lower-cased message with the given label into one string
    words = ''
    for msg in df[df['labels'] == label]['data']:
        msg = msg.lower()
        words += msg + ' '
    # Render the joined text as a word cloud
    word_cloud = WordCloud(width=600, height=400).generate(words)
    plt.imshow(word_cloud)
    plt.axis('off')
    plt.show()
df = pd.read_csv('./files/sms_spam.csv', encoding='ISO-8859-1')
df = df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)
df.columns = ['labels', 'data']
df['b_labels'] = df['labels'].map({'ham': 0, 'spam': 1})
Y = df['b_labels'].values
count_vectorizer = CountVectorizer(decode_error='ignore')
X = count_vectorizer.fit_transform(df['data'])
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.33)
model = MultinomialNB()
model.fit(Xtrain, Ytrain)
print('Train score is', model.score(Xtrain, Ytrain))
print('Test score is', model.score(Xtest, Ytest))
visualize('spam')
visualize('ham')
df['predictions'] = model.predict(X)
sneaky_spam = df[(df['b_labels'] == 1) & (df['predictions'] == 0)]['data']
for msg in sneaky_spam:
    print(msg)
print('\n\n')
# Parenthesize both comparisons: `&` binds more tightly than `==`
not_actually_spam = df[(df['b_labels'] == 0) & (df['predictions'] == 1)]['data']
for msg in not_actually_spam:
    print(msg)
\ No newline at end of file
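The `train_test_split` helper in this file holds out the last `test_size` fraction of rows without shuffling, so the test set may be unrepresentative if the CSV happens to be ordered. A minimal sketch of a shuffled variant follows. The names `shuffled_split` and `seed` are hypothetical and not part of this commit, and a fixed seed is assumed to be acceptable for reproducibility.

```python
import numpy as np


def shuffled_split(X, Y, test_size, seed=0):
    """Split X and Y into train/test parts after randomly permuting the rows.

    Hypothetical helper for illustration; it uses row indexing only, which
    NumPy arrays and SciPy CSR matrices both support.
    """
    n_samples = X.shape[0]
    n_test = int(test_size * n_samples)

    # Draw one permutation and reuse it so X and Y stay aligned
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    test_idx, train_idx = order[:n_test], order[n_test:]

    return X[train_idx], X[test_idx], Y[train_idx], Y[test_idx]
```

It takes the same arguments as the helper in the file, for example `shuffled_split(X, Y, test_size=0.33)`.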
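The two misclassification filters at the end of the file combine comparisons with `&`. In Python, `&` binds more tightly than `==`, so each comparison is safest in its own parentheses. The sketch below shows the pattern on a toy frame with made-up rows, not data from `sms_spam.csv`.

```python
import pandas as pd

# Toy data for illustration only: 0 = ham, 1 = spam
toy = pd.DataFrame({
    'b_labels':    [0, 0, 1, 1],
    'predictions': [0, 1, 0, 1],
    'data':        ['ham ok', 'ham flagged', 'spam missed', 'spam caught'],
})

# Spam the model let through (false negatives)
false_negatives = toy[(toy['b_labels'] == 1) & (toy['predictions'] == 0)]['data']

# Ham the model flagged as spam (false positives)
false_positives = toy[(toy['b_labels'] == 0) & (toy['predictions'] == 1)]['data']

print(list(false_negatives))  # ['spam missed']
print(list(false_positives))  # ['ham flagged']
```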