首页
网站开发
桌面应用
管理软件
微信开发
App开发
嵌入式软件
工具软件
数据采集与分析
其他
首页
>
> 详细
CS1003编程语言讲解、辅导Programming程序、java编程设计调试 解析Haskell程序|解析Haskell程序
项目预算:
开发周期:
发布时间:
要求地区:
University of St Andrews
School of Computer Science
CS1003 — Programming with Data
P1— Text Processing
Deadline: 5 February 2021 Credits: 10% of coursework mark
MMS is the definitive source for deadline and credit details
You are expected to have read and understood all the information in this specification
and any accompanying documents at least a week before the deadline. you must contact
the lecturer regarding any queries well in advance of the deadline.
This practical involves reading data from a file and using basic text processing techniques to solve a
specified problem. You will need to decompose the problem into a number of methods as appropriate
and classes if necessary. You will also need to test your solution carefully and write a report.
Task
The task is to write a Java program to perform string similarity search among words stored in a text file.
The code you are going to write is similar to code that is found in spell-checkers. Your program should
accept two command line arguments, the first is the path of a text file (which contains a dictionary of
commonly used English words) and the second is a query string. The program should then read the
text in the file and split it into lines, where each line contains a single word. Then, calculate a similarity
score between the query word and each word read from the file. Finally, print the closest match from
the file (the word with the highest similarity score) to standard output, together with the similarity
score. Place your main method in a class called CS1003P1.java. Some example runs are as follows.
Searching for the closest word to ‘strawberry’
> java CS1003P1 ../data/words_alpha.txt strawberry
Result: strawberry
Score: 1.0
Searching for the closest word to ‘stravberry’
> java CS1003P1 ../data/words_alpha.txt stravberry
Result: strawberry
Score: 0.6923077
Searching for the closest word to ‘ztravberry’
> java CS1003P1 ../data/words_alpha.txt ztravberry
Result: strawberry
Score: 0.46666667
String similarity
There are several ways of calculating a similarity score between strings, in this practical we ask you to
use a Jaccard index on character bigrams. This might sound scary at first, but don’t worry! We will
now define what we mean and give an example.
Jaccard index
The Jaccard index is a similarity measure between sets of objects. It is calculated by dividing the
size of the intersection of the two sets by the size of the union of the same two sets. If the two sets
are very similar, the value of the Jaccard index will be close to 1 (if the two sets are identical it will
be exactly 1). On the other hand, if the two sets are very dissimilar, the value of the Jaccard index
will be close to 0 (if the two sets are disjoint it will be exactly 0). Try drawing a few simple Venn
diagrams to convince yourselves of this! Wikipedia has a good article on the Jaccard index as well:
https://en.wikipedia.org/wiki/Jaccard_index
Character bigrams
A character bigram is a sequence of two consecutive characters in a string. Bigrams have applications
in several areas of text processing like linguistics, cryptography, speech recognition, and text search. In
this practical, we will calculate the Jaccard index on sets of bigrams for calculating a similarity score
between strings. Following is an example of the set of bigrams for the string ‘cocoa’: ‘co’, ‘oc’, ‘oa’.
Notice that since we generate a set of bigrams, we avoid repeating ‘co’ twice.
Your program should contain a method to create a set of bigrams for a given string.
Top and tail
Adding special characters to the start and the end of a string before calculating the set of bigrams can
improve string similarity search. This is often done by adding a ‘ˆ’ character to the beginning and a
‘$’ character to the end of the string. On the same example, ‘cocoa’, we first add the special characters
to either side and get to ‘ˆcocoa$’. The set of bigrams becomes: ‘ˆc’, ‘co’, ‘oc’, ‘oa’, ‘a$’.
Suggested steps
• Download the text file words alpha.txt 1
from StudRes and save it to a known location. You
should not submit this file as part of your submission.
• Create a Java class called CS1003P1 and write a program that is able to read the data stored in
this text file line by line. In order to check that this works, print each line to standard output.
See method from the class.
• Write a method to calculate character bigrams of a given string and store them in a set. See the
and classes and the ❛❞❞ method that they implement. You may test your method
with the string ‘cocoa’, the output should match the given output above.
• Implement top-and-tail as described above and update your bigram calculation to use this functionality.
• Implement the Jaccard index calculation. Which two sets will you calculate the Jaccard index
on? We suggest that you use the method for implementing set intersection and the
❛❞❞❆❧❧ method for implementing set union. The method returns the size of a set. If you
1Source: https://github.com/dwyl/english-words
2
calculate the Jaccard index between the set {“1”,“2”,“3”} and the set {“1”,“2”,“4”} (where the
size of the intersection is 2 and the size of the union is 4) you should get 2/4 = 0.5 as the result.
• Combining character bigrams, top-and-tail and Jaccard index you now have a way of calculating
a similarity score between two strings. Use this to calculate the score between the query word
and each word from the file in a loop. Keep track of the best score (and the word that has the
best score!) for reporting at the end.
• We suggest that you print the best matching string and the corresponding similarity score as you
iterate through the dictionary during development. This can help you with testing your program.
Auto-checker and Testing
This assignment makes use of the School’s automated checker ‘stacscheck’. You should therefore ensure
that your program can be tested using the auto-checker. It should help you see how well your program
performs on the tests we have made public and will hopefully give you an insight into any issues prior to
submission. The automated checking system is simple to run from the command line in your CS1003-P1
directory:
Make sure to type the command exactly – occasionally copying and pasting from the PDF specification
will not work correctly. If you are struggling to get it working, ask a demonstrator.
The automated checking system will only check the basic operation of your program. It is up to you
to provide evidence that you have thoroughly tested your program.
Submission
Report
Your report must be structured as follows:
• Overview: Give a short overview of the practical: what were you asked to do, and what did you
achieve? Clearly list which parts you have completed, and to what extent.
• Design: Describe the design of your program. Justify the decisions you made. In particular,
describe the classes you chose, the methods they contain, a brief explanation of why you designed
your solution in the way that you did, and any interesting features of your Java implementation.
• Testing: Describe how you tested your program. In particular, describe how you designed different
tests. Your report should include the output from a number of test runs to demonstrate that
your program satisfies the specification. Please note that simply reporting the result of stacscheck
is not enough; you should do further testing and explain in the report how you convinced yourself
that your program works correctly.
• Evaluation: Evaluate the success of your program against what you were asked to do.
• Conclusion: Conclude by summarising what you achieved, what you found difficult, and what
you would like to do given more time.
Don’t forget to add a header including your matriculation number, the name of your tutor and the
date.
3
Upload
Package up your CS1003-P1 folder and a PDF copy of your report into a zip file as in previous weeks,
and submit it using MMS, in the slot for Practical P1. After doing this, it is important to verify
that you have uploaded your submission correctly by downloading it from MMS. You
should also double check that you have uploaded your work to the correct slot. You can
then run stacscheck directly on your zip file to make sure that your code still passes stacscheck. For
example, if your file is called, save it to your Downloads directory and run:
Rubric
Marking
1-6 Very little evidence of work, software which does not compile or run, or crashes
before doing any useful work. You should seek help from your tutor immediately.
7-10 An acceptable attempt to complete the main task with serious problems such as
not compiling, or crashing often during execution.
11-13 A competent attempt to complete the main task. Serious weaknesses such as using
wrong data types, poor code design, weak testing, or a weak report riddled with
mistakes.
14-16 A good attempt to complete the main task together with good code design, testing
and a report.
17-18 Evidence of an excellent submission with no serious defects, good testing, accompanied
by an excellent report.
19-20 An exceptional submission. A correct implementation of the main task, extensive
testing, accompanied by an excellent report. In addition it goes beyond the basic
specification in a way that demonstrates use of concepts covered in class and other
concepts discovered through self-learning.
See also the standard mark descriptors in the School Student Handbook:
http://info.cs.st-andrews.ac.uk/student-handbook/learning-teaching/feedback.html#
Mark_Descriptors
Lateness penalty
The standard penalty for late submission applies (Scheme B: 1 mark per 8 hour period, or part
thereof):
http://info.cs.st-andrews.ac.uk/student-handbook/learning-teaching/assessment.html#
lateness-penalties
Good academic practice
The University policy on Good Academic Practice applies:
https://www.st-andrews.ac.uk/students/rules/academicpractice/
Going Further
Here are some additional questions and pointers for the interested student.
4
• Character bigrams are a special case. If you are interested look into character n-grams which are
sets of n-characters where n is not necessarily 2 (as in bigrams).
• N-grams can be constructed at the word level instead of at the character level. Look into applications
of word-level n-grams and think about the use cases of character-level vs word-level
n-grams.
• There are many other string similarity methods! See https://en.wikipedia.org/wiki/String_
metric as a starting point.
• Think about use cases for string similarity methods.
5
软件开发、广告设计客服
QQ:99515681
邮箱:99515681@qq.com
工作时间:8:00-23:00
微信:codinghelp
热点项目
更多
cis432代做、代写python/java程...
2024-05-04
eeen3007j代写、c++程序设计代...
2024-05-04
代写data程序、代做c/c++, jav...
2024-05-04
comp2006代做、代写c++程序语言
2024-05-04
comp26020代做、java/c++设计编...
2024-05-04
csci251 advanced programming...
2024-05-03
cs 6290: high-performance co...
2024-05-03
assignment 2: executing and ...
2024-05-03
ecse427/comp310 programmin...
2024-05-03
cs 452 (fall 22): operating...
2024-05-03
comp9414 23t2 assignment 2 ...
2024-05-03
dpst1091 23t1 assignment 2 ...
2024-05-03
program代做、代写python设计编...
2024-05-03
热点标签
finm8007
comp2006
comp26020
comp1721
eeen3007j
cis432
csci251
comp5125m
com398sust
32022
mth6158
comp328
finn41615
2024
mec302
mgmt3004
mgt7158
com160
as.640.440
econ3016
finm7405
econ7021
fin600
infs4205/7205
mktg2510-
f27sb
csse2310/csse7231
rv32i
eecs 113
comp1117b
cs 412
comp 315
econ7300
comp2017
ecs 116
fit5046
com6511
comp30024
acs341
econ1020
isys3014
acc408
comp1047
csc 256
cs 6347
finm7008
comp34212
csmde21
estr2520
comp285/comp220
mds5130/iba6205
finc6010
is3s665
busi2194
125.785
iom209
msin0041
econ339
cmt218
mast10007
comp5349
ecx2953/ecx5953
bios706
comp3310
mth6150
comp30027
comp20005
eec286
busi2211
bff2401
fnce90046
visu2001
mang6554
finc6001
125785
data423-24s1
engi 1331
fint2100
(520|600).666
can202
cs 61b
mast20029
info20003
stat512
econ3208
cmpsc311
engg1340
ecmt1010
fit5216
basc0003
ee3121
acct2002
comp5313
busi2131
ise529
elec372/472
csit940/csit440
cenv6141
comp3027/comp3927
ftec5580
comp1433
msci223
mark203
en3098
eden1000
ece6483
econ4410
mats16302
cs 6476
com6521
comp222
comp3211
comp10002
csc1002
chc6186
cs 161
comp27112
comp282
swen20003
comm1190
elec9764
acfi3308
acct7101
fin6035
comp2048
geog0163
comp2013
coen 146
dts101tc
sehh2042
comp30023
comp4880/8880
cs 455
07
stat0045.
fil-30023
celen085
psyc40005
math40082
are271
comp9311
ee5311
imse2113
comp 2322
acct2102
fnd109
int102
is3s664
is6153
data4000
accfin5034
fit5212
cs536-s24
fit5225
ecos3006
mes202tc
finc5001
stat3061
csc171
cs1b
7ssmm712
bu.450.760
cs170
comp3411
swen90004
cpt206
comp5313/comp4313—large
bl5611
kxo206
comp532
elec207
kxo151
cs 2820
cpt108
math2319
dts204tc
qm222
comp2511
ccs599
infs1001
mat2355
eeee4123
25721
ifn647
pols0010
hpm 573
qbus6860
comp9417
csci 1100
stat0023
cse340
comp2003j
cs 2550
cs360
fin 3080
ierg 4080
cs6238
cit 594
finm7406
hw6
联系我们
- QQ: 9951568
© 2021
www.rj363.com
软件定制开发网!