  • 2021-08-24 Data Structures: #4 Arrays vs Linked Lists. In our previous lesson, we introduced you to the linked list data structure and we saw how linked lists solve some of the problems that we have with arrays. So now the obvious question would be which one is better, an array or a linked list. Well, there is no such thing as one data structure being better than another data structure. One data structure may be really good for one kind of requirement, while another data structure can be really good for another kind of requirement. So, it all depends upon factors like what is the most frequent operation that you want to perform with the data structure or what is the size of the data, and there can be other factors as well. So, in this lesson, we will compare these two data structures based on some parameters, based on the cost of operations that we have with these data structures. So, all in all we will comparatively study the advantages and disadvantages and try to understand in which scenario we should use an array and in which scenario we should use a linked list. So, I will draw two columns here, one for array and another for linked list, and the first parameter that we want to talk about is the cost of accessing an element. Irrespective of the size of an array, it takes constant time to access an element in the array. This is because an array is stored as one contiguous block of memory. So, if we know the starting address or the base address of this block of memory, we can calculate the address of any element. Let us say what we have here is an integer array and the base address is 200. The first byte in this array is at address 200. Then let's say if we want to calculate the address of 
element at index i, then it will be equal to 200 plus i times the size of an integer in bytes. So, the size of an integer in bytes is typically 4 bytes. So, it will be 200 + 4*i. So, if the 0th element is at address 200 and we want to calculate the address for the element at index 6, it will be 200 plus 6 times 4, which will be equal to 224. So, knowing the address of any element in an array is just this calculation for our application. In terms of big-oh notation, constant time is also called O(1). So, accessing an element in an array is O(1) in terms of time complexity. If you are not aware of big-oh notation, check the description of this video for a tutorial on time complexity analysis. Now, in a linked list, data is not stored in a contiguous block of memory. So, if we have a linked list something like this, let's say we have a linked list of integers here, then we have multiple blocks of memory at different addresses. Each block in the linked list is called a node and each node has two fields, one to store the data and one to store the address of the next node. So, we call the second field the link field. The only information that we keep with us about a linked list is the address of the first node, which is also called the head node. And this is what we keep passing to all the functions also, the address of the head node. To access an element in the linked list at a particular position, we first need to start at the head node or the first node, then we go to the second node and see the address of the third node. In the worst case, to access the last element in the list, we will be 
traversing all the elements in the list. In the average case, we may be accessing the middle element. So, if n is the size of the linked list, n is the number of elements in the linked list, then we will traverse n/2 elements. So, the time taken in the average case also is proportional to the number of elements in the linked list. So, we can say that the time complexity in the average case is O(n). So, on this parameter, cost of accessing an element, array scores heavily over linked list. So if you have a requirement where you want to access elements in the list all the time, then definitely array is a better choice. Now, the second parameter that we want to talk about is memory requirement or memory usage. With an array, we need to know the size of the array before creating it, because an array is created as one contiguous block of memory. So, an array is of fixed size. What we typically do is create a large enough array, and some part of the array stores our list and some part of the array is vacant or empty so that we can add more elements to the list. For example, we have an array of 7 integers here and we have only 3 integers in the list. The remaining 4 positions are unused. There would be some garbage value there. With a linked list, let's say we have this linked list of integers, there is no unused memory. We ask memory for one node at a time, so we do not keep any reserved space. But we use extra memory for pointer variables, and this extra memory requirement for a pointer variable in a linked list cannot be ignored. In a typical architecture, let's say an 
integer is stored in 4 bytes and a pointer variable also takes 4 bytes. So, if you see, the memory requirement for this array of 7 integers is 28 bytes. And the memory requirement for this linked list would be 8*3, where 8 is the size of each node, 4 bytes for the integer and 4 bytes for the pointer variable. So, this is 24 bytes. If we add one more element to the list in the array, we will just use one more position, while in the linked list we will create one more node, and that will take another 8 bytes, so this will be 32 bytes. A linked list would fetch us a lot of advantage if the data part is large in size. So, in this case, we had a linked list of integers, so the integer is only 4 bytes. What if we had a linked list in which the data part was some complex type that took 16 bytes? So, 4 bytes for the link and 16 bytes for the data, each node would have been 20 bytes. An array of 7 elements with 16 bytes for each element would be 112 bytes. And a linked list of 4 nodes would be only 80 bytes. So, it all depends. If the data part of a list takes a lot of memory, a linked list will definitely consume a lot less memory. Otherwise, it depends on what strategy we choose to decide the size of the array, and how much of the array we keep unused at any time. Now, one more point with memory allocation: because arrays are created as one contiguous block of memory, sometimes when we may want to create a really really large array, then memory may not be available as one large block, but if we are using a linked list, memory may be available as multiple small blocks. 
So, we will have this problem of fragmentation in the memory. Sometimes, we may get many small units of memory but we may not get one large block of memory. This may be a rare phenomenon, but this is a possibility. So, this is also where linked list scores. Because arrays have fixed size, once an array gets filled and we need more memory, then there is no other option than to create a new array of larger size and copy the content from the previous array into the new array. So, this is also one cost which is not there with linked list. So, we need to keep these constraints and these requirements in mind when we want to decide on one of these data structures for our requirement. Now, the third parameter that we want to talk about is the cost of inserting an element in the list. Remember, when we are talking about arrays here, we are also talking about the possible use of an array as a dynamic list. So, there can be 3 scenarios in insertion. The first scenario will be when we want to insert an element at the beginning of the list. Let's say we want to insert number 3 at the beginning of the list. In the case of arrays, we will have to shift each element by one position towards the higher index. So, the time taken will be proportional to the size of the list. So, this will be O(n). Let's say n is the size of the list. This will be O(n) in terms of time complexity. In the case of a linked list, inserting a node at the beginning will mean only creating a new node and adjusting the head pointer and the link of this new node. So, the time taken will not depend upon the size of the list, it will be constant. So, for linked 
list, inserting an element at the beginning is O(1) in terms of time complexity. Inserting an element at the end for an array, let's say we are talking about a dynamic array, a dynamic list in which we create a new array if it gets filled. If there is space in the array, we just write to the next higher index of the list. So, it will be constant time. So, time complexity is O(1) if the array is not full. If the array is full, we will have to create a new array and copy all the previous content into the new array, which will take O(n) time where n is the size of the list. In the case of a linked list, adding an element, inserting an element at the end will mean traversing the whole list and then creating a new node and adjusting the links. So, time taken will be proportional to n. I will use this color coding for linked list. Here n is the number of elements in the list. Now, the third case will be when we want to insert in the middle of the list at some nth position or maybe some ith position. Again in the case of arrays, we will have to shift elements. For the average case, we may want to insert at the mid position in the array. So, we will have to shift n/2 elements where n is the number of elements in the list. So, the time taken is definitely proportional to n in the average case. So, complexity will be O(n). For a linked list also we will have to traverse till that position and only then can we adjust the links. Even though we will not have any shifting, we will have to traverse till that point and in the average case, time taken will be proportional to n and the time complexity will be O(n). If you see 
, deleting an element will also have these 3 scenarios and the time complexity for deleting in these 3 scenarios will also be the same. And the final point, the final parameter that I want to talk about is which one is easy to use and implement. An array definitely is a lot easier to use. Linked list implementation, especially in C or C++, is more prone to errors like segmentation faults and memory leaks. It takes good care to work with linked lists. So, this was arrays vs linked lists. In our next lesson, we will implement a linked list in C or C++. We will get our hands dirty with some real code. So this is it for this lesson. Thanks for watching!
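
The arithmetic and the costs from the arrays vs linked lists lesson can be sketched in a few lines of Python. This is only an illustration under the lesson's own assumptions (4-byte integers, 4-byte pointers, a base address of 200); the names element_address, array_bytes, linked_list_bytes, Node and LinkedList are made up for this sketch and do not come from the video, which uses C/C++.

```python
# Illustrative sketch only: all names here are hypothetical, and the byte
# sizes are the ones assumed in the lesson, not what Python itself uses.

INT_SIZE = 4      # bytes per integer (lesson's assumption)
POINTER_SIZE = 4  # bytes per pointer (lesson's assumption)


def element_address(base, index, element_size=INT_SIZE):
    """Array access is O(1): one multiplication and one addition."""
    return base + index * element_size


def array_bytes(capacity, element_size=INT_SIZE):
    """An array reserves memory for its whole capacity, used or not."""
    return capacity * element_size


def linked_list_bytes(count, element_size=INT_SIZE):
    """A linked list pays for data plus one pointer, per node actually stored."""
    return count * (element_size + POINTER_SIZE)


class Node:
    def __init__(self, data, next_node=None):
        self.data = data       # data field
        self.next = next_node  # link field: reference to the next node


class LinkedList:
    def __init__(self):
        self.head = None  # the only thing we keep: the head node

    def insert_at_head(self, data):
        """O(1): create a node and adjust the head and the new node's link."""
        self.head = Node(data, self.head)

    def insert_at_end(self, data):
        """O(n): traverse the whole list, then adjust the last link."""
        new_node = Node(data)
        if self.head is None:
            self.head = new_node
            return
        current = self.head
        while current.next is not None:
            current = current.next
        current.next = new_node

    def get(self, index):
        """O(n): follow the links from the head, one node at a time."""
        if index < 0:
            raise IndexError("index out of range")
        current = self.head
        for _ in range(index):
            if current is None:
                break
            current = current.next
        if current is None:
            raise IndexError("index out of range")
        return current.data


if __name__ == "__main__":
    print(element_address(200, 6))                       # 200 + 6*4
    print(array_bytes(7), linked_list_bytes(3))          # array of 7 vs list of 3
    print(array_bytes(7, 16), linked_list_bytes(4, 16))  # 16-byte data part

    numbers = LinkedList()
    numbers.insert_at_end(5)
    numbers.insert_at_end(7)
    numbers.insert_at_head(3)  # no shifting of existing elements needed
    print([numbers.get(i) for i in range(3)])
```

Keeping a separate tail reference would make insertion at the end O(1) as well; the sketch leaves it out because the lesson only keeps the address of the head node.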
  • 2021-08-24 "Machine Learning with Python" #20: Support Vector Machine Overview and Application. What is going on everybody? Welcome to another machine learning with Python tutorial. In this tutorial, we're gonna be talking about another supervised machine learning classifier. And that is the support vector machine, or SVM. Right out of the gate, you should think about what a support vector machine is. What does this even mean? Well first of all, machine probably just has some sort of relation to like a system or something like this. But then we have support what? Vector. So that should tell us we're gonna be dealing with vectors and in vector space. And then we see we have support. We're not really sure what the heck that's supposed to mean. But generally, we're gonna probably guess that it goes together. So we've got support vectors. We're not really sure what those are yet, but we will figure it out. So we've got a support vector machine. We are looking probably for some support vectors, and we're recognizing that we are working in vector space. Okay. 
So the support vector machine was created by Vladimir Vapnik back in the 60s actually, but went largely ignored and overlooked until the 90s. Vapnik was in the USSR at that time, and then in the 90s he got moved over to the United States, working with Bell Labs. And in the 90s this is when it was shown that the support vector machine was better than the neural network at like handwritten number recognition, stuff like this. And then basically the SVM became the most popular machine learning algorithm for a while. And it's still one of the most popular machine learning algorithms. So first, let's just talk about the really high level intuition of the support vector machine. And then we'll show an actual example of us using it. And then we'll break it down to explain how it does what it does. It turns out the support vector machine, I would argue, is probably one of the most complex machine learning algorithms. So again, we're in vector space. 
So let's draw some vector space. And then the support vector machine is what's called a binary classifier. So it separates only into two groups at a time, but it's important to not confuse this with the idea that a support vector machine can only classify into two groups. It just will separate two groups, or basically separate groups really one at a time, one from the rest. I'll explain that more when we dive in, but for now just understand it's a binary classifier, and what we want to do is to separate into two groups. And these two groups are denoted as positive or negative. Generally they're gonna be a positive one or a negative one. So let's say you've got two groups of data. So you've got some positives. So we'll draw positive, positive, positive. And then you're gonna have some negatives. So I'll draw a negative, negative and a negative. And the objective of the SVM is to find the best separating hyperplane, which is also referred to as your decision boundary. So the best separating hyperplane or decision boundary that will separate these data. So for now, it's gonna look a lot like a line, and I think that's why a lot of people can get confused, because the SVMs are always depicted or usually depicted in two dimensional space, at most three dimensional space. So the SVM is gonna go. It's gonna run and it's gonna say, "I found the best separating hyperplane." And it is like this. OK. 
And it would be a straight line if I could draw right. Anyways, so it's saying that's the best separating hyperplane, and it's gonna say that is the best because the distance between that hyperplane and the associated data that it's separating is the greatest. So if you were to take a perpendicular bisector of the hyperplane, and draw that to the closest data points, let's say you'd have like this, this and this. OK. That distance is the greatest distance that you could come up with, with a separating hyperplane. So for example another separating hyperplane might be this. Right? That's a separating hyperplane. But the distance, again the perpendicular distance between those is this, like... oops, this. Right? Much smaller than our greenish-yellow counterparts. OK. So the support vector machine says that yellow greenish line is the best separating hyperplane for our data. Now how we get to that is another story. But for now, let's just remember that is the... let's say that's the best separating hyperplane. Once you acquire the best separating hyperplane, you can now take in unknown data. So let's say you have an unknown data point. So unknown u. Because that u rests on the right hand side of that separating hyperplane, we're gonna say it's not unknown. It's a positive sample. Conversely if you had an unknown that was down here, it's now not unknown. 
It's on the left hand side of this hyperplane. Therefore, it's actually a negative sample. So that is the intuition of the support vector machine, we just need to find the best separating hyperplane. And then from there we can classify new datapoints. Again, I would just reiterate, it's a binary classifier and natively we want it to be used against linear data. So, for example, what is the best separating hyperplane? Well, turns out we can't do that. Where is that hyperplane? It doesn't exist. So, in theory you can't have one, but wouldn't it be nice if we could do like... like that? Right? Everything that is on the left side of this line is a minus, and everything that's on the right side is a plus. So I'll leave it to you to determine whether or not that's even possible. And we will answer that question. I promise. But for now, let's actually apply the support vector machine in a more simple example. We're gonna be using the same example that we did before with the breast cancer dataset. So let's go ahead and hop over there. OK. So what we're gonna do now is, we're gonna take the same code we used back in part 14 with the K nearest neighbors algorithm. If you don't have that, go to the text version of this tutorial on pythonprogramming.net, and you can just copy and paste. It'll be right at the top. So here is the code. This is the code we used with the K nearest neighbors, now used with the support vector machine. Super simple. 
We're gonna import support vector machine from sklearn (from sklearn import svm). And then we're gonna come down to the classifier. We're not gonna be using K nearest neighbors. We're gonna use svm.SVC, for support vector classifier. And then we're gonna use empty parameters there. There are a lot of parameters that we can modify. But in order to know what to modify, or what it's doing, we need to break the algorithm down first. So for now we just use the defaults. And let's go ahead and run it. That's all we have to do. Super simple to change algorithms. So there we go. We ran it. 92% accuracy. This is the classification that we just kind of made up. And it did it right. Let's run it one more time. 92 seems kind of low. There we go, 96. OK. So, that is the support vector machine in practice. Again, all we did was just to take the data, replace some of the data with some outliers, which later... now you see how differently the support vector machine handles the outliers as opposed to the K nearest neighbors. What if we did not drop the id column? So let's do that. And then we're just not gonna make a prediction. So if we don't drop the id column, we're relegated back to about 65% accuracy. So it does better than the K nearest neighbors did, when we left in the id column. But it still has a huge impact. So id was just kind of a useless data column. Oh, that time we got 99% accuracy. 
Nice. Anyway, so... So that's... Anyway, so we dropped the id column that is useless. And our X's are just everything except for the class, or features or everything but the actual label itself. The y's are our label. We do the train/test split, call the classifier, train the classifier, and then test it. Simple enough. So that is the theory and application of the support vector machine. Now we're gonna break it down. So if you have any questions, comments, concerns up to this point, feel free to leave them below, otherwise stay tuned for the next video. Thanks for watching.
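
The "best separating hyperplane" intuition from the support vector machine video can be made concrete with a small sketch. It does not train an SVM and is not how sklearn's svm.SVC finds its answer; it only compares two hand-picked candidate lines by their margin (the perpendicular distance to the closest data point) and classifies unknown points by which side they fall on. The data points, the two candidates and every function name below are invented for illustration.

```python
import math

# Illustrative sketch only: hypothetical data and helper names, and no
# optimisation. A hyperplane is written as w . x + b = 0.


def signed_distance(w, b, x):
    """Signed perpendicular distance from point x to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def separates(w, b, positives, negatives):
    """True if all positives are on the + side and all negatives on the - side."""
    return (all(signed_distance(w, b, p) > 0 for p in positives)
            and all(signed_distance(w, b, n) < 0 for n in negatives))


def margin(w, b, points):
    """Distance from the hyperplane to the closest data point."""
    return min(abs(signed_distance(w, b, p)) for p in points)


def classify(w, b, x):
    """Label an unknown point by the side of the hyperplane it rests on."""
    return "+" if signed_distance(w, b, x) > 0 else "-"


positives = [(4.0, 4.0), (5.0, 3.0), (5.0, 5.0)]
negatives = [(1.0, 1.0), (2.0, 0.0), (0.0, 2.0)]
points = positives + negatives

candidate_a = ((1.0, 1.0), -5.0)  # the line x + y = 5
candidate_b = ((1.0, 0.0), -3.5)  # the line x = 3.5

if __name__ == "__main__":
    for name, (w, b) in (("a", candidate_a), ("b", candidate_b)):
        print(name, separates(w, b, positives, negatives), margin(w, b, points))

    # Both candidates separate the data; keep the one with the larger margin.
    best_w, best_b = max((candidate_a, candidate_b),
                         key=lambda wb: margin(wb[0], wb[1], points))
    print(classify(best_w, best_b, (4.5, 2.0)))  # an unknown point u
    print(classify(best_w, best_b, (1.0, 2.5)))  # another unknown point
```

In the video itself the search over all possible hyperplanes is left to svm.SVC() with default parameters, followed by fit on the training split and score on the test split.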
  • 2021-08-24 14/44 Demo: Numbers. Numbers in Code. So I've got Visual Studio open. Let's go and take a look at some code. So I'm going to create my variable pi: 3.14159. I know there are more digits than that, but I've gotta stop somewhere. And I'm storing that value in a variable and then I can just print that value up on the screen, using Ctrl+S to save here. I love using keyboard shortcuts. And we're just trying to get that message to go away. There we go. And now that I'm here, I run my code and you will see that it displays. There's my number 3.14159 on the screen. So again, just showing yes, of course I can store numbers in variables as well. And of course, once I'm storing numbers in variables, let me just comment out this code. Ctrl+K, C, by the way, is the keyboard shortcut for commenting out code. And now I can store a couple of numbers. First number equals five. Second number equals six. And then I can add those numbers together, so I can do a print statement of first number plus second number. Because of course the main reason we want to store numbers is because we're going to need to do math with them. So now I go ahead and I run this code using the up arrow to recall my last command, and it comes back and returns 11. You can change that to a multiply and then you see 30. I can change that to a double asterisk, which means to the power of. Save that code. Rerun. And apparently five to the power of six is 15,625. I admit that's beyond the capabilities of what I can do in my head. But that's why I like computers, they'll figure that stuff 
out for me. Computers are good at math. So this is a fairly common situation. But as I mentioned, one of the problems you will run into is, let's say I start asking the user to enter the values for the first and second number. So I say, "Hey..." Do an input statement and ask the user, "please enter a number." And I'm going to take the number they entered and store it in that variable first_num. And then, I say, "please enter another number." And I decide to store that in the variable second_num. So all I've done is taking the value for first number and second number and taking it from the input statement. The rest of my code is the same. And yet now when I run, I'm gonna change this to... actually, this will probably give me an error. We'll try this and it goes "enter number". I enter five and number six and it goes: unsupported operand type for the asterisk asterisk, the power operator. It's like it's got a string and a string. Well, what's "Susan" to the power of [inaudible]? So the problem is it has two strings. I can demonstrate this by changing the power-of symbols to a plus sign. Because now you'll see the two strings concatenated together. So now if I run, and I'll just move this to the top of the screen so you can see it better, and I enter the number five and the number six, and you see it come back with 56. So this is the problem I was talking about where the numbers are being treated as strings, because the input function always returns a string, even if that string contains a number. A little 
confusing I know. So let's go in and fix that. So what we have to do is we have to go in here and say treat this number as an integer, and treat this number as an integer, or take this variable and convert it to an integer. Now, when we enter five and six, you can see it actually correctly does the math and returns 11. If we change that to a float rather than an integer, the only difference is an integer is for whole numbers, a float can contain decimal numbers. So the only difference is going to be that when I enter the numbers, it shows 11.0. Just a way of me recognizing it's a number that can contain decimals, a floating point number. So here I've used datatype conversion to take a number stored in a string and treat it as a number. But there was one other scenario where we had to do datatype conversion as well. So let's take a look at that one. So I'm going to comment out this code, Ctrl+K, C to do that. And what I'm going to do is I'm going to say days in February. It'd help if I could type "February". If I could spell February as well, apparently. Equals 28, and then I say print days in February, there we go, plus, and I want to concatenate that with "total days in February". So sometimes we want to display a number inside a string on the screen. So now when I go and I run this, go down here, just clear the screen again with cls, and we see again it blows up. Take a look at that error message, it says unsupported operand type. The plus sign is the operator and you're trying to give it an integer and 
a string. So it's confused. Python doesn't know: is this two numbers I should add up with math, or two strings I should concatenate together? Because it has one number and one string. So we have to use the str function to convert days in February into a string datatype. So now, when I go and run it, it comes back perfectly happily and says 28 total days in February. So now you have seen how to use numbers in your code, but do be prepared to start doing battle with some data type conversions. str, int and float could become your new best friends.
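
The type conversions walked through in the numbers demo can be reproduced without typing anything at a prompt. The sketch below assumes the user entered 5 and 6; since input() always returns a string, the two values are written as string literals here instead of calling input(). The variable names follow the demo where the transcript gives them (first_num, second_num); days_in_feb and the result names are guesses for illustration.

```python
# Stand-ins for what input() would hand back if the user typed 5 and 6.
# input() always returns a string, even when the text looks like a number.
first_num = "5"
second_num = "6"

# With two strings, + concatenates instead of adding.
concatenated = first_num + second_num
print(concatenated)

# ** is not defined for two strings, so Python raises a TypeError.
try:
    first_num ** second_num
    power_failed = False
except TypeError as error:
    power_failed = True
    print("TypeError:", error)

# Converting with int() or float() makes the arithmetic work.
int_sum = int(first_num) + int(second_num)
float_sum = float(first_num) + float(second_num)
int_power = int(first_num) ** int(second_num)
print(int_sum, float_sum, int_power)

# Going the other way: a number must become a string before concatenation.
days_in_feb = 28
message = str(days_in_feb) + " total days in February"
print(message)
```

In a real script the first two assignments would be calls such as first_num = input("please enter a number: "), and int() would raise a ValueError if the user typed something that is not a whole number.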
  • 2021-08-24 17/44 Error Handling. >> Now, if you're anything like me, and I know I am, the first time that I get in to write some code, sometimes things go wrong. Now, maybe whatever it is that you're doing works flawlessly, but in the real world, things will go sideways, and sometimes it will be because of mistakes that you made, things that you have control over, and sometimes it will be because something has changed, like a database has gone down, a server name has been changed, etc., where now my application isn't going to be able to accommodate that on the fly, and needs to, well, potentially crash. So let's talk about how we can deal with those different types of errors. But before we talk about how to deal with them, we should probably start defining a couple of different terms. I want to try and make a very clear distinction between error handling and debugging because these are two very different things. Sometimes people will use them to mean something synonymous, and they're really not. Error handling is when I have a problem with my code that's running, and it's not something that I'm going to be able to predict when I push my code out to production. The most common example of this would be a permissions issue, a database changing, a server being down, etc. Those things that happen in the wild, those things that happen in the real world, those are things that I do not have control over. Contrast that with debugging. Debugging is when I know that there's a problem with my code. That it's potentially giving me a wrong answer, it's potentially crashing, and I know that there's something that I've 
done incorrectly that's causing my code to go sideways. That's debugging. So when we get in and we take a look at things like try/except/finally, which we'll talk about in a minute, those are not useful tools for handling debugging. Debugging, again: I've got a problem in my code, I'm trying to fix that problem. try/except/finally is where there's something that's happened external to my application, that I couldn't predict might go sideways, and I want to be able to exit gracefully. So we want to make sure that there's a separation between those two. Now when we're talking about errors, things that can go wrong inside of our code, these fall under three different categories: syntax errors, runtime errors, and logic errors. Let's start from the top, a syntax error. With a syntax error, the code is not going to run at all. Believe it or not, if you have to choose between the errors, this is the type of error that you want. This is typically going to be the easiest to try and track down because of the fact that again, your code is just going to fail right then and there, and the error message that you're going to get will typically point you right to where the problem is. So if we take a look at our output, you'll notice that it's actually telling me right here, let me go ahead and circle that. It's telling me right there the line of code. So if we take a look at our little block of code, and we'll talk a bit more about if statements later on, what we're actually missing right out here after that "y" is a colon. So that's 
why it’s giving us a syntax error, because we’re missing a key piece there. Now, one really nice thing about Python is that, because it’s not using curly braces, you won’t have to worry about tracking down a curly brace when trying to figure out what’s wrong with your code. If you’ve done something like Java or JavaScript, you’ll know that can sometimes cause some problems. So syntax errors are good errors. We want those. Now, runtime errors are the second best type of error. That’s where your code is running, something has gone wrong, and it’s going to blow up. Now, in my case here, the problem that I’m going to run into is that I’m trying to do that classic divide by zero. When we hit that line in our code, it’s going to give me that error message that you see right down there at the bottom, “division by zero”. It’s also very handily going to point me at the line number where the problem occurred. Runtime errors are actually pretty decent errors, because they will give me a little bit of information right up front to let me know where to start when trying to debug my code. Now, when you get a runtime error, the basic strategy here is to start from the line that it’s given you, and then work your way up to see where the error occurred. Now, one important tip that I want to give you here. When you’re dealing with a runtime error, I’m going to guarantee you the problem is somewhere inside of your code. One of the most common mistakes that I see new developers make is they’ll go in, they’ll try something, it’ll go wrong, and they’ll make the assumption that
there’s a problem inside of the framework that they’re using, inside of the runtime that they’re using, etc. While it’s technically a possibility, chances are it’s not going to be there. So much so that you probably have better luck hitting the lottery than you do finding an error inside of a framework. Again, I don’t want to say that this doesn’t happen, but it’s extremely rare. I can pretty much guarantee you that if you’re getting a runtime error, assuming that it’s not something like a server being down, it’s an error inside of the code that you’ve written. Start there, finish there, that’s where the problem’s going to be. Let’s close out our conversation about try/except/finally with a couple of last little odds and ends. First up, you’ll have noticed inside my demo that I had a try/except/else; that also works. In that case the else is going to be like what I have up here, which is that blank except, where I’m just not looking for a particular parameter. Either one will work just fine. For me, I kind of like that except, just because it’s a little bit more consistent with a lot of other programming languages. But again, feel free to use whatever it is that you might like. Now, some final words here. I know that first bullet point might be a little bit confusing, but hear me out: try/except/finally is not used to find bugs. Let’s again identify what a bug is. A bug is where I have something wrong in my code, where I know that this code will not run if it follows this particular path, or does
this particular thing, and I have control over that. If it’s something where a server might be down or I’m getting input from a user, where I don’t necessarily always have control over those types of things, now try/except/finally is perfect. But if I know there’s a problem in my code, that’s not where I’m going to put in that try/except/finally. It’s also worth highlighting the fact that you don’t have to catch all errors. If you’re not going to do anything with it, if you’re not going to log it, if you’re not going to gracefully exit, then just leave it alone. I will always remember when I was working with a framework that another developer had written. What they had done is they had programmed it such that if the database threw an error, it would catch it, and then give me back just some generic error message, which made debugging my application impossible, because I could never see what the original error message was. So if you’re not going to do anything with it, just let it go. You might be thinking, “Well, wait a minute Christopher, that might crash my application.” Well, you know what? Sometimes that’s exactly what we want to have happen. If our application winds up in a state where it’s just flat-out not going to work, that’s okay. Let it crash. That’s exactly what crashes are there for. That’s actually sometimes just fine.
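The distinction drawn above, handling failures that come from outside the program while leaving everything else alone, can be sketched roughly as follows. This is only an illustration and not the demo code from the video: the function name `read_config` and the file paths are hypothetical, and reading a file stands in for "something external that might go wrong".

```python
import logging


def read_config(path):
    """Return the text of a config file, or None if it cannot be read.

    Hypothetical helper: the file may be missing or unreadable for
    reasons outside the program's control, so those cases are handled.
    """
    handle = None
    try:
        handle = open(path, encoding="utf-8")
        return handle.read()
    except FileNotFoundError as error:
        # We do something useful with the error (log it), so catching is justified.
        logging.warning("Config file is missing: %s", error)
        return None
    except PermissionError as error:
        logging.warning("No permission to read config file: %s", error)
        return None
    finally:
        # Runs whether or not an exception occurred.
        if handle is not None:
            handle.close()


settings = read_config("no_such_dir_for_sketch/missing.cfg")
if settings is None:
    print("Falling back to default settings")
```

Any other exception is deliberately not caught, which follows the advice above: if the code is not going to log it or exit gracefully, it lets the original error surface.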
All right. The last type of error that we want to highlight is a logic error. A logic error is when our code compiles properly, if you will, so there are no syntax errors. It doesn’t give us an error message, so there are no runtime errors. It just doesn’t give us the response that we’re looking for. So in my case, what you’ll notice is that I’ve got a couple of little variables here. Let’s go ahead and make sure that I grab my highlighter. There we go. I’ve got my y being 206, no, my x being 206, and I’ve got my y being 42. Then what I’m going to do is I’m going to say, “Hey, if x is greater than y, then let’s print out ‘x is greater than y.’” Now I go ahead and I run my application, and I don’t get a response. If you were listening carefully, you’ll probably have noticed that I actually said the incorrect code here. What my code is actually saying is x less than y.
Right here. That’s what I actually wrote inside of my code. What I had meant to write, however, is if x is greater than y. This is without a doubt the most common error, or the most common mistake, that I make: I will frequently reverse my Boolean. Little side note here. I would definitely recommend taking a look at unit testing and test-driven development. They’re concepts that are beyond the scope of this course, but what they’re about is writing little automated tests to try and catch mistakes in your code, and they’re really very much designed to catch these types of mistakes. I’m a huge fan of unit tests. Definitely recommend taking a look at unit tests inside of Python. In any event, logic errors, again, are the types of errors where everything runs, but we just don’t get the right response. So how do we then start tracking all of that down?
Well, if you do wind up getting something going sideways on you, and it potentially throws an error message, take a look at the stack trace. The stack trace is going to show you all of the different calls that have been made: the earliest calls are at the top, the most recent ones are down at the bottom. That’s where your code is going to be. Look for line numbers; that will give you a perfect place to start. Now, to try and find your mistake, reread your code, check the documentation, and, as always, search the Internet. Stack Overflow is your friend. Maybe just take a break. I can’t tell you the number of times where I’ve been battling a bug in my code, and I just simply took a walk. Or I went home for the day, had dinner, slept, woke up the next morning, realized “that’s where my problem is,” sat down at the computer, went in, and fixed it. Sometimes you just simply need to walk away. The other big thing that sometimes you need is just another set of eyes. So if you work with somebody who does Python, have them take a look at your code. Sometimes that fresh viewpoint will be exactly what you need to try and debug your code. That is how we can deal with the different types of errors inside of Python, and when and how to use that try/except/finally.
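The reversed comparison and the unit-testing side note from this lesson can be put together in a small sketch. It assumes the lesson’s values (x = 206, y = 42); the function name `describe` and the test class are hypothetical, chosen only for illustration, and the standard-library `unittest` module is used.

```python
import unittest


def describe(x, y):
    """Return a message when x is greater than y, otherwise None."""
    # The buggy version in the lesson compared with `x < y` here: no syntax
    # error, no runtime error, just no output for x = 206 and y = 42.
    if x > y:
        return str(x) + " is greater than " + str(y)
    return None


class DescribeTest(unittest.TestCase):
    """Small automated checks of the kind that catch a reversed comparison."""

    def test_reports_when_x_is_greater(self):
        self.assertEqual(describe(206, 42), "206 is greater than 42")

    def test_silent_when_x_is_smaller(self):
        self.assertIsNone(describe(42, 206))
```

Saved to a file, the tests could be run with `python -m unittest <file name>`. With the comparison reversed, the first test fails and points straight at the logic error.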
  • 2021-08-24 Unity 5 Official Tutorial #4. So let’s set up our play field. The play field for our game will be very simple. We will place walls around the edges to keep our player game object from falling off, and we will create and place a set of collectable objects for our player to pick up. First let’s create a new game object, and rename it “Walls”. This will be an organizing parent game object for our Wall objects. Let’s look at the organization of our Hierarchy. Organization in our projects and our Hierarchy is very important. We need to understand that organization at a glance. We organize our projects using folders or directories created by our operating system. We add these to our project using the Project view’s Create menu. We organize our Hierarchies by using GameObjects. In our Scene, GameObjects can hold other GameObjects. Don’t be afraid to use an empty GameObject as a directory or folder in the Hierarchy. Reset this GameObject to origin, and this step is important. We want the Transforms of all of our container GameObjects to be at origin before we use them. Now we will build our walls. Let’s start by creating a new Cube to be our first wall. Rename this “West Wall”. Reset this game object to origin. Now parent West Wall to our Walls game object. Let’s focus our Scene view camera on our Wall object. We can do this by pressing the F key while the cursor is over the Scene view, or by selecting Edit > Frame Selected. We need to change the size of the Cube to fit one side of our play area. Change the Cube’s Transform Scale of X, Y and
Z to 0.5 for thin, 2 for tall, and 20.5 for long. Now we can simply push the wall into place using the Translate tool, or we could enter a value into the Transform component. In this case we can set the Transform’s Position X value to -10. This places the wall neatly at the edge of our play area. To create the next wall we could start with another new Cube, but then we’d have to rescale this new Cube before we placed it. Our current West Wall is already the perfect size. So let’s duplicate the West Wall game object. Let’s rename it “East Wall”. To place the wall simply remove the negative sign, and it pops into place on the east side of our game area. Now let’s duplicate the East Wall and call it “North Wall”. Reset the X position so the North Wall is in the center of the play area. We now have two choices: we can rotate the wall by 90 degrees, or, as this is a cuboid, we can rescale the wall to 20.5 in the X and 0.5 in the Z. Now it’s scaled correctly for its orientation as the “North Wall”. We can drag the wall into place, or we can simply use the value of 10 in the Transform’s Position Z field to place it. Duplicate North Wall and call it “South Wall”, and -10 in the Z-axis pops it into place. Enter Play mode and test. Fantastic, the walls work fine. Remember to test early and test often; don’t wait to test. Let’s exit Play mode. Let’s highlight the Player GameObject and set the Editor to Local mode, and try again. Note
how in Local mode we can see the Transform rotate. Let’s exit Play mode. In the next episode we will be creating our collectable pickup objects.
  • 2021-08-24 20/44 Demo: Conditional Logic. Adding Conditions. So now let’s go take a look at some code using those conditional statements in our if statements. So if we take a look at this code here, this is the code I’m using to calculate the tax rates in Canada. As I mentioned, in Canada you don’t pay tax on an item unless it costs at least one dollar. So what I’m going to do is I’m going to ask a user how much they paid, then I’m going to convert that to a number, thinking back to what we learned about datatypes and working with numbers. We want to treat this as a number, but the input statement always returns strings. So I am just converting that price into a number here, and then I’m just going to say if that price is over a dollar, then the tax is 0.07, so seven percent, and then I’m just going to print the tax rate on the screen. So let’s see what happens when you actually try to run that code. We’re going to call Python and we’re going to call check_tax. If we pass in a price of $20, which is definitely more than one dollar, we should see a tax rate of 0.07, and sure enough it comes back: tax rate is 0.07. Whereas if I paid $0.50 for something, you’ll notice it does not come back and print anything out at all, because both of these two statements are indented. So neither of these statements is executed unless the condition is true. They have these four spaces; that’s how Python knows which lines to execute when the condition is true, that indentation. Now, we can add an else statement to this. So I might want to say
if the price is under a dollar, then let’s not charge any tax. So I’ve just added some logic here to exactly the same code, but all I’ve done is said “otherwise” with an else statement, and there’s that colon. Don’t forget the colon at the end of your statement. I’m forever forgetting that; it’s one of my most common syntax errors. Then otherwise the tax equals 0. So now if I run this code, I will just clear the screen and start off again with Python add_else.py. It asks how much I paid; if I paid $50, then the tax rate comes back as 0.07. So that’s correct. If I go ahead and I run it and I pay 0.50, it comes back and says the tax rate is 0. So now this is a little cleaner, a little more elegant, because I’ve got a tax value that’s set according to any possible input. Now, one of the other things I mentioned is that, if you wish, there is a different way you could do this. I could write this code exactly the same, but simply take the print statement out, because I always want to print the tax rate, regardless of what the tax is. The only thing that changes is what I assign the tax rate to. So in my if statement, I say if the price is over a dollar, set the tax rate to seven percent. If the price is under a dollar, set the tax to zero. Then, regardless of what the final tax rate was, print that on the screen. So by taking this out of the if statement, so that it’s not indented, that means this statement will be executed all the time, no matter what happens in the if statement. So now when I run this
code, and now I enter $50, you’ll still see exactly the same output: tax rate is seven percent. If I enter a price of 25 cents, you’ll see the tax rate comes back as zero. So the same thing is happening in my code, I’m just using a different way to achieve it. Now, there’s one other example I wanted to do, and that was showing you the case sensitivity. Right now I have a little line of code here that says, please enter the name of your home country, and if the country is Canada, then you must like hockey. Hey, I’m Hockey Geek Girl on Twitter for a reason. I am a Canadian, I love my hockey, I fit the stereotype. Otherwise we say, “Okay. You’re not from Canada.” So if I run this code, comparing_strings is the name of my file, it asks me to enter the name of my country and I enter canada. As long as I enter all lowercase, it says, “Great, so you must like hockey,” and I’m like, “Yes, you are right Python, I do.” But if I run it and I happen to enter uppercase letters, then it comes back and says I’m not from Canada. So this is a case where I have to remember that in Python, when you’re comparing two strings, they’re not equal to each other if one has uppercase letters and one has lowercase letters. So what I can do is I can take the value that was passed in and convert that to lowercase, and that returns a lowercase version of what I typed in, which will match the lowercase string canada. So now even if I type in, oops, I need to save that file, that would help. Let me correct that; I just realized I hadn’t actually hit “Save”, using Control+S to save. Now when I run it and I enter Canada, even if I enter all uppercase letters, it still comes back and recognizes that I am from Canada, so I must like hockey. So there you have it. Now let’s move on and look at some more complicated situations we can deal with in conditions.
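The final version of the tax demo, where the if/else only decides the rate and a single unindented print runs every time, might look roughly like this. It is a sketch rather than the actual check_tax or add_else.py files: the function name `tax_rate` is hypothetical, the threshold uses "at least one dollar" as stated at the start of the demo, and a fixed string stands in for the `input()` call so the example runs without prompting.

```python
def tax_rate(price):
    """Return the tax rate for a price, following the rule in the demo."""
    if price >= 1.00:
        tax = 0.07   # seven percent once the item costs at least a dollar
    else:
        tax = 0      # otherwise no tax is charged
    return tax


# input() always returns a string, so the value is converted to a number first.
# A fixed string is used here in place of input("How much did you pay? ").
entered = "20"
price = float(entered)

# Not indented under the if/else, so it runs whatever the condition was.
print("Tax rate is: " + str(tax_rate(price)))
```

Because the print sits outside the conditional, changing `entered` to "0.50" still prints a rate; only the value assigned to `tax` changes.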
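  • 2021-08-24 How Do All Your Electronic Devices Run? We usually think of hardware and software as separate things that work together to give us the computing experience we know and have come to accept. Hardware is the physical stuff you can actually touch, like a keyboard or a hard drive, while software is the series of code that lets your computer launch games, send tweets, and ultimately put things on the screen. You have probably also heard of “firmware”, a term that gets mentioned a lot. So what is firmware? Is it the gear you buy to show off the buns of steel you earned at the gym? No. Firmware is usually thought of as sitting somewhere between software and hardware. In fact, firmware is a special kind of software, but unlike your operating system or any other program, it lives on neither a hard drive nor a solid-state drive; it lives on a dedicated memory chip. Because of this, and because firmware sits very close to the motherboard, people think of it as a kind of hybrid of hardware and software. What does “close to the motherboard” actually mean? It means the code that makes up firmware talks to the hardware directly; unlike an ordinary program, it does not have to go through APIs, the operating system, and device drivers. The reason is that it provides the basic interface and control methods for the system’s hardware. For example, inside a computer there is a chip that stores the system’s UEFI or BIOS, which is a particular kind of firmware; click here to learn more. When you press the power button, the BIOS starts running. Depending on its configuration, it initializes the hardware and runs a self-test. Once the self-test is done, the BIOS enables virtualization and hands control over to a more sophisticated operating system, such as Microsoft’s or Apple’s. Even after the operating system is running, though, on older operating systems the BIOS still provides a simple, dependable interface between peripherals such as the keyboard and the system software. By contrast, other firmware plays a more active role while the system is running. A desktop monitor has to decode the digital signal coming in over DisplayPort or HDMI and process it into the picture you see, so it needs some firmware to run that process. When you open the on-screen menu to adjust brightness and so on, what you are looking at is the firmware acting as the monitor’s operating system. Even a device as simple as a TV remote needs firmware to link its buttons and turn infrared beams into commands the TV can understand. Because firmware is so important to this kind of basic interfacing, it sometimes needs to be updated to add features or fix bugs. A good example is updating a motherboard’s BIOS so that an existing socket can support newer CPUs. Since most electronic devices cannot work without their firmware, the general advice is not to update it casually unless you have a problem that you are sure an update will fix. If the update fails, for example because the power goes out partway through, the system can be permanently “bricked”. Unlike a corrupted operating system, which you can wipe and reinstall, corrupted firmware often cannot be repaired, because at that point the system cannot even understand that you want to wipe and reinstall the firmware. So, whatever you do, don’t corrupt your firmware. Some modern systems guard against this with a backup or fail-safe BIOS, but most devices have nothing of the kind. So be careful when you update firmware: make sure the power stays on, put your desktop or TV on an uninterruptible power supply, check that the firmware comes from a reliable source such as the original manufacturer, and so on. There is also firmware that cannot be updated at all, for example firmware stored in ROM or another read-only chip, which is fundamentally impossible to update, or firmware that is locked by some kind of software lock. Some devices, such as USB flash drives, never need a firmware update, and some firmware has proprietary features and is designed so that competitors cannot copy it. However, software protections on firmware can often be broken easily, for example by homebrew firmware that unlocks extra features, or by firmware that hackers use as an attack vector. Firmware is usually not encrypted at all; developers pay far more attention to the security of operating systems and applications. That makes firmware a target for hackers and intelligence agencies, because reformatting the hard drive obviously does nothing to remove a firmware-level intrusion, and such an intrusion is hard to detect. And because firmware controls hardware directly, compromised firmware can even cause real damage to the hardware. In a proof of concept from a few years ago, a researcher hacked the battery firmware of an Apple computer, causing it to overcharge and permanently damaging the machine. Let’s hope nobody figures out how to hack the pizza cutter I just bought. Picture it: a hacked pizza cutter. I don’t know what I would do if that happened.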
  • 2021-08-24 19/44 Conditional Logic. Handling Conditions. Okay. So now once we get into writing more complex code, at some point you’re going to need to be able to say, when this happens do this, when something else happens, react differently. So that’s why we need to be able to handle conditions inside of our code. Basically, you’ll need the ability to react differently and take different actions based on what’s happening. So one of the more common situations: in Canada, we have all sorts of different tax levels depending on which province you live in within the country. In the US, it depends on what state you live in, and it also depends on the price. Actually, if you’re buying fast food at a restaurant in Canada, if anything costs less than $1, you don’t pay any tax on it. So when we’re calculating tax, we actually say if the price is over a dollar or equal to a dollar, then we charge a certain amount of tax. So in Python, I can handle that by adding an if statement, and you’ll notice a bit of syntax here. You’ll see the if statement. If, that’s fairly obvious. If my price is greater than or equal to $1, then I’m going to take the following actions. Now, a couple of things to watch out for. There’s always a colon at the end of your condition, that is Python-specific syntax, and there is the indentation here. It’s not an accident that the word tax is moved over to the right here by about four spaces, and it is by four spaces, not a tab. Though if you’re using Visual Studio Code, it’ll auto correct that for you so you can get away with using a tab in Visual Studio
Code. But try to get in the habit of making four spaces, and anything that is four spaces in will only be executed if the price is greater than or equal to $1. Now, I’m using a greater than or equal to symbol here. There are different symbols we use depending on the condition we’re looking for. I might say greater than, less than, greater than or equal to, less than or equal to. The two most important ones do vary from programming language to programming language: “is equal to” would be the equal equal sign, “==”, and “not equal to” in Python is exclamation mark equal, “!=”, or bang equal, depending on the term you prefer to use. Now, we can also add a default action using an else statement. That’s a way of saying if this condition is met, set the tax to seven percent. Else, so if it’s anything else, do the following. So the rules in Canada state: if the cost is over a dollar, you pay a tax of seven percent, what we call our service tax. Otherwise, you don’t pay the tax. So I can do that with an else statement. Again, always remembering that colon. I constantly forget that when I’m writing my code and have to correct it after the syntax error, just as you were learning about syntax errors from Christopher. And again you have to indent, by four spaces, the lines you want executed if that happens. The indentation really does change execution. I could actually write this exact same code a different way. I basically want to say calculate the tax and then print the tax out. So here I say if the price is
over a dollar, set the tax to seven percent and then print it. Otherwise, set the tax to zero and print it. Or I could just say if the price is over a dollar, set the tax to seven percent, otherwise set it to zero, and when I’m all done evaluating the correct value of tax, then go and print the tax out. Both of these will do exactly the same thing. Which one should you use? I like the one on the right. It’s a little bit more elegant not having the print statement repeating. But if it’s more confusing for you, there’s nothing wrong with the code on the left. Now, be careful when you’re comparing strings. They’ll get you into trouble. So if you run this code, I’m just trying to see if somebody is a Canadian or not. I ask what country somebody’s from and they type in CANADA, and I say if the country is equal to (remembering that the double equal sign means “is equal to”) canada, then print, “Oh look a Canadian”, and obviously I set country to Canada. But it comes back and gives me the “No, you are not from Canada.” It did not match the country. What went wrong?
String comparisons are case sensitive. So when you’re asking whether this string is equal to that string, if one’s in uppercase letters and one’s in lowercase letters, then Python’s going to say that’s not a match. So how do I fix that? We’ve got to think back to the module we did a little while ago on string functions. There’s a function we can use that will convert a string to lowercase or to uppercase. So what we can do is take the value they give us, convert that to lowercase, and then compare that to the word canada all in lowercase letters. So now when someone types in a value, it doesn’t matter what they type in, I’ll convert it to lowercase before I do the comparison, and that will fix my error. So this is a great example of an error that can occur at runtime, and of how I can address and fix it in my code. So conditions are very important and allow our code to react to different situations. So let’s go take a look at these examples in some code before we move on to more complex types of if statements.
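The case-sensitivity problem and its fix can be sketched in a few lines. The function name `is_canadian` and the sample values are hypothetical, and a fixed string again stands in for the value a user would type at an `input()` prompt; `str.lower()` is the standard string method for converting to lowercase.

```python
def is_canadian(country):
    """Compare case-insensitively by lowercasing the input first."""
    return country.lower() == "canada"


# Without the conversion, the uppercase entry does not match.
print("CANADA" == "canada")

# Stand-in for: country = input("Please enter the name of your home country: ")
country = "CANADA"
if is_canadian(country):
    print("So you must like hockey!")
else:
    print("You are not from Canada")
```

Lowercasing only the value used in the comparison leaves the original input untouched, so it can still be displayed exactly as the user typed it.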
  • 2021-08-24 15/44 Date Data Types. Dates. So we worked with strings, we worked with numbers, I think we’re ready for dates. I say that with a heavy sigh because dates come with extra complications, and they are one of the more difficult datatypes to work with in any programming language. The most common thing we need when working with dates is simply the current date and time. We use this a lot when we’re logging errors or saving records in databases: we want to know when it was saved, or when that record was written, or when something happened. So the way to get the current date and time is by using the datetime library. Now, we haven’t covered libraries yet; trust me on this one, we’re going to get there. So for now, don’t panic, just know that I’m using a function, but it is in a library. This is a very Python type of thing to do. There are a lot of libraries that do cool stuff and save us a lot of time; they are your friends. So I’m going to use the datetime library, and in particular the datetime function in the datetime library, to ask for the current date and time. So we’ll just take a look at that and see how it shows in code. So the line there, from datetime import datetime, is basically saying, get me the datetime function from the datetime library. More on that later in the libraries module. Then I can call datetime.now(), and that will return the current date and time. Notice, hey, I’m trying to be good with comments here. I’m adding comments to my code, so I can remember all of
this. So the now function returns a datetime object. So I can say “Today is:” and then the date. We were learning about datatype conversions: if I just pass it current_date, it’ll go: what? One’s a string, one’s a date, I’m confused. That’s okay. You can just convert your date into a string, just like we can convert numbers into a string, so we can print the output out on the screen. So I now have a way of getting the current date using datetime.now(), and if I convert the date to a string datatype, I can even combine it with other text and display it in the output. There’s a whole bunch of functions you can use once you start playing with datetime; that’s why they’re so wonderful to work with. So again, I go ask for datetime.now(). I’m storing that in a variable called today, and then I can display today’s date, so that’s the same code I had before. But you’ll actually notice at the top, I’ve added another function I want to use, called timedelta. It’s in the same library, that datetime library, but it’s kind of cool, because it allows me to say how many days from today or how many weeks from today something is. So I can define a time period of one_day, and it’s a timedelta of days equals one. If I want days equals three, it would be three days. If I wanted to measure a week ago, I could say weeks equals one. And I simply say, yesterday equals today minus one_day. Now, think about how complicated that really is, because if today is the 1st of March, today minus
one day, well that’s 28th of February,今天减一天 就是2月28日 but wait was it a leap year if so, then it’s 29th of February.但如果是闰年的话 那就是2月29日 I don’t want to have to write programming logic that figures all that out.我不想再写代码来判断这个了 So it’s much easier for me to use this timedelta function所以调用timedelta函数就容易得多了 and let the date magic of Python work,就让Python的日期函数去完成魔法吧 and say just what was the date one day ago?就拿一天前是哪一天来说吧 Then I can just print that out on the screen.我可以将它输出在屏幕上 So today was when I printed the slide,今天 我打印幻灯片的日期 it was the sixth of June,就是6月6日 and then yesterday was the fifth of June.那么昨天就是6月5日 So timedelta and these featurestimedelta和这些特性 really make dates worthwhile,极大提高了日期函数的可用性 and save you a lot of time coding.为你节省了许多写代码的时间 What it comes down to is when you are working with dates,总之 当你在处理日期时 if there’s something you want to do,如果你想要达到什么目的 there’s probably a function in datetime or another library或许在datetime或其它库中就有某个函数 that’ll do it for you.能够帮你完成 If you find yourself trying to count how many days in a week如果你想计算一周有几天 or months in a year or anything like that,或者一年有几个月之类的问题时 chances are there’s a function that’ll do it for you很可能某个函数能替你完成 to save yourself some time.从而节省你敲代码的时间 Now, what if I’m printing a date on the screen?现在我要在屏幕上显示一个日期该怎么做呢? 
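The timedelta arithmetic described above can be sketched roughly as follows. This is a minimal sketch, not the code shown on the slide: the helper name `day_before` and the fixed March dates are illustrative assumptions, chosen so the leap-year case the speaker mentions is visible.

```python
from datetime import datetime, timedelta


def day_before(moment: datetime) -> datetime:
    """Return the moment exactly one day earlier (hypothetical helper)."""
    one_day = timedelta(days=1)
    return moment - one_day


# Fixed dates (illustrative, not from the video) to show the leap-year handling.
print(day_before(datetime(2023, 3, 1)))  # 2023-02-28 00:00:00
print(day_before(datetime(2024, 3, 1)))  # 2024-02-29 00:00:00 (leap year)

# The same arithmetic on the current date and time.
today = datetime.now()
one_week = timedelta(weeks=1)
print("Today is: " + str(today))
print("Yesterday was: " + str(day_before(today)))
print("One week ago was: " + str(today - one_week))
```

The subtraction itself takes care of month lengths and leap years, which is the logic the speaker does not want to write by hand.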
By default it was displaying a very long date value,默认情况下 它会显示一长串的日期数据 date, time, hours, minutes, seconds, microseconds.日期 时间 时 分 秒 微秒 If you want to format it differently,如果你想用不同的格式 you could absolutely just request parts of it.你完全可以只要其中的部分 So I can ask for current_date.day所以我可以写成current_date.day So I’ve taken datetime.now(),current_date=datetime.now() the current date and time, and stored it in当前日期和时间存储在 a variable current_date,current_date变量中 and then I can say, just give me the day portion,然后我可以说 只给我“日”这部分 just give me the month portion, just give me the year portion,只给我“月”这部分 只给我“年”这部分 there’s also one for hours, minutes, and seconds.还可以只要时 分 秒 That way, I can decide what part of a date is important to me这样一来 我需要存储 使用或者输出日期数据时 when I’m either saving that data or using that data or displaying that value.就能突出其中重要的部分 Now, sometimes somebody will give you a date,有时候 别人会给你一个日期 and then you need to store it as a date.而你需要把它存储为日期 Okay. Maybe that doesn’t make a lot of sense,好吧 我在说什么 but think back to what I was saying in the numbers module,但回想一下我在数字那一节所说的 when I was talking about when you use the input function当时我说 当你使用input函数 and you ask for a value,获取一个值时 it always gets stored as a string.而这些数值总是存储为字符串 So I say, hey when’s your birthday?所以当我说 嘿 你生日是哪天? 
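Pulling individual parts out of a datetime, as described above, might look like the following minimal sketch. The helper `date_parts` is a hypothetical name added here for illustration; the attributes it reads (`day`, `month`, `year`, `hour`, `minute`, `second`) are the standard ones on a datetime object.

```python
from datetime import datetime


def date_parts(moment: datetime) -> dict:
    """Collect the individual fields of a datetime (hypothetical helper)."""
    return {
        "day": moment.day,
        "month": moment.month,
        "year": moment.year,
        "hour": moment.hour,
        "minute": moment.minute,
        "second": moment.second,
    }


current_date = datetime.now()
print("Day: " + str(current_date.day))
print("Month: " + str(current_date.month))
print("Year: " + str(current_date.year))
print(date_parts(current_date))
```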
I say my birthday is on the fifth of June 1999,你说 我生日是1999年6月5日 yeah, we’ll go with that.对的 就用这个 Then you want to store that value,然后 你要存储这个值 because right now it’s a string, you want to store it as a date.现在它是字符串 而你要把它作为日期存储 One of the things that’s more complicated is其中更为复杂的是 when you take a string, and you say store it as a date,当你得到一个字符串 你要将其存为日期 you have to know was the date given to you as day, month, year,你必须知道其格式是日/月/年 or as month, day, year,还是月/日/年 were you passed a two-digit year or a four-digit year,你输入的是2位数字还是4位数字的年份 did they separate the day and month以及分隔日和月 with slashes or dashes or just spaces.用的是斜杠 破折号 还是空格 This is one of the reasons dates are so much fun to work with,这就是为什么处理日期相关的数据如此有趣 or so challenging to work with.或者充满了挑战 So if I run this code here,所以 如果我运行这段代码 you’ll see I’ve used the function strptime.你会看到我调用了strptime函数 So this basically allows me to say,这基本上就是说 here’s the format I will be receiving.这是我将要得到的日期格式 So I am expecting a date format所以我期望的日期格式是 which is day, then month, then year.日/月/年这样子 If you look up the documentation on this function,如果你查阅下这个函数的文档 it’ll tell you all the little abbreviations to use for two-digit year,它会告诉你年份的2位数字缩写 four-digit year, month, day etc.或4位数写法 以及月 日的写法等 And then, I can take that string that was received when I used the input function,这样我可以把调用input函数时接收到的字符串 and convert that into a datetime object.转换成一个datetime对象 Now, that seems like a lot of work 现在 看起来做了这么多 if all I’m doing is displaying the date on the screen. 
只是为了把日期显示在屏幕上 But remember there’s some really cool functions但是记住 日期函数 that come with datetime objects,还包含很多很酷的函数 things that allow me to add or subtract days and so on.比如加减天数等 That’s why it’s worth sometimes taking a date value that’s stored in a string,这就是为什么值得花时间去获取一个字符串类型的日期 and actually converting it into a datetime object.然后转换成datetime对象 If all you’re doing is storing it, you can just leave it as a string,如果你只是存储日期 保留为字符串类型即可 but if you want to play with all those cool functions,但如果你想用到那些更酷的功能 you need it to actually be stored as a datetime,就确实需要将其存储为datetime对象 and that’s when you’re gonna need that strptime.而这时你就需要strptime函数了 So then I can do things like say, what was the day before my birthday?现在我就可以得到 比如我生日前一天的日期等 So once I’ve converted it using that strptime function,一旦我采用strptime函数进行转换 then I have the ability to use that timedelta function we saw earlier,我就能调用前面提到的timedelta函数 and find out what was the date one day before my birthday,从而查出我生日前一天是哪天 and so on. Awesome.功能多着呢 厉害吧 But as soon as you start doing this in your code,但是 一旦你在编程中开始使用此方法 you’re going to want to add some exception handling,你得添加一些异常情况处理句柄 because eventually someone’s going to enter a value,因为总会有些人输入的值 and in the format they enter, they might enter the 30th of February,以及格式 可能会像2月30日这样 or they might enter month, day, year,或者可能输入月/日/年格式 instead of day, month, year,而不是你所期待的日/月/年 and boom, your code’s going to blow up.然后 你的程序就崩掉了 So let’s make sure we handle exceptions gracefully所以我们要确保能够从容处理异常情况 if somebody enters the date in the wrong format,即使有人输入了错误的日期格式 rather than having it just blow up in their faces.程序依然不会崩溃掉 So error handling is also very important,因此 错误处理也是非常重要的 and we’re going to cover that in a later module.这点我们将在之后的章节中讲解
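Converting a typed-in birthday with strptime, and handling bad input, could be sketched like this. It is a minimal sketch under stated assumptions: the format string `%d/%m/%Y` (day/month/four-digit year separated by slashes), the helper name `parse_birthday`, and the sample entries are all illustrative choices, not the code from the slide.

```python
from datetime import datetime, timedelta

# Assumed input format: day/month/four-digit year, e.g. 05/06/1999.
DATE_FORMAT = "%d/%m/%Y"


def parse_birthday(text: str):
    """Return a datetime for text like '05/06/1999', or None if it does not parse."""
    try:
        return datetime.strptime(text, DATE_FORMAT)
    except ValueError:
        # strptime raises ValueError both for impossible dates (30/02/1999)
        # and for text that does not match DATE_FORMAT (1999-06-05).
        return None


for entry in ["05/06/1999", "30/02/1999", "1999-06-05"]:
    birthday = parse_birthday(entry)
    if birthday is None:
        print(entry + " is not a valid dd/mm/yyyy date")
    else:
        print("Birthday: " + str(birthday))
        print("Day before: " + str(birthday - timedelta(days=1)))
```

One thing exception handling cannot catch: an entry such as 06/05/1999 typed as month/day/year still parses, just as the 6th of May, so the expected format has to be made clear to whoever is entering the date.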
  • 2021-08-24想要入职谷歌,脸书,微软等公司推荐学习的5种编程语言Hey everyone welcome to CS dojo嗨 大家好 欢迎收看《开发大师》 my name is YK and I’m your host我是主持人YK and today we’re going to talk about今天我们将讨论 the top five programming languages to learn能让你在谷歌 脸书 微软等公司 for getting a job at companies like Google, Facebook, Microsoft etc找到工作的五大编程语言 so the obvious question here might be那可能你们想问的是 does it really matter which languages you learn如果想在这些公司找到工作 if you’re trying to get a job at one of these companies?学会某种编程语言真的重要吗? My answer would be yes it does我的答案是肯定的 but not directly但不是直接影响 What I mean by that is我的意思是 when you have a job interview with one of these companies当你应聘这些公司的 as a software engineer candidate软件工程师职位时 the most important thing they’ll usually look for is not他们通常最看中的并不是 what specific language or technology you’ve been using你会使用哪种特定语言或技术 instead they tend to look for mostly your coding skills而是倾向于考察你的编码能力 your problem-solving ability解决问题的能力 and your data structures and algorithms knowledge以及你的数据结构和算法知识 So you might say wait YK所以你可能会说 等等 YK so it doesn’t really matter which languages I learn then?那么我学习哪种编程语言并不重要吗? my answer to that would be我的回答是 actually it still matters a lot实际上仍然很重要 I’m going to explain my reasoning behind that in this video我将在此视频中解释其背后的原因 But if you just want to find my lists但如果你只想查看推荐清单 just keep over to this time in this video (2:54)可以直接快进到视频的2分54秒 Okay, so if these large companies don’t care that much好 所以如果这些大公司并不太在意 about which languages you know你会哪种编程语言 then why does it matter at all which languages you learn那么为什么在这些公司找工作时 if you want to get a job at one of these companies掌握某种编程语言又很重要呢? There are three reasons for this有三个原因可以解释 Reason number one第一个原因 when you apply for a job at one of these big companies当在这些大公司求职时 How do you think they will decide你认为他们是如何决定 if they should invite you for a job interview?给不给你面试机会的呢? 
Of course there are a few different aspects to this当然会有不同的考量方面 for example your education your personal projects and so on例如你的学历 个人项目等等 but the biggest factor is usually your work experience但最重要的因素通常是你的工作经验 and how do you get the experience in the first place?那么一开始如何获得经验呢? probably at smaller less known companies first或许首先是在鲜为人知的小公司 and actually smaller companies and startups tend to care more about事实上 小公司和创业公司往往更看中 which specific language or technology you know你掌握的特定编程语言或技术 so for example a small start-up might say比如 小型初创公司可能会说 we need someone who can help us create an iOS app tomorrow我们需要明天就能帮忙写iOS应用的人 or we need someone who knows或者在下个月之前能精通 JavaScript really well by next monthJavaScript的人 so depending on which languages you know所以对于这些小公司 it’ll actually be easier or harder for you to get a job你找到工作的难易程度 at one of these smaller companies取决于你会哪种编程语言 and reason number two第二个原因 I think you should learn a programming language我认为你应该学习 that aligns with your interests符合你兴趣的编程语言 so you have more motivation and reason for learning这样就有更多的学习动力和学习理由 So for example if you’re interested in learning to make an iPhone app例如 如果你对创建iPhone应用程序感兴趣 you should probably learn Swift你应该学习Swift and if you’re interested in data science如果你对数据科学 machine learning or science in general机器学习或科学之类的感兴趣 Python might be a good choice for youPython会是个不错的选择 reason number three第三个原因 some programming languages are simply easier to learn than some other ones某些编程语言学起来会相对容易一点 For example I would say比如我认为 JavaScript is easier to learn than JavaJavaScript比Java好学 and Python is easier to learn than C++Python比C++好学 So based on that因此基于上述 I decided to use the job market and ease of learning我决定将就业市场和易学程度 as the two main criteria作为两大主要标准 for making my list of top 5 programming languages to learn用于列出我推荐的五大编程语言清单 Ok so here’s my list好 接下来是我的清单 number 5: Ruby第五名 Ruby Ruby is a programming language from Japan这门编程语言源自日本 it became popular due to the popularity of因Ruby on Rails的流行 something called Ruby on Rails而受欢迎 and 
Ruby on Rails was at some point I would sayRuby on Rails曾是某段时间 the hottest framework for building websites我认为的最热门网站设计框架 although it’s not as popular as it used to be anymore虽然它的流行度已不如从前 Still a lot of companies use Ruby on Rails today但如今仍有很多公司还在使用 and Ruby is a really simple and easy language to learnRuby也很简单且易学 number 4: Swift第四名 Swift Swift is now the primary language for building an iOS app如今Swift是开发iPhone或iPad上 whether it’s for iPhone or iPadiOS应用的主要语言 if you have the skill由于很多公司都想开发iOS应用 it should be fairly easy for you to get a job所以如果你掌握了这门语言 since many companies want to build iOS apps那么找工作会变得轻而易举 I haven’t used this language extensively myself我自己没有深入用过这门语言 But it seems like a fairly simple and easy language to learn但看起来还是相当简单易学的 the only downside of Swift is thatSwift的唯一缺点是 it’s not really cross-platform它不是真正的跨平台 Meaning, it’s not easy to create an iOS app with Swift也就是说 如果没有Mac if you don’t have a Mac开发iOS应用会很麻烦 number 3 : Java第三名 Java Java is probably one of the most widely used programming languages today它可能是当今使用最广泛的编程语言之一 You can use Java to build many things including Android apps它可以用来构建很多东西 包括安卓应用 Many companies use Java frameworks to create websites as well许多公司也用Java框架来搭建网站 Unfortunately, it’s not the easiest language to learn不幸的是 它学起来并不是非常容易 since it’s a bit more complex than the other languages in this list相比清单中的其他语言会要复杂一些 number 2 : Python第二名 Python Python is also a very popular programming language至少在北美 Python是一种 at least in North America非常流行的编程语言 and many companies use it to create websites许多公司用它的Django和 with frameworks like Django and flaskFlask框架来搭建网站 This is probably the language of your choice如果你对数据科学 if you’re interested in things机器学习或科学之类的感兴趣的话 like data science machine learning or science in general那么Python会是你的合适之选 it’s also one of the main languages used at Google它也是谷歌使用的主要编程语言之一 so it’s popular at both large companies and smaller companies所以它在大公司和小公司中都很受欢迎 and number 1 : Javascript第一名 JavaScript Javascript used to be a language that 
only ran on your browser它过去是只在浏览器上运行的语言 whether it’s Chrome Firefox or Safari比如谷歌 火狐或Safari But recently people started using it to create back-end code但最近人们开始用它来写后台代码 meaning the code that runs on your servers也就是在服务器上运行的代码 not just front-end code而不只局限于前端代码 meaning the code that runs on your device也就是在设备上运行的代码 whether it’s a phone or a laptop比如手机 笔记本电脑 Javascript is a great language to learnJavaScript是一门 for getting a job很好的求职语言 And it’s also fairly simple and easy to learn而且它也非常简单易学 now if you’re just getting started with programming如果你是一名编程初学者 I’d recommend that you start with我会建议你从Python either Python or JavaScript或JavaScript开始 And I think your optimal choice here mostly depends on your interests我认为最好的选择主要取决于你的兴趣 For example if you’re interested in UI例如 如果你对界面设计或 or user experience design用户体验设计感兴趣 Then JavaScript is probably the way to go.那么JavaScript会是你的选择 If you’re more into logic, machine learning or science in general如果你更喜欢逻辑 机器学习或科学之类的 Python might be the right choice for you那么Python就很适合 Now I have three more languages for honorable mentions此外 还有三种语言我想特别提一下 But I have a quick announcement to make这里宣布一则简短的消息 I’ve just launched my patreon page where you can我刚成立了我的patreon主页 chip in a few dollars to join a private Facebook group在这里可以付费加入私人脸书小组 and a private monthly live Stream以及收看每月私人直播 where you can ask me any questions你可以在那里问我任何问题 I’d much appreciate it if you can head over to如果你能访问主页 csdojo.io/pat to support my channel来支持我的频道 我将非常感激 so here’s my honorable mention number 1 :好 这是我的第一个特别推荐 Go, which is also known as GolangGo 也被称为Golang This language was originally developed at Google这门语言最初是由谷歌开发的 but it’s used extensively in many companies today但如今它被很多公司广泛使用 Go is known for its efficiency and it’s simple syntaxGo以其效率和简单的语法而著称 And it’s actually becoming one of the事实上它已成为当今 most popular programming languages today最受欢迎的编程语言之一 So if you’re looking to add a language to your skill set所以 如果你想多掌握一门语言技能 this is the first language I’d definitely consider我绝对会推荐这个为首选 
Honorable mention number 2 : Kotlin第二个特别推荐 Kotlin Kotlin is a relatively new language它是一门相对较新的语言 And it works in both a Java-based environment可以在基于Java和 and a JavaScript-based environment基于JavaScript的环境中使用 Kotlin was recently officially supported by Android安卓最近正式支持了Kotlin so it’s possible that Kotlin will become the primary language因此将来它可能会成为 for developing Android apps in the future开发安卓应用的主要编程语言 honorable mention number 3 : SQL第三个特别推荐 SQL with some people pronounce as S.Q.L.一些人也读作 S Q L If you search for what programming language to learn如果你搜索该学习哪种编程语言 you might run across on an article or two that mention SQL或许会看到一两篇提及SQL的文章 But if you’re beginner it might be confusing because但如果你是初学者 可能学起来会费解 SQL is different from all the other languages因为SQL与我们在此视频中 that we talked about in this video讨论的其他语言都不相同 SQL or S.Q.L. is a programming language thatSQL是一门仅专注于 solely focused on managing databases管理数据库的编程语言 It’s usually used in conjunction with one of the other languages通常会与其他编程语言结合使用 So learning SQL as your first programming language因此把SQL作为学习的第一门编程语言 is probably not the best idea可能不是个好主意 and in my opinion it’s not that easy to learn SQL我认为 如果没有一些实际生活中的数据 without having some real-life data to play with学起来就会比较困难 So I’d focus on learning other languages first所以我会先专注于学习其他语言 Okay, that’s all I have for this video.好了 以上就是今天的视频 Thanks as always for watching感谢你们的收看 and again I’d much appreciate it if you can如果你能访问我的patreon主页 head over to my patreon page right here to support CS dojo支持《开发大师》 我将非常感激 and let me know in the comment below如果你对我将来录制视频 if you have any requests about有任何想法 what kind of videos I should make in the future就在下方评论中告诉我吧 and I’ll see you in the next video下期视频见
  • 2021-08-24#5 数据归约Let’s imagine that you work for想象一下 你任职于 a major streaming media provider, right一家主流流媒体提供商 So you have I don’t know some 100 million subscribers有超过一亿订阅者 So you’ve got I don’t know ten thousand videos on your site你的网站上有数以万计的视频 or many more audio files, right以及数量更多的音频文件 对吧 So for each user you’re gonna have collected information你需要收集每一个用户的信息 on what they’ve watched,他们观看的内容 when they’ve watched it, how long they’ve watched it for他们什么时候看的 看了多久 whether they went from this one to this one?他们是从这个跳转过来的吗? Did that work? Was that good for them?跳转是否成功?用户体验怎么样? And so maybe you’ve got 30,000 data points per user也许从一个用户身上你就能收集到三万个数据点 We’re now talking about trillions of data points我们现在谈论的是十亿级的数据点 and your job你的工作是 is to try and predict what someone wants to watch or listen to next.试着预测用户接下来想看到或听到的内容 Best of luck.祝你好运 So we’ve cleaned the data, we’ve transformed our data我们已经做了数据清理和数据转换 everything’s on the same scale统一了数据规模 we’ve joined datasets together也合并了不同的数据集 The problem is because we’ve joined datasets together那么问题来了 数据集合并之后 perhaps our datasets has got quite large right,它变得特别大 对吧 now or maybe we just work for a company that has a lot a lot of data.当然可能我们的公司本来就有很多很多的数据 Certainly the general consensus these days的确 我们现在一般倾向于 is to collect as much data as you can right,尽可能多地收集信息 this isn’t always a good idea.但这并不总是最好的方式 We what we want remember,时刻牢记 is the smallest most compact and useful dataset we can我们要的是一个最精简 最完整 且最有效的数据集 otherwise you’re just going to be wasting CPU hours or GPU hours,否则只是在滥用中央处理器和图形处理器 training on this, wasting time.并把时间浪费在算法训练上 We want to get to the knowledge as quickly as possible我们希望尽快地从数据中提取有用的信息 and if you can do that with a small amount of data如果通过一个小的数据集就能做到 that’s going to be great.那就太棒了 So we’ve got quite an interesting dataset to look at today based on music.我们今天来看一个很有趣的关于音乐的数据集 It’s quite common these days when you’re building something like a streaming service如果你今天想搭建一个像声田这样的 for example Spotify流媒体服务器 You might want to 
have a recommender system你通常都会需要一个推荐系统 This is an idea where you’ve maybe clustered people它的概念是根据音乐品味 who are similar in their tastes,将用户分类 you know what kind of music they’re listening to能了解他们喜欢的 and you know the attributes of that music音乐类型及属性 and if you know that通过这些 you can say well this person likes high tempo music就能知道 比如 这个人喜欢节奏强的音乐 So maybe he’d like this track as well.那么他可能也喜欢这首歌 And this is how playlists are generated.这就是播放列表生成的原理 One of the problems is that you’re gonna have to为了能够运用机器学习 produce descriptions of the audio你可能碰到的问题是 on things like tempo and how upbeat they are你需要对音频进行描述 如其节奏 in order to machine learn on this kind of system, alright.或者判断它们是否欢快 And that’s what this dataset is about.这个数据集就是关于这些的 So we’ve collected a dataset here today.现在我们已经有了数据集 There is, lots and lots of metadata on music tracks right.它包含许多歌曲元数据 Now these are freely available tracks and freely available data,这些歌曲和数据是完全免费的 we’ll put a link in the description if you want to have a look at it yourself如果你想看 我们会在描述区放上链接 I’ve cleaned it up a bit already我已经通过 because obviously I’ve been through the process of cleaning and transforming my data.数据清洗和数据转换进行了一些处理 So we’re gonna load this now this takes quite a long time to do,我们将载入数据 这会花上好一会儿 because there’s quite a lot of attributes and quite a lot of instances因为它有大量的属性和实例 [music][音乐] It’s loaded right?载入好了吧? How much is this data?到底有多少数据? Well, we’ve got 13,500 observations我们有13500个观测值 that’s instances,即实例 and we’ve got seven hundred and sixty-two attributes, right?以及762个属性 对吧 So that means another way of putting this if in sort of machine learning parlance 用机器学习的术语说 is we’ve got 13,000 instances and 760 features. 
就是我们有13000个实例和760个特征 Now these features are a combination of things.这些特征是由一系列内容组成的 So let’s have a quick look at the columns we’re looking at我们来快速看一下数据列 so we can see what this datasets about.了解一下这个数据集 So names of music all right,输入names(music_all) so we’ve got some 760 features or attributes返回了约760个特征或属性 and you can see there’s a lot of slightly meaningless text here.可以看到 有很多意义不太清晰的文字内容 But if we look at the top you’ll see但回到数据顶部 可以看到 some actual things that may be familiar to us.一些我们熟悉的概念 So we’ve got the track ID, album ID the genre, right?有歌曲ID 专辑ID和流派 So genre was an interesting one流派很有趣 because maybe we can start to use因为或许我们可以通过 some of these audio descriptions to predict what genre this music is or something like that.这些音频描述来预测音乐的流派等 Things like the track number and the track duration and,还有歌曲编码 歌曲时长 then we get onto the actual audio description features.然后才是具体的音乐特征描述 Now these have been generated by two different libraries,这些信息来源于两个不同的数据系统 the first is called Librosa,一个叫Librosa which is a publicly available library它是一个公共数据库 for taking an mp3 and calculating musical sort of attributes of it.我们可以从中读取并分析mp3类型音乐的属性 What we’re trying to do here我们现在要做的 is represent our data in terms of attributes.是用属性来表示我们的数据 An mp3 file is not an attribute. It’s a lot of data.mp3文件不是属性 它是一组很大的数据 So can we summarize it in some way?那能否通过某些方法总结它? Can we calculate by looking at the mp3?能否通过分析mp3文件进行计算? 
[music][音乐] What the tempo is,它的节奏怎么样 what the amplitude is, how loud the track is, these kind of things.振幅如何 音量有多大等等 This is the kind of thing we’re measuring.这些是我们想要测算的 And a lot of these are going to go into a lot of detail这些细节特征 down at kind of a waveform level.将以声波的形式呈现 So we have the Librosa features first,我们首先看到的是Librosa特征 and then if we scroll down接下来往下拉 after a while we’d get to some Echo Nest features.很快可以看到Echo Nest的特征 Echo Nest is a companyEcho Nest是一家专注于 that produces very interesting features on music.研究音乐特征的公司 它们很有意思 Actually, these are the features that实际上 这些特征是声田 power Spotify’s recommend system and numerous others.和很多其他软件在推荐歌曲时用到的 We’ve got things like acousticness.比如说原声性 How acoustic does it sound.这首歌曲的原声性怎么样 We’ve got instrumentalness.还有乐器性 I’m not convinced that‘s a word.我不太确定真的有这个词 Speechiness.朗读性 They how how to what extent is it speech or not speech, alright.歌曲中有多少歌词不是唱出来的而是说出来的 And then things like tempo how fast it is,还有节奏性 指的是歌曲节奏有多快 and valence,以及积极性 how happy does it sound, right.它听起来有多让人高兴等 A track of zero would be我猜如果得分是0 quite sad, i guess,就说明这音乐很悲伤 and a track of one will be really high happy and upbeat.得分是1就说明音乐很欢快 And then of course we’ve got a load of features.当然还有很多其他的特征 I’ve labeled temporal here and对于这些特征 我在这里已经标记上了temporal these are going to be based on the actual music data themselves.之后会根据实际的音乐数据本身进行调整 Often when we talk about data reduction,通常当我们谈到数据归约时 what we’re actually using is dimensionality reduction, alright.我们一般是通过降维来实现的 Well way of thinking about it is we可以这样理解 从一开始 as we started we’ve been looking at things like attributes and我们就接触了像属性这样的概念 we’ve been saying what is the mean or a standard deviation我们会谈论数据中某些属性的 of some attribute on our data.平均值或标准差等 Right. 
But actually when we start to talk about clustering但实际上当我们谈到聚类分析 and machine learning和机器学习时 we’re going to talk a little bit more about dimensions.我们需要更多地讨论维度 Now this is in many ways实际上很多时候 the number of attributes is the number of dimensions.属性的数量就是维度的数量 It’s just another term for the same thing.只是叫法不同 But certainly from a machine learning background,从机器学习的角度看 we refer to a lot of these things as dimensions.我们把这些东西叫做维度 so you can imagine if you’ve got some data here想象你有一些数据 So you’ve got your instances down here你的实例在这里 and you’ve got your attributes across here属性在这里 So in this case our music data, we’ve got each song.我们还有所有的音乐数据 So this is song one, this is song two, song three,比如说这是歌曲一 歌曲二 歌曲三 and then all the attributes of a tempo,和所有的属性 像是节奏感 Echo Nest attributes, it’s tempo and things like this.Echo Nest的属性比如节奏感之类的 These are all dimensions in which this data can vary,数据会在这些维度上有所不同 so they can be different in the first dimension, which is the track ID,比如说在第一个维度 歌曲ID上就不一样 but they can also down here be different in this dimension, which is for tempo.但它们也可能是在这个节奏的维度上有不同 When we say some data is seven hundred dimensional当我们说数据拥有700个维度时 what that actually means is it has seven hundred different ways实际上是指数据可以在700个属性上 or different attributes in which it can vary.有不同的组合方式 And you can imagine that first of all可以想象 首先 this is going to get quite big quite quickly,数据的体量会增长得很快 right, seven hundred attributes seem like a lot to me.700个属性对于我来说很多 Right, and depending on what the algorithm you’re running is,其次 具体看你用的是什么算法 it can get quite slow这么大量的数据 when you’re running on this kind of size of data.它可能会运行得很慢 And you can imagine this is a relatively small dataset可以想象 和声田每天的数据量相比 compared to what Spotify might deal with on a daily basis.这样的数据量已经算是小的了 But another way to think about this data is actually另一种理解这个数据的方式是 points in this space.空间里的数据点的分布 so we have some 700 different attributes700个属性意味着数据有 that you can vary,非常多不同的可能性 and when we take a specific track,当我们拿出一首特定的歌曲 it 
sits somewhere in this space它会落在这个空间内的某一处 So if we were looking at it in just two dimensions, you know如果我们只看其中的两个属性 track one might be over here,歌曲一可能在这里 and track two over here and track three over here歌曲二在这里 歌曲三在这里 And then three dimensions,如果是三个维度 track four might be back at the back here.歌曲四可能就在这个后面 You can imagine the more dimensions we add,可以想象 当我们不断加入维度 the further spread out these things are going to get,这些歌曲会更分散 But we can still do all the same things we can但哪怕有700个维度 我们也能进行 in three dimensions, in 700 dimensions.和只有3个维度时 一样的处理 It just takes a little bit longer.只是时间会长一些 So one of the problems is that有个问题是 some things like machine learning don’t like to have too many dimensions.机器学习这类技术不太喜欢处理过多的维度 So things like linear regression can get quite slow因此如果你有成千上万个属性 if you have tens of thousands of attributes or dimensions做像线性回归这类的分析就会特别慢 So remember that perhaps the the default response to anyone collecting data记住 数据收集者可能会尽可能多地收集数据 is just to collect it all and worry about it later.而并不考虑分析的复杂性 Right, this is a time of what we when you have to worry about it.但你需要考虑这个问题 What we’re trying to do is我们尝试做的是 remove any redundant variables.剔除多余的变量 If you’ve got two attributes of your music如果你的音乐中已经有了 like tempo and valence,两个几乎一样的属性 that turn out to be exactly the same,如节奏和积极性 why are we using both for making our problem a little bit harder, right.那就不用两个都保留 那会让分析更加困难 Now an actual fact Echo Nest features are pretty good,实际上Echo Nest的数据特征很不错 they don’t tend to correlate that strongly,它们之间的相关性没有那么强 but you might find where we’ve collected some data on a big scale但也许你收集了非常大量的数据 actually a lot of the variables are very very similar all the time但很多变量其实是非常相似的 and you can just remove some of them这时候你就可以剔除 or combine some of them together或合并部分变量 and just make your problem a little bit easier.以简化分析工作 So let’s look at this on the music dataset and see what we can do.我们来看这个音乐数据集 思考能做什么 So the first thing we can do is we could remove duplicates right.首先可以去除重复项 It sounds like an 
obvious one这听起来显而易见 and perhaps one that we could also do during cleaning,并且是在数据清理时就能完成的工作 but exactly when you do it doesn’t really matter as long as you’re paying attention.但只要留心 你可以任何阶段做这一步 what we’re going to say is music all equals unique of music all.我们来输入music_all=unique(music_all) and what that’s going to do is look for find any duplicate rows这将找到数据中的重复项 and remove them.并剔除它们 The number of rows we’ve got will drop by some amount. Let’s see.数据集的行数会有所减少 我们来看看 Thinking.思考中 [music][音乐] This is where you need a timer.你需要一个秒表 Actually, this is quite a slow process事实上这会是个漫长的过程 you’ve got to consider that we’re going to look through every single row因为电脑正在扫描每一行数据 and try and find any other rows that match.并且从中找到重复项 Okay, so this is removed a bit about 40 rows终于 我们成功剔除了约40行数据 So this meant we had some duplicate tracks.这意味着之前我们有重复项 You can imagine that可以想象 things might get accidentally added to the database twice,有些数据可能被重复载入了 or maybe two tracks are actually identical或者是两首完全相同的歌曲 because they were released multiple times or something like this.被多次发布等等 Now what this is doing,这一步的作用 the unique function actually finds rows that are exactly the same是通过unique函数找到 for every single attribute or every single dimension, of course in practice,在每个属性和维度上都相同的歌曲 you might find that you have two versions of the same track,实际上 你可能有两首完全相同的歌曲 which differ by one second,它们只相差一秒 they might have slightly different attributes.或者只有非常微小的属性差别 Hopefully they’ll be very very similar.希望它们非常非常相似 So what we could also do is have a threshold where we said我们也可以设置一个门槛来判断 these are too similar, 这两首歌太像了 they’re the same thing.它们其实是同一首 The name is the same它们的名字相同 the artist is the same演唱者相同 and the audio descriptors are very very similar,音乐描述也非常非常相似 maybe we should just remove one of them, right我们也许要剔除其中一首 This is the other thing you could do.你可以这么做 Just for demonstration what we’re going to do is focus on我们会用数据中的 just a few of the genres in this dataset right,一些流派做说明 just to make things a little bit 
clearer for visualizations.这会使数据可视化的结果更清晰 We’re going to select just the classical, jazz, pop我们将选取经典 爵士 流行 and spoken-word genres, right,及说唱类型 because these have a good distribution of different amounts in the dataset.因为它们在数据中的分布数量差异较大 So we’re going to run that.我们来运行这行代码 We’re creating a list of genres.这将创建一个音乐流派的列表 We’re going to say music is music_all我们来选取数据 wherever the genre is in条件是歌曲的流派属性存在于 that list of genres we just produced, right,我们刚才创建的流派列表当中 and that’s going to produce a much smaller dataset这将返回一个小很多的数据集 of 1,600 observations它只有1600个观测值 with the same number of attributes or dimensions.以及同样数量的属性或者说是维度 Now normally you would keep most of your data in,一般来说 你可以保留所有的数据 this is just for a demonstration.这里只是为了做说明 But removing genres that aren’t useful to you for your experiment如果数据量过大 剔除那些对你的实验结果 is a perfectly reasonable way没有帮助的流派信息 of reducing your data size if that’s a problem.可以大大缩小数据集 Assuming they’ve been labeled right in the first place.前提是它们一开始就被正确地标注了 Right, that’s on someone else. 
That’s someone else’s job.那是另外的人该负责的 那是他的事儿 Let’s imagine but 1,600 is still too long.假设1600个观测值还是太多了 Now actually computers are getting pretty quick.但实际上电脑运行得很流畅 Maybe 1,600 observations is fine,也许它还可以应付1600个观测值 but perhaps we want to remove some more.但我们应该还可以剔除更多的数据 The first thing we could do is just chop off the data half way首先我们可以把数据分成两半 and keep about half.只取其中一半 So let’s try that first of all,我们先来试试 so we’re going to say the first music定义音乐集一为 that’s the first few rows of our music is数据集的前一半数据 rows 1 to 835即第1行到第835行 and all the columns.及所有数据列 So we’re going to run that.我们来运行看看 And that’s even smaller.返回的数据更少了 Right so we can start to whittle down our data.我们成功地缩小了数据量 This is not necessarily a good idea.但这并不总是好的 We’re assuming here我们假设了 that our genre is equally,不同流派是平均 you know, randomly sampled around our dataset.且随机地分布在数据集中的 That might not be true.事实可能并非如此 You might have all the rock first and then all the pop or something like that.可能它的顺序是先摇滚 后流行之类的 If you take the first few,如果只选取开头几行 you’re just going to get all the rock,可能会只有摇滚乐 right depending on what you like, that might not be for you.这取决于这是不是你想要的 也许不是 So let’s plot the genres in the normal dataset,我们在最初的数据集里绘制一个关于流派的图表 and you can see that we’ve got very little spoken word,可以看到 虽然数量很少 but it is there.但还是有说唱类 we have some classical international jazz and pop还有经典 国际 爵士和流行 in sort of roughly the same amount.数量都差不多 If we plot after we’ve selected the first 50如果我们选取头50行信息 再绘制图表 you can see we’ve lost two of the genres, right就会缺失两个流派的数据 we only have classical International and jazz这里只有经典 国际和爵士乐 and there’s hardly any jazz.且爵士乐的数量还很少 That’s not a good idea.这不太妙 So don’t do that unless you know因此 除非你确定数据是完全随机打乱的 that your data is randomized.否则不要只选取头几行 So this is not this is not giving us a good representation of genres因此 如果我们想要根据音乐特征 if we wanted to predict genre来预测歌曲流派 for example based on the musical features,这些流派数据没有代表性 cutting out half the genres seems like an unwise decision.直接删掉一半的流派并不太明智 So a better 
thing to do will be更好的处理方式 to sample randomly from the dataset.是在数据集中随机取样 So what we’re going to do我们现在要做的是 is we’re going to use the sample function通过sample函数 to give us 835 random indices into this data来生成835个随机标记值 and then we’re going to use that并通过它们来标记出 to index our music data frame instead.我们的音乐数据样本 Alright, that’s this line here.运行这行代码 And hopefully this will give us a better distribution希望返回的数据分布能更完整 if we plot the original again,回到原始数据的图表 it looks like this是这个样子的 and you can see we’ve got a broad distribution数据分布情况完整 and then if we plot the randomized version接下来对随机处理后的版本进行图表绘制 You can see we’ve still got some spoken.在这里我们有一些说唱类型 It’s actually going up slightly,数量更多了些 but the distributions are broadly the same.但它们的数据分布基本一致 So this has worked exactly how we want.这就是我们想要的 So how you select your data如果你想精简数据 if you’re trying to make it a little bit smaller is very very important.精简方法是非常非常重要的 And consider that obviously we only had 1,600 here同时要考虑 虽然我们只有1600个观测值 and even this whole dataset is only 13,500 rows,甚至整个数据集只有13500行 you could imagine that you might have tens of millions of rows但想象一下你有上千万条数据 and you’ve got to think about this before you start just getting rid of them completely.你从一开始就要考虑这个问题 Randomized sampling is a perfectly good way of selecting your data.随机取样是非常好的数据选取方法 Obviously, it has a risk that但显然 这也有风险 maybe the distributions of your genres are a little bit off你的流派数据分布可能不够全面 and maybe you haven’t got very much of a certain genre.在某些流派上的数据量可能不够 You can’t guarantee你无法保证处理前后的 that the distributions are going to be the same on the way out.数据分布情况能完全保持一致 And if you’re trying to predict genre,如果想要预测流派 that’s going to be a problem.这会是个问题 So perhaps the best approach is stratified sampling.所以最佳的取样方式应该是分层取样 This is where we try and maintain这样子我们就可以保持 the distribution of our classes.各个类型的数据占比 So for example in this case genre.比如说回到流派 So we could say we already had 50% rock我们有50%的摇滚乐 30% pop and 20% spoken,30%的流行乐和20%的说唱音乐 and we want to maintain that 
kind of distribution on the way out, even if we only sample 50%, right? This is a little bit more complicated in R, but it can be done. And this is a good approach if you want to make absolutely sure the distributions of your sample data are the same as your original data. We just looked at some ways we can reduce the size of our dataset in terms of the number of instances, or the number of rows. Can we make the number of dimensions, or the number of attributes, smaller? Because that’s often one of the problems, and the answer is yes. There are lots of different ways we can do this, some more powerful and useful than others. One of the ways we can do this is something called correlation analysis. A correlation between two attributes basically tells us that when one of them increases, the other one either increases or decreases, in general, in relation to it. So you might have some data like this, with attribute one, and we might have attribute two, and they sort of look like this. These are the data points for all of our different data. Obviously we’ve got a lot of data points, and you can see that, roughly speaking, they kind of increase in this sort of direction here. Now it might be that this correlation is very, very strong. So basically, attribute two is a copy of attribute one, more or less. Maybe it doesn’t make sense to have attribute two in our dataset. Maybe we can remove it without too much of a problem. Alright. 
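Going back to the sampling point for a moment: the difference between plain random sampling and stratified sampling can be sketched in R as below. This is a minimal sketch under stated assumptions, not the code from the video: the `music` data frame, its `genre` column and the 1,000 toy rows are made up for illustration, with only the 50/30/20 split taken from the example above.

```r
# Hypothetical toy data: 50% rock, 30% pop, 20% spoken (proportions from the example).
set.seed(42)
music <- data.frame(
  id    = 1:1000,
  genre = rep(c("rock", "pop", "spoken"), times = c(500, 300, 200))
)

# Plain random sample of half the rows: class proportions are only
# approximately preserved and can drift, especially for rare classes.
random_idx    <- sample(nrow(music), size = 500)
random_sample <- music[random_idx, ]

# Stratified sample: draw 50% of the rows separately within each genre,
# so the class proportions of the sample match the original.
strata    <- split(seq_len(nrow(music)), music$genre)
strat_idx <- unlist(lapply(strata, function(rows) {
  rows[sample.int(length(rows), size = round(0.5 * length(rows)))]
}))
strat_sample <- music[strat_idx, ]

# Compare the genre proportions before and after sampling.
print(prop.table(table(music$genre)))
print(prop.table(table(random_sample$genre)))
print(prop.table(table(strat_sample$genre)))
```

The stratified version costs a few more lines, but the proportions in `strat_sample` are fixed by construction rather than left to chance.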
What we can do is something called correlation analysis, where we pit all of the attributes against all of the other attributes, we look for high correlations, and we decide, ourselves, whether to remove them. Now sometimes it’s useful just to keep everything in and try not to remove attributes too early. But on the other hand, if you’ve got a huge amount of data and your correlations are very high, this could be one way of doing it. Another option is something called forward or backward attribute selection. This is the idea that maybe we have a machine learning model or clustering algorithm in mind. We can measure the performance of that, and then we can remove features and see if the performance remains the same. Because if it does, maybe we didn’t need those features. So what we could do is train our model on, let’s say, a 720-dimensional dataset, get a certain level of accuracy, and record that. Then we could try it again by removing one of the dimensions and training on seven hundred and nineteen, and maybe the accuracy is exactly the same, in which case we can say, well, we didn’t really need that dimension at all, and we can start to whittle down our data this way. Another option is forward attribute selection. This is where we literally train our machine learning model on just one of the attributes, and then we see what our accuracy is, and we keep adding attributes in and retraining until our performance plateaus, and we can say, you know what? We’re not gaining anything now by adding more attributes. Obviously, there’s the question of which order you try this in. 
Usually randomly. So what you would do, for example with backward attribute selection, is train on all the data. You take one attribute out at random. If your performance stays the same, you can leave it out. If your performance gets much worse, you put it back in and you don’t try that one again. And you try a different one. You slowly start to take dimensions away and hopefully whittle down your data. Let’s have a quick look at correlation analysis on this dataset. You might imagine that if we’re calculating features based on the mp3 from Librosa or Echo Nest, maybe they’re quite similar a lot of the time, and maybe we can remove them. Let’s have a quick look. We’re just going to focus on one set of Librosa features, for simplicity. So we’re going to select only the attributes that contain this chroma kurtosis field, which is one of the attributes that you can calculate using Librosa. So I’m going to run that. We’re going to rename them, just for simplicity, to kurt1, kurt2, kurt3 and so on. And then we’re going to calculate a correlation matrix of each of these different features versus each other, like this. Ok, finally, we’re going to plot this and see what it looks like. Hopefully we can find some good correlations, and we could have candidates for just removing a few of these dimensions, if they’re redundant. And it’s not too bad. 
So you can see that we’ve got, for example, kurt7 here. Index 7 is fairly similar to 8: that’s a correlation of 0.65. Maybe that means that we could remove one of the two. This one here is 0.59. We’ve got a 0.48 over here. These are fairly high correlations. If you’re really stretched for CPU time, or you’re worried about the size of your dataset, this is the kind of thing you could do to remove them. Of course, whether 0.65 is a strong enough correlation that you want to completely remove one of these dimensions is really up to you, and it’s going to depend on your situation. One of the reasons that the correlations aren’t quite as high as you might think is that these libraries have been designed with this in mind. If Echo Nest just produced 200 features that were exactly the same, it wouldn’t be very useful for picking playlists. So they’ve produced 200 features that are widely different. So they’re not necessarily going to correlate all the time, right? 
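The correlation analysis described above can be sketched in R roughly as follows. This is a sketch under stated assumptions: the data is synthetic, the `kurt1`…`kurt4` columns merely mimic the renamed features, and the 0.6 threshold is an arbitrary judgement call rather than a value from the video.

```r
# Synthetic stand-in for a handful of feature columns (not the real Librosa data).
set.seed(1)
n    <- 500
base <- rnorm(n)
features <- data.frame(
  kurt1 = rnorm(n),
  kurt2 = rnorm(n),
  kurt3 = base + rnorm(n, sd = 0.3),  # deliberately correlated with kurt4
  kurt4 = base + rnorm(n, sd = 0.3)
)

# Pearson correlation of every column against every other column.
corr <- cor(features)

# List each pair once (upper triangle only) whose absolute correlation
# exceeds the chosen threshold; these are candidates for removal.
threshold <- 0.6
high <- which(abs(corr) > threshold & upper.tri(corr), arr.ind = TRUE)
candidates <- data.frame(
  a   = rownames(corr)[high[, "row"]],
  b   = colnames(corr)[high[, "col"]],
  cor = corr[high]
)
print(candidates)
```

Whether to drop one attribute of a flagged pair is still a manual decision; the code only narrows down where to look.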
That’s the whole point, and that’s a really useful feature of this data. We’ve looked at some ways we can try and make our dataset a little bit smaller. Remember, our ultimate goal is the smallest, most useful data we can get our hands on, right? Then we can put that into machine learning or clustering and really extract some knowledge. The problem is that with what we might do based on correlation analysis or forward and backward attribute selection, we might just be deleting data. And maybe the correlation wasn’t one. It wasn’t completely redundant. Do we actually want to completely remove this data? Is there another way we can transform our data to make more informed, and more effective, decisions as to what we remove? That’s PCA, or principal component analysis. At the moment, we’re just fitting one line through our two-dimensional data; there are going to be more principal components later, right? But what we want to do is pick the direction through this data, however many attributes it has, that has the most spread. So how do we measure this? Well, quite simply…
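The transcript cuts off before the measurement is explained, so the following is only a hedged sketch of the general idea, assuming spread is measured as variance and using base R's `prcomp`. The two attributes `a1` and `a2` are synthetic and chosen to be correlated.

```r
# Two synthetic, correlated attributes (assumption: not data from the video).
set.seed(7)
a1 <- rnorm(300)
a2 <- 0.8 * a1 + rnorm(300, sd = 0.4)

# Standardize first so both attributes are on the same scale, then run PCA.
xy  <- scale(cbind(a1, a2))
pca <- prcomp(xy)

# First column of the rotation matrix: the direction with the most spread.
first_direction <- pca$rotation[, 1]

# Share of the total variance captured by each principal component.
variance_share <- pca$sdev^2 / sum(pca$sdev^2)

print(first_direction)
print(variance_share)
```

If the first component already captures most of the variance, projecting onto it loses little information, which is the alternative to deleting an attribute outright.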
  • 2021-08-24 #4 Data Transformation. People need to learn to use standardized measures for things. So take me, for example. When I drive anywhere, I drive in miles, I drive in miles per hour. My fuel economy is measured in miles per gallon, but of course, I don’t pump fuel in gallons, I pump it in liters. But when I run anywhere, so short distances, I run in kilometers and I run in kilometers per hour. So I’m using two different systems there. And any short distances I’m measuring are going to be in meters, not feet, right? So if I’m measuring, let’s say, around my house for painting, I’m going to measure in square meters, so I know how much paint to buy. But then if I’m selling a house, or I’m buying a house, I’m going to be looking at the size of the house in square feet. Again, who knows why. British people. If I’m baking anything, it’s going to be weight in grams or kilograms going into the recipe. But if I’m weighing myself, it’s going to be in stones and pounds. But of course a ton, for me, would be a metric ton, not an imperial ton. And as I said, I measure fuel in liters, and most of my liquids are measured in liters, except for beer and milk, which are in pints. So this is the kind of problem you’re going to be dealing with when you’re looking at data. You’re trying to transform your data into a usable form. Maybe the data is coming from different sources, and none of it goes together. You need standardized units and standardized scales, so we can go on and analyze it. So let’s think back. What we’re doing is trying to prepare our data into the densest, cleanest format for modeling or 
machine learning or some kind of statistical test, to work out what’s going on and draw knowledge from our data. So this is going to be an iterative process: we’re going to be cleaning the data, we’re going to transform the data, and then we’re going to reduce the data, and transforming data is what we’re going to do today. So let’s imagine that you’ve cleaned your data. We’ve got rid of as many missing values as possible, hopefully all of them, and we’ve deleted the instances and attributes that just weren’t going to work out for us. Now what we’re going to try and do is transform our data so that everything’s on the same scale and everything makes sense together. And if we’re bringing datasets together from different places, we also need to make sure all the units are the same and everything makes sense. There’s no point in trying to use machine learning or some clustering or any other mechanism to draw knowledge from our data if our data is all wrong. So today we’re going to be looking at census data. Now census data is a classic example of the kind of data you might look at in data analysis. It has got lots of different kinds of attributes, things that are going to need cleaning up and transforming. So we’re back in R, and we’re going to read the census data into census using read.csv. We’ve downloaded some census data that represents samples from the US population to begin with. We’re going to read that in, and you can see that we’ve got 32,000 observations and 15 attributes, or variables. So that’s the first step. Let’s have a quick look at just a little bit of 
it, and we can see the kind of thing we’re looking at. So we’re going to say head(census), and that’s just going to produce the first few rows, so we can see the kind of data. You can see we’ve got age, we’ve got what work classification that person has, their educational level, a numerical representation of whether they’re married or not, this kind of thing. So there’s a lot of different kinds of data here. Some of it is going to be nominal. So for example, this work class: state government, private employee. That’s a nominal value. We might have ordinal values or ratio values or interval values. We’re going to have to delve in a little bit closer to find out what these are. Now what we do to transform this data into a usable format for clustering or machine learning is going to depend on exactly what the types of these columns are and what we want to do with them. So let’s look at just a couple of the attributes and see what we can do with them, right? We’re going to use a process called codification. The idea is that maybe things like random forests or multi-layer perceptrons, you know, neural networks, aren’t going to be very amenable to text-based inputs. So what we want to do is try and replace these attributes with a numerical score. All right. So let’s look at, for example, the work class, and also, for example, the educational level. 
So, education. Now work class is the kind of class of worker that we’re looking at here. So for example a state worker, or someone in the private sector, or someone that worked in a school, or something like this. Now this is a nominal value. That means there’s no order to this data at all. We can’t say that someone in state is higher or lower than someone in private, and we also can’t say that, let’s say, state is two times more or less than some other one. That makes no sense at all. Alright. So we can replace this with numbers. Let’s say we could replace private with zero, state with one, self-employed with two, and so on, right? That’s a perfectly reasonable thing to do, but it’s still nominal data. So what we can’t do is then calculate a mean and say “ah, the mean is halfway between private and public”. That doesn’t make any sense. Just because something has been replaced by a numerical score doesn’t mean that it actually represents something that we can quantify in that way, right? 
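The codification of a nominal attribute can be sketched in R as below. This is a minimal illustration, assuming made-up `workclass` values; the 0/1/2 coding follows the example above, and the level order is an arbitrary choice that carries no meaning.

```r
# Hypothetical work class values (not rows from the real census file).
workclass <- c("Private", "State-gov", "Self-emp", "Private", "Private", "State-gov")

# A factor maps each label to an integer code; the order of levels is arbitrary.
wc_factor <- factor(workclass, levels = c("Private", "State-gov", "Self-emp"))
wc_code   <- as.integer(wc_factor) - 1L   # Private = 0, State-gov = 1, Self-emp = 2

# The mode (most common value) is meaningful for nominal data.
counts     <- table(wc_factor)
mode_value <- names(counts)[which.max(counts)]

print(wc_code)
print(mode_value)

# mean(wc_code) would run without error, but the result has no interpretation:
# the codes are labels, not quantities.
```

Keeping the factor around, rather than only the integer codes, makes it harder to average the codes by accident.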
It’s still nominal data. Okay, so I guess the best advice I can give is: feel free to codify your data into easy-to-read numbers, but just bear in mind that you can calculate the mode, you know, the most common value, but you can’t calculate the median and you can’t calculate the mean. Another example would be something like the educational level. Now theoretically this is ordinal data, so we could say that someone with an undergraduate degree is maybe slightly higher, in terms of the amount of time they spent in education, than someone with a high school diploma. But we don’t know exactly what the distance is, what the distance is between, let’s say, high school and a degree and then a PhD, and so on, an MD and things like this. We can represent these using numbers, and probably in order, right? So we could say that zero is no education, one is the end of primary school, two is the end of high school, and so on and so forth. But again, it’s difficult to calculate distances between these things. We can’t say that high school is two times more than primary school and half of a degree, or something like that. That doesn’t really make sense. So again, you might be able to calculate a median on this, or a mode, but you can’t calculate an average. You can’t say the average level of education is halfway between high school and undergraduate. That doesn’t make any sense either. So any kind of attribute that is nominal, or possibly ordinal, and is represented using text, we can codify so that it’s more amenable to things like decision trees, depending on the library 
you’re using, right? But you just have to be careful: all machine learning algorithms will take any number you give them, and you have to be careful that this makes sense to do. So what you would do is go through your data and begin to systematically replace appropriate attributes with numerical versions of themselves, remembering all the time that they don’t necessarily represent true numbers, you know, in a ratio or interval format. So for any text-based value, we’re going to start by replacing it, possibly, with numerical scores. What about the numerical values? Well, they might be okay, but the issue is going to be one of scale. You might find, for example, in this census data that one of the dimensions, or one of the attributes, is much, much larger than another one. So for example, this dataset has hours per week, which is obviously going to be somewhere between naught and maybe 60 or 70 hours for someone who has got, you know, a very strong work ethic, and salary, right? Or salary or income or any other measure of, you know, monetary gain. Now obviously hours per week is going to be in the tens, and salary could be into the tens of thousands, maybe even the hundreds of thousands. Those scales are not even close to being the same. That means if you’re doing clustering or machine learning on this kind of data, you’re going to find the salary is kind of overbearing everything, right? It’s going to be very easy for your clustering to find differences in salary, and harder for it to spot differences in hours, because they’re so small in comparison, right? 
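To make the scale problem concrete, here is a small hedged sketch in R. The three people and their values are invented for illustration; the point is only how Euclidean distance, which many clustering methods rely on, behaves before and after each column is rescaled to the range 0 to 1.

```r
# Three hypothetical people: B works three times the hours of A,
# while C earns only slightly more than A.
people <- data.frame(
  hours  = c(20, 60, 20),
  salary = c(30000, 30000, 31000),
  row.names = c("A", "B", "C")
)

# Raw Euclidean distances: the salary column dominates completely.
raw_d <- as.matrix(dist(people))

# Rescale every column to the range 0..1, then recompute the distances.
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
scaled <- sapply(people, rescale01)
rownames(scaled) <- rownames(people)
scaled_d <- as.matrix(dist(scaled))

print(raw_d["A", c("B", "C")])     # A looks far closer to B than to C
print(scaled_d["A", c("B", "C")])  # after rescaling both differences count equally
```

On the raw data a 1,000 difference in salary swamps a 40-hour difference in working time, which is exactly the "overbearing" effect described above.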
So we need to start to bring everything onto the same scale. The more attributes you have, which is another way of saying the more dimensions you have in your data, the further everything is going to be spread around. If we can scale all of these values to between, let’s say, around 0 and 1, then everything gets more tightly controlled in the middle, and so it gets much easier to do clustering or machine learning or any kind of analysis we want. So let’s look back at our data and see what we can do to try and scale some of this into the right range. We’re going to look back at the head of our data again. Our numerical values are things like the capital gain and the capital loss, which I guess is presumably how much money they’ve made or lost that year, probably normalized on some scale, and then things like the hours per week that they work, and their salary, which in this case is greater than or less than 50,000. So let’s have a quick look at the kind of range of values we’re looking at here, so we can see if scaling is even necessary. Maybe we got lucky and the person did it before they sent us the data. So we’re going to apply a function across all the columns, and we’re going to calculate the range of the data. So this is going to be apply on the census data over dimension 2, so that’s all of our columns, and we’re going to use the range function for this: apply(census, 2, range). This is going to tell us, for example, that the age ranges from 17 to 90 and the educational level from 1 to 16. It gives you the range for things like nominal values as well, but they don’t really 
make any sense. I mean, work class ranging from question mark to without pay, you know, is meaningless. And then, for example, capital gain ranges from zero to nearly one hundred thousand, and capital loss from zero to four thousand. And finally the hours per week ranges from 1 to 99. So you can see that the capital gain is many orders of magnitude larger in scale than the hours per week. We’re going to need to try and scale this data. To begin with, to make our lives a little bit easier, let’s just focus on the numerical attributes, so we don’t have to worry about the nominal values, which we’ve not codified yet. We’re going to select all the columns from the data where they are numeric. So that’s this line here, and I’ll paste that down here. So we’re going to use sapply, which applies is.numeric over each of the fields, sapply(census, is.numeric), and that’s going to give us a logical list that says true or false depending on whether those columns are numeric. What we’re doing here is selecting from this list any entry that is true and then finding their names. So what are the names of the columns that are numeric? 
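Selecting the numeric columns and checking their ranges can be sketched as below. This is a minimal sketch under stated assumptions: the four-row `census` data frame and its column names are hypothetical stand-ins for the real file.

```r
# Tiny hypothetical stand-in for the census data (column names are assumptions).
census <- data.frame(
  age            = c(39, 50, 17, 90),
  workclass      = c("State-gov", "Private", "?", "Without-pay"),
  capital_gain   = c(2174, 0, 0, 99999),
  hours_per_week = c(40, 13, 1, 99),
  stringsAsFactors = FALSE
)

# Logical vector: TRUE for every column that holds numbers.
is_num <- sapply(census, is.numeric)

# Names of the numeric columns only.
numeric_cols <- names(census)[is_num]

# Minimum and maximum of each numeric column.
ranges <- sapply(census[numeric_cols], range)

print(numeric_cols)
print(ranges)

# Note: apply(census, 2, range) on the full data frame first converts everything
# to text, so even the numeric columns are compared as strings there.
```

Restricting to the numeric columns first avoids the meaningless "range" of a text column such as `workclass`.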
So let’s have a look at just the range of these attributes, to make our life a little bit easier. I’m going to run this line, and this is a simplified version of what I was just showing. You can see that capital gain is massive compared to the hours per week, for example. Let’s have a look at the standard deviation. Recall that the standard deviation is, roughly, the average distance from the mean, so it gives us an idea of the spread of some data, right? Is it very tight, with everyone earning roughly the same, or is it very spread out, with huge deviations? And the answer is there are pretty huge deviations. The age has a standard deviation of 13, so that means that most people are going to be kind of in the middle, and on average they’re going to be 13 years younger or older. But you can see that things like capital gain have a standard deviation of over 7,000, which is a huge amount. To give you some idea of what we’re aiming for, it’s very common to standardize this kind of data so that the standard deviation is 1, right? So 7,000 is much too big. Let’s plot an example to give you some idea of the kind of problem we have with these massive ranges. I’m going to plot a graph of age versus capital gain. We know age goes between about one and a hundred, and capital gain is much, much larger. So if I run this, basically the figure makes no sense at all, because the capital gain ranges from zero to one hundred thousand, and as there are a few people earning right at the top of the scale, everything else is sort of squished down at the bottom. We can’t see anything that’s going on. There’s 
no way of telling whether the capital gain of an individual is related to their age. I mean, it probably is, right? Because retired people, and people who are very young, perhaps earn slightly less. We can’t really see that here, because it’s just too compressed. We need to start trying to bring these things together so that we can perform better analysis. What we’re going to do is create a new data frame with just the numerical attributes we want to focus on, just to make our life a little bit easier, and then we’re going to write a normalize function to move all our data to between 0 and 1, and we will do this per attribute. So for example, if you’ve got some data which goes between a minimum and a maximum and we want to scale this data to between 0 and 1, all we need to do is, first of all, take away the minimum, and that’s going to move everything to be from 0 to max minus min. And then we’re going to divide by this distance here, so this is max minus min. And if we divide by this, everything is going to go from 0 to 1. So that’s exactly what we’re doing in this function here. It’s a function of x: it subtracts the minimum of x, (x - min(x)), and then divides by the difference between the maximum and the minimum, (max(x) - min(x)). So this is very standard. 
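The min-max normalization just described can be written out as a short R sketch. The function body follows the description above; the `census_num` data frame and its values are hypothetical.

```r
# Min-max normalization: shift so the minimum is 0, then divide by the range.
# Assumption: x is numeric and not constant (max(x) > min(x)), otherwise
# the division would be by zero.
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Hypothetical numeric-only subset of the census data.
census_num <- data.frame(
  age          = c(17, 39, 50, 90),
  capital_gain = c(0, 2174, 0, 99999)
)

# Apply the function to every column (dimension 2 means "per column").
norm_census <- as.data.frame(apply(census_num, 2, normalize))

# Every column should now run from 0 to 1.
print(apply(norm_census, 2, range))
```

Because the minimum and maximum are taken per column, each attribute is rescaled independently of the others.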
So I’m going to run this. R lets you write functions like this and then apply them over your data. So we’re going to calculate a normalized census dataset, which is where we apply, over dimension two, this normalize function we just wrote. And now if we look at the range, we’ll see that our range is between 0 and 1 for all of our data, which is exactly what we want. Normalization is a perfectly good way of handling your data. If everything is between 0 and 1, we have fewer problems with the scale of things being way off, right? Now some statistical techniques, like PCA, that we’re going to talk about in another video, require standardized data. That’s data that is centered around zero: it has a mean of zero and a standard deviation of one. Now we can standardize data pretty easily in the same way. Actually, we don’t need to write our own function for this; the scale function in R performs this for us. So we’re going to take the census data over the numerical attributes and call the scale function. That’s going to take all of the attributes and center them around their mean, so the mean will become close to zero, and it’s going to divide them all by the standard deviation, so their standard deviation becomes one. So if we run that and then have a look at the mean of this data, for example here, we calculate the mean, and you can see that the mean values are very, very close to zero. That’s 10 to the minus 17 or something like that, very, very small. And if we look at the standard deviation, similarly, they’re all going to be 
1. Alright, so this is now standardized data. This is a very good thing to do if you want to use your data in some kind of machine learning algorithm or some kind of clustering. Let’s imagine now that we want to join some datasets together. So we’ve standardized the data: everything’s between 0 and 1, or it’s centered around 0 with a standard deviation of 1, and we’ve codified some attributes. What happens if we get other data from other sources? You can imagine that census data from the US might be a bit useful. But maybe we want census data from Spain, or from the UK, or from another country. Can we join all of these together to get a bigger, more useful dataset? Alright. Now the thing to think about when you’re doing this is just to make sure that everything makes sense, right? Are the scales the same? Are they all normalized, or are none of them normalized? 
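The standardization step mentioned above, using R's built-in `scale` function, can be sketched as follows. The five rows of `census_num` are made up; only the call to `scale` and the checks on mean and standard deviation mirror the description.

```r
# Hypothetical numeric-only subset of the census data.
census_num <- data.frame(
  age            = c(17, 39, 50, 90, 28),
  capital_gain   = c(0, 2174, 0, 99999, 0),
  hours_per_week = c(1, 40, 13, 99, 40)
)

# scale() subtracts each column's mean and divides by its standard deviation.
std_census <- scale(census_num)

# Column means come out as tiny floating-point values around zero,
# and the standard deviations come out as one.
col_means <- colMeans(std_census)
col_sds   <- apply(std_census, 2, sd)

print(col_means)
print(col_sds)
```

The means are not exactly zero because of floating-point rounding, which is why values on the order of 10 to the minus 17 show up.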
Because otherwise, what you’re going to be doing is adding, you know, pay between naught and a hundred thousand to something between naught and one, and nothing makes any sense anymore. You’re going to wreck your data. So let’s have a look at this on the census dataset. We have some Spanish census data in a very similar format to our census data from the United States. Let’s have a quick look. So I’m going to read the CSV file of Spain data. Let’s remind ourselves of the columns that we had in our census data from the United States. These are the numerical columns, so we have age, education number, capital gain, capital loss, this kind of thing. Let’s look at the Spanish dataset to see if we can just join the two together. So I’m going to run head(Spain), and that’s going to give us the first few rows. You can see that some of the stuff in there is as it was before, so things like what their level of education is, whether they work in the private sector or the public sector, right? We’re going to need to remove these things to keep just the numerical attributes. And the other problem is, if you look carefully, you’ll see that the capital gain in the Spanish dataset is in euros, not in dollars, right? Now that’s a huge problem. They’re not massively different, obviously; they’re on the same order of magnitude. But we don’t want to be jamming capital gain in euros next to dollars, because those two scales are not the same, right? So what we need to do first is scale this data using some kind of exchange rate. So here what we’re going to do is create a new 
column in Spain. So given the Spain data frame, we’re going to say the Spain capital gain is equal to the euro capital gain times 1.13, which is the exchange rate we’re going to use. Now it’s quite important in this kind of situation not just to look up the exchange rate online. You’ve got to consider that this might have been collected a while ago. What was the exchange rate when this data was collected? These are things you’re going to have to think about. So let’s run that line, and let’s do the same thing for the capital loss. Now we’re going to keep just the numerical attributes of our census data and of the Spanish data, and we’re also going to add another column, which is what country they come from; otherwise we’re not going to know. So we’re going to use the cbind function to combine the numerical attributes of the census data with the native country, which in this case will be the United States. We’re going to do the exact same thing for the Spain data, which will be basically exactly the same, except obviously we’re going to have Spain as the native country. And then we’re going to use the rbind function to just join those two tables together. Now that will only work if those two datasets have the exact same attributes. ‘nu_census’ is not found. What did I do wrong? So I had a typo. So let’s join these two together using rbind. There we go. 
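The conversion and joining steps above can be sketched in R as below. This is a sketch under stated assumptions: the two small data frames, their column names and the `native_country` labels are hypothetical, and 1.13 is simply the rate quoted above, not a recommendation.

```r
# Exchange rate quoted in the walkthrough; in practice use the rate that
# applied when the data was collected.
eur_to_usd <- 1.13

# Hypothetical numeric-only extracts of the two datasets (same columns in both).
census <- data.frame(age = c(39, 50), capital_gain = c(2174, 0), capital_loss = c(0, 0))
spain  <- data.frame(age = c(45, 31), capital_gain = c(1000, 0), capital_loss = c(0, 200))

# Convert the Spanish monetary columns from euros to dollars.
spain$capital_gain <- spain$capital_gain * eur_to_usd
spain$capital_loss <- spain$capital_loss * eur_to_usd

# Add a column recording where each row came from (cbind adds columns).
us_part <- cbind(census, native_country = "United-States")
es_part <- cbind(spain,  native_country = "Spain")

# Stack the two tables (rbind adds rows); both must have identical columns.
united <- rbind(us_part, es_part)
print(united)
```

Tagging each row with its country before stacking is what keeps the origin recoverable once the tables are merged.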
And so our united dataset now has the combined observations for the United States and Spain. Now, what you wouldn't want to do is just join them together and leave it at that. You want to perhaps have a little look at some plots to make sure that the distributions of the data you've just joined together make sense. For example, the United States data has a nice broad distribution of different ages. We want to make sure that the Spanish data has that same distribution; otherwise, you're going to skew your dataset. So, for example, let's have a look at whether the levels of capital gain are approximately the same for both the United States and the Spanish dataset. So I'm going to use ggplot for this. We're going to plot a bar chart where we've color-coded United States and Spain, and you can see that, broadly speaking, there's a lot around zero or less than 50k, and then there's a few a little bit above. All right, so that looks, broadly speaking, like the same distribution. I'm fairly happy with that. This is going to be a judgment call when you get your own data. So I'll clear the screen, and then let's have a look at the next plot. So the next plot is going to be capital loss versus the native country. Let's make sure those distributions are the same. So it's plotting there, and broadly speaking again, yes, the majority are down at the bottom, and then there's a few United States ones and a couple of Spanish ones up at the top as well. Again, it's not a disaster. That's probably OK. Finally, let's have a look at ages by native country. So if we
plot this, we can see two very, very similar distributions. You can see that it's essentially a bell curve, maybe slightly skewed towards older participants for the United States, and very, very similar for Spain. This is OK. If we hypothesized that capital gain, capital loss and salary had something to do with your age, then it would make sense for the two datasets you're joining together to have very similar distributions in this regard. So let's look at one more dataset, from Denmark. All right, so it's the same thing, same format. We're going to read the CSV, and we're going to have a look at just the top few rows to make sure it's in the same format, so that's using the head function, and you can see we've actually already removed the nominal and other text attributes from here and we've just got the numerical ones. And capital gain and capital loss are already in dollars in this dataset, so we don't have to perform a conversion. So we can use rbind to put these two things together, and now we just need to check the distributions are the same. So again, we're going to plot the age against the native country, and see if these have the same distributions. And actually you can see this isn't looking too good. The United States and the Spanish datasets have very similar distributions. The participants, or the people who have been polled, from Denmark are much, much older on average. This could have an effect on things like capital gain, so I wouldn't necessarily feel comfortable just joining this dataset in without thinking about it a little bit more
closely. All right, so whenever you're joining datasets like this, taking data from different sources, think carefully to make sure that it's fair and that what you are doing is a reasonable concatenation of datasets. And actually these are the features that power Spotify's recommender system and numerous others. So we've got things like acousticness: how acoustic does it sound, from zero to one? We've got instrumentalness. I'm not convinced that's a word. Speechiness: to what extent is it speech or not speech? And then things like tempo…
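The census transcript above converts the Spanish money columns to dollars, tags each table with its country, and only then stacks the two tables. The video does this in R with cbind and rbind; what follows is only a minimal sketch of the same steps in plain Python. The row values, the column names and the helper names (`to_dollars`, `with_country`, `row_bind`) are assumptions made for illustration; only the 1.13 rate comes from the transcript.

```python
# Hypothetical rows: the real census values are not shown in the transcript.
EUR_TO_USD = 1.13  # rate quoted in the video; in practice use the rate from when the data was collected

us_rows = [
    {"age": 39, "capital_gain": 2174.0, "capital_loss": 0.0},
    {"age": 50, "capital_gain": 0.0, "capital_loss": 0.0},
]
spain_rows = [
    {"age": 41, "capital_gain": 1000.0, "capital_loss": 200.0},  # amounts in euros
]


def to_dollars(rows, rate):
    """Return copies of the rows with both money columns multiplied by the rate."""
    converted = []
    for row in rows:
        new_row = dict(row)
        new_row["capital_gain"] = row["capital_gain"] * rate
        new_row["capital_loss"] = row["capital_loss"] * rate
        converted.append(new_row)
    return converted


def with_country(rows, country):
    """Add a native_country column to every row (the column-bind step)."""
    return [{**row, "native_country": country} for row in rows]


def row_bind(first, second):
    """Stack two tables (the row-bind step), refusing if their columns differ."""
    columns = set(first[0])
    for row in first + second:
        if set(row) != columns:
            raise ValueError("datasets must have exactly the same attributes")
    return first + second


combined = row_bind(
    with_country(us_rows, "United-States"),
    with_country(to_dollars(spain_rows, EUR_TO_USD), "Spain"),
)
```

The explicit column check mirrors the point made in the video that the join only works when both tables have exactly the same attributes. Checking that the joined distributions look alike is a separate, manual step.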
  • 2021-08-24 Unity 5 Official Tutorial #3: The Camera doesn't move, and from its current position it cannot see very much. We need to tie the Camera to the Player GameObject. First, let's set the position of the Camera. Let's lift it up by 10 units and tilt it down by about 45 degrees. Next, let's make the Camera a child of the Player GameObject. This is a typical third-person setup. With the Camera as a child of the Player, when we move the Player's position the Camera moves with it. When the Player rotates, the Camera rotates as well. Let's look at this from a position where we can see both the Player and the Camera GameObjects. Move the Player, rotate the Player, and the child Camera moves with it. Now let's reset the Player and test. We enter Play mode, hold down the up arrow to move... Whoa! What's happening here?
Okay, well, as the Camera is a child of the Player's Sphere, even though the Camera is not moving at all relative to the Player GameObject, the Player GameObject is rotating like crazy, so the Camera's point of view rotates with it. Let's exit Play mode. Unlike a normal third-person game, our Player GameObject is rotating on all 3 axes, not just 1. In a typical third-person setup, the Camera, as a child of the Player GameObject, will always be in a position relative to its immediate parent, and this position will be the parent's position in the game, modified or offset by any values in the child's Transform. We can't have the Camera as a child of the Player, so let's detach it. Our offset value will be the difference between the positions of the Player GameObject and the Camera. Now we need to associate the Camera with the Player GameObject, not as a child, but with a script. Using the Add Component button, choose New Script. We are writing in C#, and we name the script "CameraController", and then click on Create and Add, or simply hit the Return or Enter key to confirm our selection. We should note that this way of creating a script will create that Script Asset on the root or top level of our Project view. File CameraController in the Scripts folder and open it for editing. We need 2 variables here: a public GameObject reference to the Player, and a private Vector3 to hold our offset value. Offset is private because we can set that value here in the script. For our offset value we will
take the current Transform position of the Camera and subtract the Transform position of the Player to find the difference between the two. So in Start() we can make offset equal to our "transform.position - player.transform.position". And then every frame we set our "transform.position" to our "player.transform.position + offset". This means that as we move our Player with the controls on the keyboard, each frame, before displaying what the Camera can see, the Camera is moved into a new position aligned with the Player object, just as if it were a child of that object, if that object were not rolling around the game board. However, Update() is not the best place for this code. It is true that Update() runs every frame, and in Update() each frame we can track the position of the Player GameObject and set the position of the Camera. However, for follow cameras, procedural animation and gathering last known states, it's best to use LateUpdate(). LateUpdate() runs every frame, just like Update(), but it is guaranteed to run after all items have been processed in Update(). So when we set the position of the Camera, we know absolutely that the Player has moved for that frame. So let's test this. Let's save our script and return to Unity. First we need to create a reference to the Player GameObject by dragging the Player GameObject into the Player slot in the CameraController component. Enter Play mode. And now we get the behavior we
want. The Camera follows the rolling ball without rotating, even as the ball goes over the edge. In the next assignment, we will set up the basic play area and create and place our special pickup objects.
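The follow-camera logic in the Unity transcript above comes down to two pieces of vector arithmetic: record the offset once at start, then re-apply it after the player has moved each frame. The real script is a C# component inside Unity; the block below is only a language-neutral sketch of that arithmetic in plain Python. The class name `FollowCamera`, the method `late_update` and the starting positions are assumptions for illustration and are not Unity API.

```python
def subtract(a, b):
    """Component-wise a - b for 3D positions stored as tuples."""
    return tuple(x - y for x, y in zip(a, b))


def add(a, b):
    """Component-wise a + b for 3D positions stored as tuples."""
    return tuple(x + y for x, y in zip(a, b))


class FollowCamera:
    """Keeps a fixed positional offset from a player and ignores the player's rotation."""

    def __init__(self, camera_position, player_position):
        self.position = camera_position
        # Equivalent of the Start() step: offset = camera position - player position.
        self.offset = subtract(camera_position, player_position)

    def late_update(self, player_position):
        # Equivalent of the LateUpdate() step: runs once the player has moved this frame.
        self.position = add(player_position, self.offset)


# Hypothetical starting positions: camera above and behind a ball sitting at the origin.
camera = FollowCamera(camera_position=(0.0, 10.0, -10.0), player_position=(0.0, 0.5, 0.0))

# The player rolls to a new spot; the camera keeps the same offset.
camera.late_update((3.0, 0.5, 2.0))
```

Because only positions are combined, the ball's spin never reaches the camera, which is the reason the transcript gives for replacing the parent-child setup with a script.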
  • 2021-08-24 #3 Data Cleaning: Well, we're looking at chocolate datasets today, so I thought I'd bring some research. Yeah, good, and definitely relevant. We've been looking at techniques like data visualization to try and explore our data and start to draw some initial conclusions or hypotheses. We're going to start to move towards modeling our data and actually trying to extract proper knowledge from this data, because remember, just because we've got data doesn't mean we've got knowledge. Now, this is going to be an iterative process. We're going to need to clean up our data to make it as useful as possible. We need to transform it so that we can combine datasets together and statistically measure our datasets. And then we're going to need to reduce it sometimes, if our dataset is too big and unmanageable. And this combination of cleaning data, transforming data and reducing data is a kind of cycle, where we're going to iterate until our dataset is in the smallest, most useful form it can be. So if we've got redundant variables, which are basically the same as others, or we've got duplicates, these are all problems that we need to sort out. Because if we're going to be analyzing data with these kinds of issues, we're just making our life more difficult. It's computationally inefficient, and, in the worst case, we could draw the wrong conclusions. You might be surprised and disappointed when you get your first dataset that actually it's not quite as nice as you were hoping. It's going to need some cleaning up. Maybe there are missing
values. Maybe there are outliers that need to be dealt with, because they're warping your distributions and your medians and means. And perhaps you've also got noise in your dataset. These are a few things we can start to address with cleaning. So cleaning data is essentially the idea of trying to correct or fill in any missing values, or remove those bits completely. You might be surprised that there's missing data at all. I mean, what, are we not paying attention? We've got one job, and that was to collect the data, and we seem to have got missing data. But actually it's quite common, because, for example, if you're trying to track patient records over time, maybe a patient didn't show up to an appointment, or maybe they were in hospital but weren't there when their temperature needed to be taken, and then the trend line of their temperature over time is going to be missing some values. Maybe you've got data going back hundreds of years, and they didn't have certain techniques and certain measurement systems back then, so they only have other kinds of data. So missing data is very common. We're going to have to be able to deal with it. So the dataset we're looking at today is a set of ratings for chocolate bars. This is why I ate my chocolate, or at least that's what I'm telling myself. So we'll read the data in. We've got ten different variables. We've got nearly 1,700 observations, and let's have a quick look using summary. So we've got things like the company who produce the chocolate, the name of the chocolate, reviews, the
cocoa percentage, the type of bean, this kind of information. And you can imagine that what you might do, if you were trying to produce better chocolate, is look at a huge amount of this kind of data and work out what it is that customers like and what it is they don't like. This is going to be quite common in market research. So, the first thing we're going to do: we've received this data. We know now what the columns are, but we don't really know anything else other than this. So we're going to have to start looking through and seeing, first of all, is there any missing data? So we're going to use the sapply function for this. The sapply function will apply a function over our dataset, so for each column, or each attribute, of our data we're going to apply this. And the function we're going to use we're writing ourselves. It's going to be the sum of all the times our item is either blank or NA. Now, blank means an empty string, and NA literally means "not available", which is something that comes up in data from time to time. All right, so in any case both of these are missing values, so we're going to treat them both the same. So if we apply this to our chocolate dataset, then we're going to see that, for example, there are eight missing names, there are seven missing review dates, and there are four missing cocoa percentages. So there are four rows in our data where the cocoa percent is missing. That's not too bad. Four, I mean, this is a dataset of nearly 1,700 items. Four is not too bad. That's quite expected. You might imagine that if you're
pooling this data from lots of different sources, people are going to forget to add data in, or they weren't able to record data on that day. There's a huge number of reasons why you might have missing data. Now, it starts to become a bit of a problem when we look at things like bean type, because bean type has got 1,200 missing values. That's a huge proportion of the dataset. And in that case we might have to do something about this. So the only issue we've got is that 1,200 is just a number of rows. It's not relative to the size of the dataset. So we're going to use the exact same function, counting any empty rows, but this time we're going to divide by the total number of rows, so we can get a percentage for how much missing data we've got. So we can see, for example, that company name has zero missing data, whereas bean type has 74 percent missing data. So that's a huge problem. Now, a general rule of thumb is that if over half your data is missing, it's going to be quite hard to estimate or guess what that data is going to be. That's when you might want to start thinking about removing it. So what we want to do is extract the names of any of our attributes that have over, let's say, 60% missing. So we're going to start by calculating all the percentages and saving them in a variable. And then we're going to select only those percentages where the value is over 0.6, that is, 60 percent. So we're going to say any attribute where the value is over 0.6, and that is just bean type, at 0.74, or seventy-four percent.
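The per-column check described above (count the blank or NA cells in each attribute, divide by the number of rows, then pick out the attributes above 0.6) is done in the video with R's sapply. As a rough illustration only, here is the same idea in plain Python, with `None` standing in for NA. The four sample rows, the column names and the helper names are made up for the sketch.

```python
def is_missing(value):
    """Treat None (standing in for NA) and the empty string as missing."""
    return value is None or value == ""


def missing_fraction_by_column(rows):
    """Map each column name to the fraction of rows where that column is missing."""
    columns = rows[0].keys()
    return {
        column: sum(is_missing(row[column]) for row in rows) / len(rows)
        for column in columns
    }


# Hypothetical sample: the real dataset has nearly 1,700 rows and ten columns.
chocolate = [
    {"company": "A", "cocoa_percent": 70, "bean_type": ""},
    {"company": "B", "cocoa_percent": None, "bean_type": None},
    {"company": "C", "cocoa_percent": 65, "bean_type": "Criollo"},
    {"company": "D", "cocoa_percent": 72, "bean_type": ""},
]

fractions = missing_fraction_by_column(chocolate)

# Attributes with more than 60% of their values missing are candidates for removal.
too_sparse = [column for column, fraction in fractions.items() if fraction > 0.6]

# Drop those attributes from every row.
cleaned = [
    {column: value for column, value in row.items() if column not in too_sparse}
    for row in chocolate
]
```

The 0.6 cut-off is the speaker's rule of thumb, so it is worth treating as a parameter to adjust per dataset.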
So we can now delete bean type ourselves. We could say something like: choco, all the rows, for bean type, is NULL. Setting that to NULL is just going to delete that column. We can also do it automatically, so we could actually pass in those attributes that we just calculated as a parameter. So that would be this line here. It'll be something like: choco, all rows, and then the names of any attributes where the percentage missing is greater than 0.6. And that's going to just delete bean type. There's not a lot we can do about bean type. We've only got 25%-ish of the data. It's not enough to start guessing what bean types are going to be in other chocolate bars. Let's have a look now at our rows of data. For each instance there are going to be a number of attributes; now there are nine left. And we really want to keep the instances that have the majority of their data. So we're going to use apply, and this is going to be row-wise, so dimension one, that's the rows. We're going to count any blank or NA for each row over our dataset, and we're going to put this into missing. So what it is going to do is return a list of values, one for every single row, that tells you how many missing items there are in that row. So we can now look at the first few missing items. We're going to order them, largest first, and then we're going to show just the first few. And you can see that actually some of them are missing seven or six attributes. That's quite a serious situation, because there were only nine. So essentially they've only got a couple of entries in their fields. Now let's
do this again as a percentage of the number of attributes. So this is exactly the same thing, but this time we're dividing by the number of columns, which is nine, and we're going to have a look at the top of these. And so you can see that some of these first rows are missing 77% of their attributes. That's a real problem. Missing is the same length as the number of rows we've got, so we can actually look up any rows where there's a greater percentage of missing values than we want and just remove them from the dataset. The way we're going to do that is a bit like this. We can say choco is choco, anywhere where missing is less than nought point seven, and then all the columns. And what that's going to do is select only the rows we want, where they've got a nice amount of data. So the choco dataset is going to be a little bit smaller now, but much more useful to us. We don't really want to be trying to do things like machine learning or statistics when 70% of some of the data is missing. That isn't going to be a good idea. So it's quite easy just to delete data, right?
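The row-wise filter just described (work out the fraction of missing attributes per row, then keep only the rows below 0.7) could be sketched as follows. The video uses R's apply over dimension one; this is a plain Python stand-in with `None` playing the role of NA, and the three sample rows and their four columns are hypothetical.

```python
def is_missing(value):
    """Treat None (standing in for NA) and the empty string as missing."""
    return value is None or value == ""


def missing_fraction_by_row(rows):
    """Return, for each row, the fraction of its attributes that are missing."""
    return [sum(is_missing(value) for value in row.values()) / len(row) for row in rows]


# Hypothetical rows with four attributes each.
rows = [
    {"company": "A", "name": "Bar 1", "rating": 3.5, "cocoa_percent": 70},    # nothing missing
    {"company": "", "name": None, "rating": None, "cocoa_percent": 65},       # 3 of 4 missing
    {"company": "C", "name": "Bar 3", "rating": None, "cocoa_percent": 72},   # 1 of 4 missing
]

missing = missing_fraction_by_row(rows)

# Keep only the rows where less than 70% of the attributes are missing.
kept = [row for row, fraction in zip(rows, missing) if fraction < 0.7]
```

Since `missing` has one entry per row, pairing it with the rows through `zip` plays the same role as indexing the data frame by a logical vector in the R version.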
I mean, in some sense, it's just more convenient to do that. In general the rule is that if you've got more than 50% or 60% missing data, it's a good idea to delete it. Delete either the instances or the attributes, depending on how much data you've got missing and where. If you're missing a huge amount of data, then you're not going to be able to recreate it by, let's say, using an average. We've got so little data that an average isn't going to be reliable. If we have got sufficient data, we could maybe start to infer what these missing values might be. We can start to try and replace them instead of deleting them. So what we might do, for example, is set them all to zero. Maybe if an attribute is missing we can say, well, OK, if it's missing, we'll just not use it, and we'll say it's zero. Now, whether you do that is going to depend on what the attribute is. Sometimes zero is not a useful value, and we'll look at that in the chocolate dataset in a moment. What we might also do is put the dataset mean into those attributes. So maybe we don't know what the rating for this chocolate bar is, but we can guess that it's going to be around the average rating for any chocolate bar. Again, this is going to depend on your situation. You're still making up data in some sense. You've got to be very careful about what you do here. So we've deleted as much of our choco data as we feel comfortable doing now. Now let's see if we can fill in some of the missing values with appropriate replacements. So let's have a look at our attributes.
All right, so we've got company, name, reference, things like this. Bean type has been removed, but we've still got things like the bean origin and the ratings, and there are a few of these missing from our dataset. Can we estimate these rather than completely removing them from the dataset? Obviously, the less data you use, the less useful things like machine learning are going to be. So let's look at an attribute and see what we can do. So if we look at bar price, and that's the price of each chocolate bar, we can see that there are a few missing values, somewhere around 3%. That's something we want to deal with. But we've got enough data, you know, 97%, so maybe we can start to guess what the prices of these chocolate bars might be. Now, this is a good instance of a time when you wouldn't want to just populate with zeros, right?
No chocolate bar is free, I wish. And so what we need to do is produce a reliable value to represent an unknown price, rather than just setting them all to be zero. So what we could do here is something like this. We could set every missing bar price to be the average of all the chocolate bar prices, and that way at least we're not shifting the mean of our distribution up or down. The mean stays exactly the same. We're going to say, for the chocolate dataset, for any row where bar price is NA, we're going to set the bar price to be the mean of all the bar prices. And we're obviously going to remove any NAs from that calculation, or it's not going to work. And that's already worked. So now if we recalculate our missing values, you'll see that bar price now has zero missing values. So we've fixed that problem. Great. So that was quite an easy one. Bar price seems to me to be quite an intuitive case where you would just calculate an average and put it in. Actually, maybe not, because, you know, bar price might depend on where in the world we're selling it, or what company is producing the chocolate bar. So could we do the same thing for rating? If we take the sum of all the NA values in rating, it's eight.
Right, so there are eight chocolate bars for which there is no rating. So what we could do is something called a stratified replacement. We could say, well, let's group our chocolate bars by country or by company, calculate those averages, and then we can specifically fill in a company's missing ratings based on how that company actually performs in the market, rather than just an average over everything. So what we're going to do is use the aggregate function over the ratings by company, and we're going to calculate a median. The median is a little bit more robust to outliers. So maybe you make a very, very expensive or very, very cheap line; the median will still get the middle value. So this is going to be per company, and we can set the column names to be a little bit more helpful using colnames. And so now our per-company table, if we look at it, is each company and the median rating of its chocolate bars, from, I think, one to five. That's how this dataset is laid out. So now we know that data per company, we can actually fill those in. Now, you could automate this process. We don't have much missing data, so let's just show an example of doing it by hand. So this is the line of code we're going to run, and I'll talk through it. We're going to say, for the chocolate dataset, for any row where the rating is NA, that is, missing, and the company is Vicuna, we want to set the rating to be equal to the Vicuna entry in our new per-company median table. And that's going to fill in that value there. So we do this for all the companies with missing ratings, and what we're
going to find is that we've replaced all our missing values with appropriate per-company medians for those ratings. So the last thing we might find in our data is outliers. So let's imagine we do a box plot of cocoa percentage. So I'm going to produce a box plot of cocoa percentage. Now, maybe our assumption is that cocoa percentage in some way informs what the rating is going to be, because maybe a higher cocoa percentage tastes nicer. I don't really know about chocolate. So if we look at this box plot, what we'll see is we've actually got quite a tight distribution of cocoa percentage, between about 50% and just above 80%. But you can see there are three outliers. When it produces a box plot, R by default shows as an outlier anything that lies more than 1.5 times the interquartile range beyond the quartiles. What we do with these outliers is going to be a judgment call; it's going to depend on the situation. So, for example, we have an outlier here which is above a hundred percent. Now, that makes no sense. We can't have a chocolate bar with more than a hundred percent cocoa. So that is obviously a mistake. We would delete that item, and probably delete the whole row, or re-estimate that value based on a stratified average or a different average. For these lower ones, this is a judgment call. One is just above 20% and one is up closer to 30%. I don't know whether those are outliers or not. Is it possible to make a viable chocolate bar with 20% cocoa?
I mean, maybe. You're going to have to know a little bit about the situation that your data was collected in, and whether that's reasonable. So you might, for example, delete the bottom one as a mistake but keep the top one, because that's just a low amount of cocoa. So this is what cleaning data is about. We're going to have missing data, we're going to have outliers, we might have noise, and you're going to have to look through your data and try to condense it and remove all these problems. We do this so that we can more effectively transform our data later, and also reduce our data if we need to. And then eventually your dataset is going to be really nice, so that we can do things like modeling or machine learning on it. …per hour. My fuel economy is measured in miles per gallon, but of course I don't pump fuel in gallons, I pump it in liters. And then when I run anywhere, so short distances, I run in kilometers and I run in kilometers per hour. So I'm using two different systems there.
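The data-cleaning transcript above describes two replacement strategies: filling missing bar prices with the overall mean, and filling missing ratings with the median for the same company (the stratified replacement). The video does both in R, using aggregate for the per-company medians. The following is only a small plain Python sketch of the two ideas, with `None` standing in for NA. The sample rows, prices and function names are invented for illustration; "Vicuna" is the company named in the video.

```python
from statistics import mean, median


def fill_with_mean(rows, column):
    """Replace missing values in a column with the mean of the known values."""
    known = [row[column] for row in rows if row[column] is not None]
    average = mean(known)  # missing values are excluded from the calculation
    return [
        {**row, column: average if row[column] is None else row[column]}
        for row in rows
    ]


def fill_with_group_median(rows, column, group_column):
    """Replace missing values with the median of the known values in the same group."""
    groups = {}
    for row in rows:
        if row[column] is not None:
            groups.setdefault(row[group_column], []).append(row[column])
    medians = {group: median(values) for group, values in groups.items()}

    filled = []
    for row in rows:
        # Rows whose group has no known values at all are left untouched.
        if row[column] is None and row[group_column] in medians:
            row = {**row, column: medians[row[group_column]]}
        filled.append(row)
    return filled


# Hypothetical sample data.
bars = [
    {"company": "Vicuna", "rating": 3.0, "bar_price": 4.0},
    {"company": "Vicuna", "rating": 3.5, "bar_price": None},
    {"company": "Vicuna", "rating": None, "bar_price": 6.0},
    {"company": "Other", "rating": 2.0, "bar_price": 2.0},
]

bars = fill_with_mean(bars, "bar_price")
bars = fill_with_group_median(bars, "rating", "company")
```

Both helpers return new rows instead of editing in place, which keeps the original table available for comparing distributions before and after filling. As the speaker notes, either approach still invents values, so the choice depends on the attribute.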
  • 2021-08-24 #1 What Exactly Is Data? What is data? Right. I'm pretty sure that's data, right? Is this data? This picture? Or that? Is this data? What is data? (What is data? Computerphile) So we talked a lot about data in the last video: why is it important that we can analyze and understand data? But what is data? Everybody has data; everybody's generating it. Companies are generating it on us. We're generating it ourselves, you know, when we use social media and so on. But what is it? Understanding what it is is a prerequisite for being able to use it properly. Perhaps the most important thing as far as we're concerned, as people who are trying to analyze data sort of scientifically, is that the data has to be measurable. So the idea is, if you're going to do a survey on what people like, everyone's got to be using the same scale and the same rating system. Otherwise, it doesn't make any sense. We can't have someone rating things from one to five and someone else saying "I thought it was good", right?
Because which one of one to five is "good"? We don't know. So everyone has got to be doing the same thing, and your data's got to be in a consistent format. Once that's achieved, at least we're a little bit closer to being able to make some sense of it. Broadly speaking, when we talk about data we have four different types, and we summarize this with this nice word, NOIR: N, O, I, R. With each of these different types of data we can do different things. So N, that's the first type: this is nominal data. Nominal data is where we have no distance between the values that we can measure, because they're not really quantities and we can't order them. A good example would be colors. Maybe your favorite color is red and my favorite color is blue. I don't know which is better than the other. There is no measurement between them. Is blue closer to green? Does it matter? That doesn't make any sense. We're not talking about wavelengths; we're just talking about the colors. Another good example would be, let's say, in football, player numbers on your back. Now, symbolically, sometimes certain player numbers have a meaning, but you can't compare and contrast them. You can't say that 8 is two times better than 4. That doesn't make any sense. You also can't really order them in general: player 16 doesn't go before or after player 13 in the list in any way that makes sense.
So nominal data is useful, and it could be really important, but it's data where we have labels and no way of ordering those labels. You can still analyze it, but you can't, for example, calculate the average, the mean. That wouldn't make any sense. What you can do is calculate the mode, so you can calculate the most common one. You could say that more people prefer red to blue, but you couldn't say that the average color people like is a sort of muddy brown. That doesn't make any sense at all. So as we go down this list, we get more and more informative types of data, in some sense. The next one is ordinal. In ordinal data we have an order, but we can't measure distances between things. A good example would be the positions people finished in a race. So maybe I finished first, I'm super quick, and you finished third. But how far apart we are isn't included in that kind of data; you'd have to have a separate value for that. Another example that we're all familiar with is rating systems, right?
So perhaps I rate a film from one to five stars and you rate the film from one to five stars, but you can't really say that a film that's got four stars is two times better than one that scored two, because that's very subjective and there's no real measurable distance between these stars. If you have ordinal data you can calculate the mode again, the most common value of all the values that were returned, or you can calculate the median, the one that sits in the middle. So with fifty runners in a race, the 25th position, roughly speaking, is going to be around the median. So it's still not hugely useful. Next up we have interval data. In interval data we have an order and we have a distance, but we have no absolute zero for the scale. A good example would be degrees Celsius or degrees Fahrenheit. Zero degrees Celsius isn't "no temperature"; it's a specific temperature. So we can't say that fifty degrees is half of a hundred degrees. The numbers are, but it doesn't really make sense. We can say that a hundred degrees is hotter than 50, which is hotter than zero. So this is interval data. Interval data lets us do a few more things than we could with ordinal: as well as being able to calculate the mode and median, we can now calculate the mean temperature. That's okay. And we could also calculate things like the range, the minimum and maximum temperatures for a certain window.
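As a rough illustration of which summary statistics fit which kind of data so far (mode for nominal; mode and median for ordinal; mean and range as well for interval), here is a small Python sketch. The sample values and variable names are made up for illustration and do not come from the video.

```python
from statistics import mean, median, mode

# Hypothetical sample values, invented purely for illustration.
favourite_colours = ["red", "blue", "red", "green", "red"]  # nominal
film_ratings = [1, 2, 2, 4, 5]                              # ordinal (stars)
temperatures_c = [5.0, 15.0, 20.0, 20.0, 30.0]              # interval (Celsius)

# Nominal: only the most common label is meaningful.
colour_mode = mode(favourite_colours)

# Ordinal: the order is meaningful, so the median works as well as the mode.
rating_mode = mode(film_ratings)
rating_median = median(film_ratings)

# Interval: distances are meaningful, so the mean and the range make sense too.
temp_mean = mean(temperatures_c)
temp_range = max(temperatures_c) - min(temperatures_c)

print(colour_mode, rating_mode, rating_median, temp_mean, temp_range)
```

Nothing stops the code from computing a mean of the star ratings; the point is only that the result would not mean much, which is why the ordinal example stops at the median.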
So that's pretty useful. Another good example of interval data would be pH level. Again, a pH of zero means very acidic; it doesn't mean there is no acidity at all or no pH at all. We can say that pH 13 is higher than pH 7, which is higher than pH 3, and we know how far apart these numbers are, but we can't necessarily say that one is double another. The final kind of data we're going to look at is ratio data. This is exactly like interval, except that we now have a true zero value. A good example of this would be kelvin. Kelvin has an absolute zero, which is the absolute absence of any kind of heat, and then it goes upwards, so in terms of kelvin we can say that a hundred is half of 200, and so on, and we can get to zero. Another example would be number of children. Zero children means the absence of any children, and you can also say that four children is double two children. And too many to look after, in my opinion. So that is an example of ratio data. Ratio data is quite similar to interval in terms of what you can calculate, but it allows some more complicated statistical measures such as the t-test. So these are the types of data. Now, actually, it's quite important how you structure your data in general. We can't just have it sitting in some massive spreadsheet with no thought given to where everything is. There's actually a pretty standard way of doing this that we're going to look at. Data comes in lots of forms:
different types of measurements, different experiments; people are going to collect it in different ways. But there's a very standard way that we use to represent data once it's actually on a computer, so we can have some kind of table of our data. We almost always represent our data in a matrix, a two-dimensional table like this, because it's much easier to do. Along the top we're going to have our attributes, which are the things we've been measuring. So an example would be: maybe we're collecting data on people, so we could have name, which would be some nominal data, and then age, height. So the columns are attributes, the things we've been measuring, and the rows are the instances or the samples we've got, so that's all the individual people. So here's person 1, person 2, person 3, and person 3 is called John, and they're 54 and 5 foot 11 or whatever, and so on. You can have as many rows as you want. So when we talk about attributes, we're talking about the columns. People use lots of different terms for these: I like to think of them as features; attributes is another one. And we have instances or samples down the rows. Quite often, in the very last column of your data (sometimes it's separated out, but that's not really important), we'll have our output. Maybe we're trying to make a decision based on these people. Maybe these are candidates for a football team and we're saying: are they going to be on the team or not?
So this is "yes". "No". John's made it: "yes". "No", "no", and so on. That way we could perhaps analyze our decision-making process and decide whether any aspect of these things informs our decision-making, as an example. Now, we always structure data in this way, and if we don't it becomes a huge problem, because you end up spending all your time formatting and trying to work out what's what, and why John is listed down the table and not across the table, and nothing makes any sense anymore. So let's look at an actual data set and we'll see all this in action. We have here a data set of whether someone goes to play tennis, and whether or not we go is going to depend a little bit on what the weather conditions are. So we don't like to play, for example, when it's too hot. The tennis data set has just the same structure as the data set we looked at already. We're going to load it into R; it's held in a CSV file. So: tennis, read.csv, tennis. We're using R for this because it's free and it has a load of decent functions for analyzing, examining and visualizing data, so we're going to be using it throughout these videos. Obviously you could use MATLAB or Python or some other library if you wanted to; I think you should use whatever you're most comfortable with. Looking at these rows and tables, it looks a lot like something like Microsoft Excel. You could do this data analysis in Excel. Some people would disagree. No, Excel is perfectly good for what it does; you could do data analysis in it. I think that Excel doesn't enforce anything to do with observations versus variables and things like that. These are distinctions that are not really made in Excel. Obviously if you enforce those rules yourself that's going to work, but you have to be a little bit more regimented and rule-based about it. I think the consensus would be that if you really want to get into data analysis and start doing things like principal component analysis or more advanced statistical measures, something like R or Python is going to help a lot more. OK, so I've loaded the data set, and if we look at the top few rows of the data you'll see that there are six different variables, or six attributes, and this data set has 14 instances or observations; R calls them observations. So what we're saying is we have six columns and fourteen rows in our data set, and it is structured exactly like the people data set that I was looking at a minute ago. We can examine a single instance: what is it about day three? Let's have a look at day three, so we can say tennis[3,]. On day three it was overcast, the temperature was only five degrees, the humidity was high, there wasn't any wind, so they decided to play tennis.
So it's a bit chilly, but I guess they gave it a go. We could also look at all the different temperatures, for example, or all the different forecasts: tennis.outlook. And we can look at all the outlooks in the data set, so we've got sunny, sunny, overcast, rainy, rainy, rainy and so on, and we can get a feel for what kind of weather we're looking at here as well, using something like R. You can examine the instances, you can examine the individual attributes, you can group them together or not as you see fit, and then you can start to drill into what this data set means. Now, this data set has in it the final column, which is whether they actually played, so you could use something like machine learning to predict that final column based on the other columns. That's something you could do. One other thing about this data set that's quite interesting is that it has a few examples of the different kinds of data we were looking at earlier. So remember, we have nominal, ordinal, interval and ratio. For example, outlook is really a nominal field; it's a nominal data type. You could perhaps suggest that you could order it from rainy through to sunny, but then with cloudy and overcast it doesn't really make any sense, so this is nominal. You could calculate, for example, the mode and say that most of the days were rainy or something like this. Temperature, as we discussed before: this is in Celsius.
So this is going to be interval. We can order the data, and we can say one of them is 15 away from another one, but we can't say how big a difference that is in relative terms: is that double the temperature or half the temperature? We can't really say. Humidity is ordinal, so we can say high is more humidity than normal, but we can't really say how much; that's going to depend on who was measuring it and where their differences lie. And finally, wind in kilometers per hour. Zero is no wind, and you can't have negative wind, so this is ratio. You can say that a 20 mile an hour wind, or a 20 kilometer an hour wind, is twice a ten. That's something you can say. So this little data set contains all the kinds of data, and the different statistics and measures you can calculate on these are going to depend on what kind of data they are. So we can see that even a very simple data set like this has loads of different kinds of data and different ways we could interpret it. If you make a decision to play based only on whether the outlook is good, you're maybe not going to solve the whole problem, right?
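The video does this in R. As a rough sketch of the same idea in Python (attributes as columns, instances as rows, one row and one column pulled out, and the mode of a nominal column), something like the following would do. The column names and most of the values are assumptions made for illustration; only the day-three row and the first six outlooks follow what is described in the transcript, and the real data set has 14 rows.

```python
from collections import Counter

# Column names and most values are invented for illustration; only the
# day-3 row (overcast, 5 degrees, high humidity, no wind, played) and the
# order of the first six outlooks follow the description in the transcript.
columns = ["day", "outlook", "temp_c", "humidity", "wind_kmh", "play"]
rows = [
    (1, "sunny",    25, "high",   10, "no"),
    (2, "sunny",    28, "high",   20, "no"),
    (3, "overcast",  5, "high",    0, "yes"),
    (4, "rainy",    12, "normal",  5, "yes"),
    (5, "rainy",    10, "normal", 15, "no"),
    (6, "rainy",    11, "high",    0, "yes"),
]

# One instance (row): the day-three lookup.
day3 = dict(zip(columns, rows[2]))

# One attribute (column): every outlook in the table.
outlook_index = columns.index("outlook")
outlooks = [row[outlook_index] for row in rows]

# Outlook is nominal, so the mode is the statistic that makes sense for it.
most_common_outlook, count = Counter(outlooks).most_common(1)[0]

print(day3)
print(outlooks)
print(most_common_outlook, count)
```

A plain list of tuples is used here only to keep the sketch self-contained; a data-frame library would be the more usual choice for real analysis.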
So these are the kinds of things we'll be looking at as we go forward. And one thing we might do next is to visualize this data, to start to try to understand some patterns or extract some kind of knowledge. They're a very important tool, but you've got to use them properly. You can't just plot anything and everything. Every chart you use has got to support your hypothesis, or it's got to try to show the story you're trying to tell. You don't just plot something because it could be plotted. There's got to be a point. There are a lot of problems with using inappropriate graphs and only picking subsets of your data. That's a huge problem.
  • 2021-08-24 12/44 String formatting example (String formatting code)
Let's see how we can do that string format. I'm going to start off by doing first_name equals Christopher and last_name equals Harrison, just like before. Again, what we could do is simply set output equal to our 'Hello', a plus, our first_name, a plus, a space, and then (oops) our last_name at the end there. Then I say print and our output, and that will of course give us, if I run this correctly, there we go, our "Hello, Christopher". Now, I'm going to do the exact same thing multiple times, and we should just get the exact same output. My goal here is really just to show the different ways that we can do this. So let me comment that out, and now let's do this again using the placeholders. We'll say "Hello", then two curlies, and then two curlies, and now I'll say format (again using Tab to give me that completion there) and first_name and last_name. Let's save that and rerun it: once again, the exact same output, "Hello, Christopher Harrison". If I want to be specific about it, I can go in and say 0 and 1 and run it. You'll notice we get the exact same output. But maybe I want to reverse last_name and first_name and put a comma between them, so I go in and do 1 and then 0, with a comma in the middle. What you'll notice when I run this is that it gave me "Hello Harrison, Christopher" without me having to change the order of the parameters, which is another advantage of specifying the numbers. Now again, my preferred mechanism is just to use that cool little format string. So I'm going to say output equals, then f, and now I'll say Hello, comma, curly, first_name, space, curly, last_name, that is, Hello, {first_name} {last_name}, and you'll notice the IntelliSense actually comes right along for the ride there, suggesting first_name and last_name for me. Now I'll hit Save and rerun that, and we're back to that "Hello, Christopher Harrison". So the big takeaway that I want you to get out of all of those examples is that there are multiple ways to do the exact same types of string concatenation. It's really up to you to decide which way you prefer. For me, like I said previously, I really like that last string format with the little f at the very beginning. That to me reads the best, and you'll also notice that a lot of other programming languages have a very similar construct, such as JavaScript or C#, for example. I also really like the fact that it is self-documenting, because I can very clearly see that's where first_name is going to be and that's where last_name is going to be. That's how we can work with strings inside of Python.
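The four variants walked through above can be put side by side. This is a minimal sketch, assuming the greeting literal is "Hello, " throughout (the exact punctuation typed in the video is not fully clear from the transcript); the variable names other than first_name, last_name and output are mine.

```python
first_name = "Christopher"
last_name = "Harrison"

# 1. Plain concatenation with +
output_concat = "Hello, " + first_name + " " + last_name

# 2. str.format with empty placeholders, filled left to right
output_format = "Hello, {} {}".format(first_name, last_name)

# 3. str.format with numbered placeholders: the order in the text changes,
#    the order of the arguments does not
output_reversed = "Hello, {1}, {0}".format(first_name, last_name)

# 4. f-string: the variable names sit where their values will appear
output_fstring = f"Hello, {first_name} {last_name}"

print(output_concat)
print(output_format)
print(output_reversed)
print(output_fstring)
```

Variants 1, 2 and 4 produce the same string; variant 3 swaps the two names without touching the arguments passed to format.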
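  • 2021-08-24 Why did I quit my 100K+ job at Google?
Why I quit my job at Google. I worked at Google in Montreal for about a year as a full-time software developer. I liked that job a lot, but I decided to leave it to focus on running my YouTube channel. When I told my friends about this, some of them said, "You're following your passion. That's great." But honestly, that wasn't my goal. I enjoy making YouTube videos, no question, but I didn't quit my job to follow my passion. If I were following my passion, I'd probably be a stand-up comedian by now. What? I like comedy. But that's not what I did when I left Google. Instead, I paid close attention to a gap I had found in the job market.
Shortly before I started at Google, I created my YouTube channel, 《开发大师》, which is mainly about software engineering job interviews. The channel started to get a lot of attention: my videos got more than 100,000 views and the channel had more than 5,000 subscribers, at a time when it had only 10 videos. Around the same time I also started coaching people, helping them prepare for software engineering job interviews, and I was paid for it. From those experiences I realized that I had started to help people close a gap between what they wanted and what was available, a gap that is very common in the job market. People wanted high-quality videos about software engineering interviews but couldn't find them elsewhere, which is why my channel got so many views and subscribers. Some people also wanted someone to coach them one-on-one for software engineering interviews, but they had trouble finding such a person, which is why they came to me. Because I had started to help close that gap, I was able to get my channel noticed in a short time.
Through the channel I can help thousands of people, and being able to help that many people feels really great. I also thought that if I could run my YouTube channel full-time, I should be able to help more people, more effectively. That is the real reason I left Google. In fact, I now earn far less than I did before, but seeing that I've made a difference in other people's lives is really great, and it's something I couldn't experience in my job as a software developer, even though working at Google really was great. I'm not sure how my channel will develop in the future, but I think it has a lot of potential, and I can help more people in other places too. You know, I feel really lucky to be able to learn about so many people's lives. So I promise you that I'll keep running my YouTube channel so that I can help more people. OK, I hope you liked this video. See you in the next one. Bye. High five! Want to see more videos like this? Then give this video a like. Is there any type of video you'd like to see? Let me know in the comments below.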
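  • 2021-08-24 What to do when you're not in the mood to get things done?
As a freelancer, I find it hard to motivate myself to do the things I'm supposed to do. About two years ago, after I quit my full-time job, I found it hard to find the motivation to finish tasks. But over that time I tried a lot of ways to make myself more productive, and eventually I found a few that work for me, so today I'm sharing them with you. I've listed seven steps that let you get something done even when you don't feel like doing it.
If you feel unmotivated, the first thing to do is try changing your environment. If you don't like working at home, try a café, a library, or your school. For me, when I find I have no motivation at home, I just go to my favorite café and my motivation comes back, and getting things done there is much easier. So try to find the place where you work most efficiently, wherever it is.
Once you've found a good environment, a suitable place to work, the next step, if you haven't done it already, is to write down the tasks you need or want to finish that day, whatever they are. How you record them doesn't matter, so use whatever you like: Evernote, another scheduling app, or just a sheet of paper. Personally I like Google's note-taking app (谷歌云笔记), because I find it easy to set up and use. Whatever method you use, just writing things down is useful: you no longer have to hold so many things in your head, which lets you focus better.
After that, if I still don't feel motivated, I finish small tasks first. These can be very simple, like replying to emails, or something fun, like planning your vacation. Whatever they are, finishing these small tasks gets you into the frame of mind for bigger ones. Of course, don't spend too much time on unimportant small tasks.
Next, suppose you have an important task to finish, any task; say it's writing a blog post. If you still have no motivation, first ask yourself why you want to do it. Keep asking yourself why you're spending this time, to dig further into your motivation. In this particular case you might say: I want to write the blog post because I can make some money from my blog. You might say: I want to earn some extra money so that I can travel more in the future. You might say: travel is important to me because I want to experience more things in the future. So that is your motivation in this particular case. Of course it may be different for you, but whatever it is, if the task is important, you need to get started on it and find your motivation.
After that, if you do feel a task is important to you, the simplest approach is to split it into lots of small tasks. If the task is writing a blog post, start by writing the first paragraph, or an outline. If even that is too much, write down a few important keywords for the article. Whatever your task is, break it into small steps so that it's easy to start.
If it still feels hard to get started after breaking the task into small steps, here's what I do: I tell myself I only have to work on it for five minutes. You can use your phone's timer to time five minutes. If five minutes is too long, make it two; if two is too long, make it one minute or thirty seconds. However boring or painful the work is, you should be able to do it for one minute. When the minute is up, ask yourself whether you can do another two minutes, four, ten, or even twenty. That way you gradually increase the time you stay focused on the task. By the way, I recommend the app Forest, because I find it a bit more fun than timing yourself with a plain timer.
Finally, the last method I recommend for dealing with not wanting to do a task: reward yourself. The reward can be simple, such as a snack you've been wanting for a while, or the latest episode of your favorite show. Whatever it is, tell yourself you only get the reward once the task is done.
Those are my seven steps for getting tasks done. But there's one more thing I have to talk about: music. I don't usually listen to music when I work, because that's how I concentrate best. But good music can help you through boring work, such as legal paperwork or other task-related work; with music on, the task isn't as dull. That's everything I wanted to say. If you'd like to see other kinds of videos or have topics you'd like me to cover, let me know in the comments. Thanks for watching, and see you next time.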
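  • 2021-08-24 What is machine learning?
Have you noticed that smartphones have recently gotten much better at understanding what you say? They used to be terrible. "OK Google, call my mom." "OK, calling Walmart." Seriously? But thanks to progress in machine learning, they're much better now. What is machine learning? Simply put, machine learning means getting machines to do intelligent things based on large amounts of data, for example teaching them to carry out complex tasks that can't be done just by writing down all the rules, such as driving a car. It can also be used to mine large data sets and find patterns and correlations that we humans might not find. I've noticed that most people have probably heard the term machine learning but don't know what it means. It's actually a good thing to know about, because almost every time we use our phones or go online, we interact with machine learning algorithms. To some extent it governs a lot of things in our lives, so it's good to know what it can do.
Smartphones understanding speech is one example of machine learning in use, but there are many, many others. For example: Netflix and Amazon recommending other things you might want to buy; your email recognizing spam; Snapchat's face filters; Google and Facebook targeting you with ads based on your usage data; analyzing the stock market to find good portfolios; the post office recognizing your handwritten address; mining users' location data to tell you where traffic is heavy; self-driving cars recognizing roads and obstacles; Amazon's warehouse robots locating stock precisely; the military finding hidden buildings in satellite data; Google recognizing house numbers from its Street View images; in medicine, diagnosing tumors from medical images; even photo-editing tools like the magic wand and one-click fix; NASA analyzing telescope data to look for bodies outside the solar system, or plotting routes for Mars rovers; and many more. Sorry for bombarding you with information, but it shows that many fields, and more and more of them, are using machine learning to solve problems.
But how does it work? Let me explain, using speech recognition as the example. Suppose you want to program a computer to convert speech into text. You could sit down and try to work out the rules by which syllables make up words, but that's impossible: you would never work out all the combinations, especially once different languages and accents are involved. So a better approach is to have the computer run a learning algorithm. The rough idea is to feed the algorithm a large number of recordings of people talking together with the corresponding text: record a lot of conversations, have people transcribe and annotate them, and then give all of that to the algorithm. If all goes well, the machine learning algorithm learns the patterns and associations between the different syllables and words. Once the algorithm has been trained, you can give it recordings that have no annotations and it should output the text of what was said.
But training these machine learning algorithms is by no means easy. Getting it exactly right is very hard: there are a great many parameters, and you need to know a lot of skills and tricks to train them well. Big companies around the world are spending a lot of resources on improving machine learning algorithms because they're all competing with one another. Facebook, Google, Amazon: they're all competing, and the cutting edge of that competition is machine learning. So it's really quite important to understand something about this right now. The best machine learning algorithms handle most tasks well, but there are always some edge cases they can't handle. For example, even the most advanced speech recognition algorithm may not recognize every British dialect. "How much is a new pair of trousers? My old ones are full of holes." Fail! This video is only an introduction to machine learning, and the example I used is just one kind. If you'd like to know more, let me know; I could do a lot on this topic. If you have any questions, leave them in the comments below. Maybe I'll make another video about this soon.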
  • 2021-08-24 Binary to BCD encoding (the Double Dabble algorithm)
In some ways there are some awkward incompatibilities between the decimal that we like to use and the binary which is so efficient and wonderful for computers to use. We've seen one example of this already. I think I've mentioned it in a previous video: 0.1, or 0.10, ten cents in other words, in decimal, in your current bank balance, is not exactly representable as a binary number. You look at what it is in binary, 0.000110011…, and it just isn't exact. It keeps on recurring, just as 1/3 in decimal isn't exact; it goes 0.333333… forever. Decimal and binary sometimes don't mix. Now here's going to be a classic example. Let's just think this through, without even writing anything down. If we've got a 4-bit nibble, we know that in hex it goes from 0000 to 1111, 15, represented as f. We can't use that full range. This is binary-coded decimal, not binary-coded hexadecimal. Yeah?
We have got to say, the moment that representation, even in one nibble, gets to 1010, that is 10. You can't leave it as 1010. You've now got to have two nibbles: the left nibble with a 1 in it, and the right nibble wants to look like 0. You can't compress it into a single nibble, say "1010", and say "I'm sorry folks, you'll have to learn hex, because otherwise you won't understand your bank balance." This is not going to go down very well. The challenge then is, if you're using a 4-bit nibble but only for the decimal range 0 to 9, you've somehow got to make it, in all your bit twiddles, carry into another nibble on the left at 10, and not at 16, which is what hexadecimal will do for you. So how does one bridge that gap? Probably the best way for me to get into the hard bit about this is to go straight away for that magic number 10. Let's represent it in binary and then say: how do we convert it into BCD?
and realize that we need that second nibble on the left. What I'd like to do here is to draw myself columns, and I'm going to restrict myself to things that are, at most, two decimal digits. Let's remind ourselves, up here, that we're going to have a tens nibble and a units nibble, and we'll initialize everything to zeros. But above here, just to keep things very simple, we'll use 4-bit binary representations. And I hope you will all agree that 1010 is ten in base ten, and the reason is that in 4-bit binary that's an 8, that's a 2, so that's 10. This technique is called Double Dabble. I don't know how it was discovered; it's fiendishly clever. But the idea is, we'd love to convert binary into BCD by, as far as possible, using simple bit shifts all the time and doing the minimum of mucking about to get it to carry early. So the "Double" reflects the fact that we're going to shift this bit pattern across into here. And we're going to regard it as one huge, great, big 12-bit register, a walloping great shift register, all joined together. Even though I've drawn it out separately, it's just going to move from the right across to the left. And remember, every time you shift the thing one place left you are basically doubling it. That's where the "Double" comes from. But we find we have to intervene to make it look right at the end, and that is where the "Dabble" comes from. If you look up "dabble", as I did in the Chambers Dictionary, one of the meanings is "to make trivial alteration to". To make a small alteration to something, you're "dabbling" with it. So that's where Double Dabble comes from: it's basically doubling, with a little bit of dabbling. And the truth really hits you at 10. So let's progressively shift this by 1 bit left. What's going to happen first of all? You shift over that 1 bit. You push it across into here, because this is a unified register for the moment. Purists would say: "Ah! But when you shift left like that, you should fill in with zeros on the right." Yes, that is what will actually happen inside the hardware, but I prefer not to pad with zeros on the right as I shift, because I want you to see when I've finished. So we could call this shift No. 1. Let's do another one. That 1 moves into that position, but you're bringing over another 0 out of that part, and that's leaving you with 10 in there. Now notice what's happened. On shift 1 here, you had a 1 on the right, in that nibble. By the time you've shifted it left one place, it's in the 2's position. So you've doubled it. Let's do shift 3. And a zero is left, so that is shift 3. Now, this is where we can begin to see trouble on the horizon. We have got one more shift left to do, and if you don't do anything about it, it's just going to end up with 1010 in here. What's happened here, look, is that was two. You doubled it. But because you shifted a 1 in and not a 0, you've doubled it and added 1. That now says 5, OK?
So basically it's doubling. But sometimes, if the bit you shift over is a 1 and not a 0, it's double and add 1. But essentially it's doubling. Now, the trouble coming on the horizon, as I can see, is that if I just push that 0 bit over here, I'm going to end up with 1010, and I know it's 10. Fine! But no, that's hexadecimal! It's not representable as a digit from 0 to 9. So what should you do then? Let it happen anyway and then look at it and say: "Oh my golly, it's gone to ten; it's gone to eleven; it's gone to fifteen even! I'd better backtrack and undo it and then redo it"? No. Dive in early and reason as follows. Concentrate, everybody! What we want here is for this thing to come out looking like 0001 0000. Let's say that's the desired result, because regarding these as BCD digits, that's 1 0, ten. That's exactly what you want. So how do we make that happen? How do we make it carry over into this left-hand nibble here, when it doesn't want to at the moment? So the fiendishly clever thing says: take a look at what you've currently got, because if what you've got is 5 or more, the act of doubling it is bound to get you into a number that needs to carry across. So it is going to cause you trouble at 5 or more. We want it to carry at 10; it innately would like to carry at 16, and you don't want that. David: What's the difference, Sean, between 10 and 16? Sean: 6. David: 6. What's half of that? Sean: 3. David: All right.
So, if we add three, the fact that we’re then shifting it will double our 3 contribution to 6, and we’ll make it carry. So the rule is, on Double Dabble, if what you see in your nibble is 5 or more, then add 3. So, here we go, look. Next stage now. Because we’ve seen trouble on the horizon. It’s 5, so add 3. And 3, we agree, is 11 in binary. Now, here you do have to do a little addition with carries. You can’t avoid it. Some carries will have to take place. One and one is zero. Carry one. One and one is zero. Carry one. One and one is zero. Carry one. The act of adding 3 will make it look not like 0101; you’ve added the 3, and it now looks like 1000. Magic! But what happens when you shift the final 0 in? That 1 will shift left, into the left-hand nibble. And you’ll end up with 0001 0000. And this thing is now empty. So you know you’ve come to the end of your conversion. It’s so cool. I love it dearly. You could argue, though, that the one problem with all this is that in order to do your shifts quickly, you’ve got this in a sort of unified shift register full of bits. Your nibbles, in the end, end up looking correct. But you’re going to have to dig them out of the shift register. Oh yeah! It’s clearly. That’s a 4, yes. That’s a 2, isn’t it? Magic!
Of course, if you’re using this seriously, you have to try and generate these BCD digits in a way where they don’t necessarily need digging out of a bigger representation. But on the other hand, you’re using that behind the scenes. I’ve found, for you, the ultimate reference that I’ve taken this example from, and used the methodology. It’s by a guy called Chuck Falconer. It’s actually referred to in the Wikipedia articles on BCD and Double Dabble. So we’ve pulled that over. It’s freely available. You can go and dive in there to your heart’s content, because he covers how to make them [the nibbles] appear in a much more usable way. And what he also says is that when you start looking at this, you realize you are actually doing the “division by ten and remainders” thing that we discussed. But you’re doing it in a pretty efficient way and only occasionally needing that little addition of 3. So I’m not saying there aren’t other ways. There seem to be endless variants on this: there’s signed BCD; there’s packed BCD; there’s all sorts. But if you just want to understand the fundamentals, I would say go through the 42 example, then go to Chuck Falconer’s memo. He does 255, as decimal, and boy, that needs spotting problems in about three sets of nibbles, not two. You have to spot one in the middle thing happening and so on. Sean: You mentioned 255, so this goes up to hundreds, thousands…? David: Yes, yes.
Sean: You just add more…? David: You add more BCD digits on the left to cope. But you give yourself a bigger problem when examining each of these digits to see if they’re about to go beyond ten when they’re doubled by shifting left one more time. You give yourself a bigger and bigger inspection task. There’s no question. So, like I said, the Chuck Falconer memo from which this is derived, we’ll put a link out to it. It is freely available. He doesn’t explain how the people who invented this actually discovered it, worked out that it really does work. It seems almost like magic when you do it. And then, every so often I pull out another number and think, “I bet it won’t work for this.” But it does. It’s quite incredible. So, there we are then. I think we’ve fairly well summarized now what the situation is: that for great, big engineering and scientific calculations, even for finding new prime numbers as huge integers, you really do need proper binary to speed things up. But for some sorts of trivial calculations, you might even want to do it in BCD all the time. But even if you are basically binary and want to print out your answers, you still have to convert from binary through to BCD. And that is always a worry for the people who write the I/O routines, shall we say, for C and so on. Is this going to be efficient? What we’re saying is, at the computing end of things, you should be able to prepare that BCD-digit stream as quickly as possible. To finish off then, life, the universe and everything: will this work for converting 42?
Yes, it will. Now admittedly, here’s one I did earlier.
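The shift-and-add-3 rule walked through above can be sketched in C. This is only a minimal illustration, not the code from the Chuck Falconer memo: the function names `double_dabble8` and `bcd_digit`, the 8-bit input width, and the layout of the unified register (three BCD nibbles sitting above the 8 binary bits) are all assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Convert an 8-bit binary value to packed BCD using the
 * "if a nibble is 5 or more, add 3, then shift left" rule.
 * Result: hundreds in bits 11..8, tens in bits 7..4, ones in bits 3..0.
 * (Hypothetical helper written for this illustration.) */
uint16_t double_dabble8(uint8_t value)
{
    /* Unified shift register: BCD nibbles in bits 19..8,
     * the remaining binary input in bits 7..0. */
    uint32_t reg = value;

    for (int shift = 0; shift < 8; ++shift) {
        /* Inspect every BCD nibble before doubling it. */
        for (int nibble = 0; nibble < 3; ++nibble) {
            int pos = 8 + 4 * nibble;
            if (((reg >> pos) & 0xFu) >= 5u) {
                /* Adding 3 now becomes adding 6 after the shift,
                 * which turns a carry at 16 into a carry at 10. */
                reg += (uint32_t)3 << pos;
            }
        }
        reg <<= 1; /* the doubling step */
    }
    /* "Dig the nibbles out" of the shift register. */
    return (uint16_t)(reg >> 8);
}

/* Extract one decimal digit: index 0 = ones, 1 = tens, 2 = hundreds. */
unsigned bcd_digit(uint16_t bcd, int index)
{
    assert(index >= 0 && index < 3);
    return (unsigned)(bcd >> (4 * index)) & 0xFu;
}
```

Under these assumptions, calling `double_dabble8(10)` yields the nibbles 0001 0000 from the walkthrough, and `bcd_digit(double_dabble8(42), 1)` picks out the 4. Wider inputs would need more BCD nibbles on the left and more nibbles to inspect per shift, as the discussion of 255 points out.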
  • 2021-08-24 How Many Bytes Are in a Gigabyte? Computers today can hold a lot of information; the terminology can get confusing. Gigabytes, Terabytes, and what’s a Petabyte? Let’s start at the beginning. This cube here represents the smallest piece of information a computer can hold. It’s called a Bit. It stores two different states, or values. You can think of this as true or false, on or off, yes or no. But most commonly, it’s thought of as a 1 or a 0. Now let’s add another Bit. By changing these two Bits to different combinations of ones and zeros, we can now store four different values: Zero Zero, Zero One, One Zero and One One. This is how you count from 0 to 3 in binary. Now let’s add another Bit; you can store 8 different values now. Each time you add a Bit, you double the amount of information you can store. Once we get to 8 Bits, we group them together. This is called a Byte. It can store 256 different values. A Byte is also enough space to store a single character, such as a letter, number or symbol. Add more Bytes, and we can store larger numbers, colors, and we can even put many characters together to form sentences. We’re going to need a lot more Bytes to store something useful. When you have 1024 Bytes, we call this a Kilobyte. Now you might ask, why 1024? Why not 1000?
Hang tight, I’m going to talk about this towards the end of the video. A Kilobyte is enough space to store a small text document. To store a low-resolution picture, you’re going to need somewhere around 100 Kilobytes. Once you get to 1024 Kilobytes, we call this a Megabyte. One minute of MP3 audio is about 1 Megabyte. Those old 3.5-inch floppy disks can hold 1.44 Megabytes of information. A high-resolution picture can be around 5 Megabytes. A CD can hold up to 700 Megabytes. 1024 Megabytes is called a Gigabyte. A DVD can hold up to 4.7 Gigabytes. A Blu-ray Disc can hold up to 25 Gigabytes. A Flash Drive today can hold 32, 64 or 128 Gigabytes. They’ll probably have something larger by the time you watch this video. 1024 Gigabytes is called a Terabyte. A Hard Drive will be anywhere from 1 to 8 Terabytes. 1024 Terabytes is called a Petabyte. Web servers for companies such as Facebook, YouTube or Amazon will be measured in Petabytes. 1024 Petabytes is called an Exabyte. The amount of internet traffic every month will be measured in Exabytes. After Exabytes comes Zettabyte and then… Yottabyte. Well, you get the idea. So big, you probably won’t hear too much about Yottabytes for a while… maybe someday though… OK, let’s get a bird’s-eye view of all of this. 8 Bits to a Byte, and 1024 for every jump after that. So there’s 1024 Bytes to a Kilobyte, and 1024 Kilobytes to a Megabyte and so on… So why not 1000?
Actually, by some definitions it is 1000. But most of the time we mean 1024. Computers like the number 2. If you multiply ten 2s together, that is, 2 to the 10th power, you get 1024. This is why computers like that better. So there you have it: next time you hear one of these words, you’ll know what it means. My name’s Jared Owen. Thanks for watching.
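The doubling-per-bit rule and the 1024-per-step ladder from this video can be checked with a few lines of C. This is a rough sketch only; the enum and function names are made up for the illustration, and the table stops at Exabytes because larger units no longer fit in a 64-bit integer.

```c
#include <assert.h>
#include <stdint.h>

/* Unit steps above a single Byte (names are illustrative only). */
enum unit { UNIT_BYTE, UNIT_KB, UNIT_MB, UNIT_GB, UNIT_TB, UNIT_PB, UNIT_EB };

/* Number of distinct values that `bits` bits can hold:
 * every added bit doubles the count. */
uint64_t values_for_bits(unsigned bits)
{
    assert(bits < 64);
    return (uint64_t)1 << bits;
}

/* Bytes in one unit when every step is 1024 (2 to the 10th power). */
uint64_t bytes_per_unit_binary(enum unit u)
{
    assert((unsigned)u <= UNIT_EB);
    return (uint64_t)1 << (10u * (unsigned)u);
}

/* Bytes in one unit under the "by some definitions it is 1000" reading. */
uint64_t bytes_per_unit_decimal(enum unit u)
{
    assert((unsigned)u <= UNIT_EB);
    uint64_t n = 1;
    for (unsigned i = 0; i < (unsigned)u; ++i) {
        n *= 1000u;
    }
    return n;
}
```

With these helpers, `values_for_bits(8)` gives the 256 values of a Byte, and comparing `bytes_per_unit_binary(UNIT_GB)` with `bytes_per_unit_decimal(UNIT_GB)` shows how far the two definitions of a Gigabyte have drifted apart after three steps.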
  • 2021-08-24 Data Structures: #3 Introduction to Linked Lists In this lesson, we will introduce you to the linked list data structure. In our previous lesson, we tried to implement a dynamic list using arrays, and we had some issues there: it was not the most efficient in terms of memory usage, in terms of memory consumption. When we use arrays, we have some limitations. To be able to understand linked lists well, we need to understand these limitations, so I’m going to tell you a simple story to help you understand this. Let us say this is the computer’s memory, and each partition here is one byte of memory. Now, as we know, each byte of memory has an address. We are showing only a section of the memory; that’s why it is extending towards the bottom and the top. Let’s say that the address increases from bottom to top, so if this byte is address 200, the next byte would be address 201, and the next byte would be address 202, and so on. What I want to do is draw this memory from left to right horizontally instead of drawing it from bottom to top, like this. This looks better. Let’s say this byte here is address 200, and as we go towards the right, the address increases, so this is like 201, and we go on like 202, 203 and so on. It doesn’t really matter whether we show memory from bottom to top or left to right; these are just logical ways to look at the memory. So, coming back to our story: memory is a crucial resource, and all the applications keep asking for it. So, Mr.
Computer has given this job of managing the memory to one of his components, to one of his guys, who he calls the memory manager. Now this guy keeps track of what part of the memory is free and what part of the memory is allocated, and anyone who needs memory to store something needs to talk to this guy. Albert is our programmer, and he is building an application. He needs to store some data in the memory, so he needs to talk to the memory manager. He can talk to the memory manager in a high-level language like C; let us say he is using C to talk to the memory manager. First he wants to store an integer in the memory, so he communicates this to the memory manager by declaring an integer variable, something like this. The memory manager sees this declaration and says: OK, you need to store an integer variable, so I need to give you four bytes of memory, because an integer variable is stored in four bytes in a typical architecture. And let us say in this architecture it is stored in four bytes. So the memory manager looks for four bytes of free space in the memory and assigns it, or allocates it, for variable x. The address of a block of memory is the address of the first byte in the block. So let us say this first byte of memory here is at address 217, so variable x is at address 217. So the memory manager kind of communicates it back to Albert: hey, I have assigned address 217 for your variable x; you can store whatever you want there. And Albert can fill in any data into this variable. Now Albert needs to store a list of integers, a list of
numbers, and he thinks that the maximum number of integers in this list will be 4. So he asks the memory manager for an integer array of size four named ‘A’. Now, an array is always stored in memory as one contiguous block of memory. So the memory manager is like: OK, I need to look for a block of memory of 16 bytes for this variable, this array A. So the memory manager allocates this block, starting at address 201 and ending at address 216, for this variable ‘A’, which is an array of four integers. Because the array is stored as one contiguous block of memory, and the memory manager conveys the starting address of this block, whenever Albert tries to access any of the elements in the array, let’s say he tries to write the value at the fourth element in the array, which he accesses as A[3], Albert’s application knows where to write this particular value, because it knows the base address, the starting address of the block ‘A’, the array ‘A’, and from the base address, using the index, which is 3 here, it calculates the address of A[3]. So it knows that A[3] is at address 213.
So, to access any of the elements in the array, the application takes constant time, and this is one awesome thing about arrays: irrespective of the size of the array, an application can access any of the elements in the array in constant time. Now let’s say Albert uses this array of 4 integers to store his list. So I’ll fill in some values here at these positions; let’s say this is 8, this is 2, this is 6, this is 5, this is 4. Now Albert at some point feels that, OK, I need to have one more element in this list. Now, he has declared an array of size four, and he wants to add a fifth element to the array. So he asks the memory manager: hey, I want to extend my array ‘A’. Is it possible to do so? I want to extend the same block. And the memory manager is like: when I allocate memory for an array, I do not expect that you will ask for an extension, so I use whatever memory is available adjacent to that block for other variables. In some cases I may extend the same block, but in this case I have a variable ‘x’ next to your block. So I cannot give you an extension. So Albert is like: what options do I have? The memory manager is like: you can tell me the new size and I can create a new block at some new address, and we will have to copy all the elements from the previous block to the new block. So Albert says, OK, let’s do it. But the memory manager is like: you still need to give me the size of the new block. Albert thinks that this time he will give a really large size for the new
array or the new block, so that it does not fill up. This new block, starting at address 224, is allocated. Albert asks the memory manager to free the previous block. And this is some cost: he has to copy all the elements, all the numbers, from the previous block into the new block. And now he can add one more element to this list, and he has kept his array large this time, just in case he needs more numbers in the list. The only option that Albert had was to create ‘A’ as an entirely new block, as an entirely new array. And Albert is still feeling bad, because if the list is too small, he is not using some part of the array, and so memory is getting wasted; and if the list again grows too much, he will again have to create a new array, a new block, and he will again have to copy all the elements from the previous block into the new block. Albert is desperately seeking a solution to this problem, and the solution to this problem is a data structure named linked list. So let us now try to understand the linked list data structure and see how it solves Albert’s problem. What Albert can do is that instead of asking the memory manager for an array, which will be one large contiguous block of memory, he can ask for memory for one unit of data at a time, for one element at a time, in a separate request. I’m cleaning up the memory here. Once again, let’s say Albert wants to store this list of four integers in the memory. What if he requests memory for one integer at a time? So, first he pings the memory manager for some memory to store number six.
The memory manager will be like: OK, you need space to store an integer, so you get this block of four bytes at address 204. So Albert can store number six here. Now Albert makes another request, a separate request, for number five. Let’s say he gets this block starting at address 217 for number five. Because he makes a separate request, he may or may not get memory adjacent to number 6; the higher probability is that he will not get an adjacent memory location. So similarly Albert makes separate requests for numbers four and two. So let’s say he gets these two blocks at addresses 232 and 242 respectively, for four and two. So as you can see, when Albert makes a separate request for each integer, instead of getting one contiguous block of memory, he gets these disjoint, non-contiguous blocks of memory. So we need to store some more information here. We need to store the information that this is the first element in the list and this is the second element in the list. So we need to link these blocks together somehow. With an array, it was very simple: we had one contiguous block of memory, so we knew where a particular element is by calculating its address using the starting address of the block and the position of the element in the array. But here, we need to store the information that this is the first block, which stores the first element, and this is the second block, which stores the second element, and so on. To link these blocks together, and to store the information that this is the first block in the list and this is the
second block in the list, what we can do is store some extra information with each block. So what if we can have two parts in each block, something like this: in one part of the block we store the data or the value, and in the other part of the block we store the address of the next block. In this example, in the first block the address part would be 217, the address of the next block, which stores 5, and in this next block, the second block, the address part would be 232. In the block at address 232 we will store the address 242, the address of the next block, which stores number two. And the block at 242 is the last block. There is no next block after this, so in the address part we can have the address as zero. Zero is an invalid address; zero can be used to mark that this is the end of the list: there is no link to the next node or next block after this particular block. So Albert now actually has to request the memory manager for a block of memory that will store two variables: one an integer variable that will store the value of our element, and one a pointer variable that will store the address of the next block, the next node in the list. In C he can define a type named Node, like this. He will have two fields in the node: one to store the data, and this field will be an integer, and one more field to store the address of the next node in the list. So Albert will ask the memory manager for memory for a node, and the memory manager will be like: OK, you need a node that needs 4 bytes for an integer
variable and four more bytes for the pointer variable that will store the address. A pointer variable, also, in a typical architecture is stored in four bytes. So now the memory manager gives us a block of 8 bytes, and we call this block a Node. Notice that the second field in the node structure is Node star, which means pointer to Node, so this field will only store the address of the next node in the list. So if we store the list like this in the memory, as these non-contiguous nodes connected to each other, then this is a linked list data structure. The logical view of the linked list data structure will be something like this: data is stored in these nodes, and each node stores the data as well as the link to the next node. So each node kind of points to the next node. The first node is also called the head node, and the only information about the list that we keep all the time is the address of the head node, or the address of the first node. So the address of the head node kind of gives us access to the complete list. The address in the last node is NULL or zero, which means that the last node does not point to any other node. Now if we want to traverse the linked list, the only way to do it is to start at the head: we go to the first guy, and then we ask the first guy the address of the next guy, the address of the next node, and then we go to the next node and ask the address of the next node, and this is the only way to access the elements in the linked list. If we want to insert a node in the linked list, let’s say we want to add number three at the end of the linked
list, then all we need to do is first create a node independently and separately; it will get some memory location. So we created this node with value 3. Now all we need to do is fill in the addresses properly, adjust these links properly. So the address of this particular node will be filled in in this node with value 2. And in this node the address part can be NULL, so it is the last node; it does not point to any other node. Let’s also show these nodes in the memory here. So I have written the address of each node in brown at the top of these nodes, and I have also filled in the address field of each node. Let’s say the node for value three gets address 252. So this is how things will be in the memory, and this is how the logical view will be. The linked list is always identified by the address of the first node, and unlike arrays, we cannot access any of the elements in constant time. In the case of arrays, using the starting address of the block of memory and the position of the element in the list, we could calculate the address of the element, but in this case we have to start at the head, and we have to ask this element for the next element, and then ask the next element “who is your next?” It’s like playing treasure hunt. You go to the first guy and then you get the address of the second, and then you go to the second guy and you get the address of the third guy. So the time taken to access elements will be proportional to the size of the list. Let’s say the size of the list is n; there
are n elements in the list. In the worst case, to reach the last element, we will go through all the elements, so the time taken to access elements is proportional to n, or in other words, we say that this operation will cost us, or rather the time complexity of this operation is, big-oh of n. Insertion into the list: we can insert anywhere in the list. We first need to create a node and just adjust these links properly. Like, say I want 10 at the 3rd position in the list. So all we need to do is create a node and store the value 10 in the data part, something like this. Let’s say we get the node at address 310. So we will adjust the address field in the second node to point to this node at address 310, and this node will point to the node with value 4. Now, to insert also, we will have to traverse the list and go to that particular position, and so this will be O(n) again in terms of time complexity. The only thing is that the insertion will be a simple operation; we will not have to do all the shifts as we had to do in an array. In an array, to insert something in between, we had to shift all the elements by one position towards higher indices. Similarly, to delete something from this list will also be O(n). So we can see some good things about linked lists: there is no extra use of memory, in the sense that no memory is left unused. We are using some extra memory to store the addresses, but we have the benefit that we create nodes as and when we want, and we can also free the nodes as and when we want. We do not have to guess the size of the list beforehand
like in the case of arrays. We will discuss all the operations on linked lists and the cost of these operations, as well as a comparison with arrays, in our next lessons. We will also be implementing linked lists in C or C++. So this is all for a basic introduction to linked lists. Thanks for watching!
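The Node type and the "adjust the links" insertion described in this lesson might look roughly like the following in C. The lesson only shows the two fields of the node; the function names `list_insert`, `list_get` and `list_free`, the 1-based positions, and the error codes are assumptions made for this sketch, not the lesson's own code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One block of the list: the value plus the address of the next block. */
struct Node {
    int data;
    struct Node *next; /* NULL (zero) marks the end of the list */
};

/* Insert `data` so that it becomes element number `position` (1-based).
 * `head` is the address of the variable that holds the head pointer.
 * Returns 0 on success, -1 on a bad position or a failed allocation.
 * Walking to the position makes this O(n); no elements are shifted. */
int list_insert(struct Node **head, size_t position, int data)
{
    assert(head != NULL);
    if (position == 0) {
        return -1;
    }
    /* `link` is the address field that will point at the new node. */
    struct Node **link = head;
    for (size_t i = 1; i < position; ++i) {
        if (*link == NULL) {
            return -1; /* position is beyond the end of the list */
        }
        link = &(*link)->next;
    }
    struct Node *node = malloc(sizeof *node);
    if (node == NULL) {
        return -1;
    }
    node->data = data;
    node->next = *link; /* new node points at the old successor */
    *link = node;       /* predecessor (or head) points at the new node */
    return 0;
}

/* Copy element number `position` (1-based) into *out by following the
 * links from the head. Returns 0 on success, -1 if out of range. */
int list_get(const struct Node *head, size_t position, int *out)
{
    if (position == 0) {
        return -1;
    }
    for (size_t i = 1; head != NULL && i < position; ++i) {
        head = head->next;
    }
    if (head == NULL) {
        return -1;
    }
    *out = head->data;
    return 0;
}

/* Give every node back to the memory manager. */
void list_free(struct Node *head)
{
    while (head != NULL) {
        struct Node *next = head->next;
        free(head);
        head = next;
    }
}
```

Starting from an empty `struct Node *head = NULL;`, inserting 6, 5, 4 and 2 at positions 1 to 4, then 3 at position 5 and 10 at position 3, reproduces the list built in the lesson. Passing the address of the head pointer is one possible design choice; it lets the same code handle insertion at the front without a special case.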
  • 2021-08-24 How I Learned to Code and Got a Job at Google Hey YouTube, so as I mentioned in a previous video, I didn’t study computer science or computer engineering as my major at my university. Instead I was studying statistics, but on the side I learned to code, mostly on my own, and eventually I became good enough to get a job at Google as a full-time software engineer. So I briefly talked about how I learned to code in the same video, but in this video I wanted to go into more detail. So I’m going to talk about my personal experience first, about how I learned to code. But if you just want to find my recommendation about what you should do, you should just skip over to this time in this video (3:09). So here are the 4 steps I personally used to learn to code. First of all, I took a few summer courses during my summer break. The first course I took was an introductory programming course; it covered topics like loops, variables, if statements, and functions. And then the second course I took was on data structures and algorithms. It covered topics like trees, graphs, hash tables, searching, and sorting. These two courses were both taught using Java. After I took those courses, I decided to learn more on my own. I heard that you can use something called “Ruby on Rails” to build websites, and I wanted to build a website, so I decided to learn Ruby on Rails, and Ruby, which Ruby on Rails is based on. And to learn Ruby, I used the website called The Pragmatic Programmer, and to practice using it I used this other website called Project Euler, which gives you a ton of simple programming problems to
solve. To learn Ruby on Rails I used Rails for Zombies, which is an interactive website for learning how to use Ruby on Rails. And step 3: I started working on a bunch of personal projects. My first real personal project was going to be like the reddit of Japan, partly because I’m originally from Japan and I was living in Japan at the time. Working on this project was really helpful for understanding how Ruby on Rails works, how Ruby works, and also how web technologies in general work. Through this project, I also learned the importance of asking for help. You know, when you’re new to programming, it’s easy for you to get stuck, and I think it’s really important for you to just ask for help. You can do this virtually, through websites like Stack Overflow, Facebook groups, or reddit, or in person, if you know someone who knows how to code. And then, using the skills and connections I developed, partly through my projects, I got a few technical internships. These technical internships were really helpful for developing my skills further, because I started getting feedback on my code from my colleagues, and I started learning a lot more, a lot faster, than on my own. In between those internships, and even when I had one of those internships, I kept working on more personal projects. That was partly because it was just fun, and partly because I wanted to build more skills. And after all that, I started working on my technical interview skills, and eventually I got a job at Google as a full-time software
engineer.最终我成为了谷歌的一名全职软件工程师 So if you’re just getting started with programming如果你刚开始编程 or if you’re a complete beginner或者你完全是个新手 What should you do exactly?你到底应该怎么做? I’d recommend the following four steps我推荐以下四个步骤 First of all, you should learn the basics of programming首先 你应该通过这些互动网站 through one of those interactive websites.学习编程的基础知识 I personally recommend Codeacademy,我个人推荐Codeacademy but I also heard that freeCodeCamp is also pretty good但我也听说freeCodeCamp也很不错 If you’re not sure which language to get started with,如果你不确定开始使用哪种语言 I do recommend either Python or JavaScript我建议使用Python或JavaScript after that, start working on a personal project之后 开始做一个个人项目 It could be a website, an app,它可以是一个网站 一个应用程序 or automating a simple task.或自动化一个简单的任务 As you work on your project,当你在做你的项目时 keep learning more through those interactive websites.要通过那些互动网站学习到更多技巧 And for more advanced topics that有些更高级的主题 those websites don’t cover,这些网站不涵盖的话 I’d recommend Lynda.com and Udemy.我推荐Lynda.com和Udemy Step number 3. As you work on your personal project第三步 在做个人项目时 I think one important aspect here is the community aspect.我认为社群因素非常重要 If programming is something that’s totally new for you,如果你完全不懂编程 it’s probably gonna be really hard.学习起来可能会非常困难 And so, it’s really important for you to be able to get help from others因此 通过在线或离线社群 through online or offline communities.获得他人的帮助是非常重要的 So try using websites所以试着使用一些网站 like Stack Overflow, Meetup.com,像Stack Overflow Meetup.com and Facebook groups and events to find relevant communities.以及Facebook上的群组和活动来寻找相关社区 Step number 4!第四步! 
Try getting an internship or a job, ideally a paid one试着找一份实习或工作 最好是带薪的 Once you do, you should be able to learn even more quickly, because找到后 你应该能学的更快 原因是 You’ll be able to get some feedback from your colleagues on your code.你能从同事那里得到一些自编代码的反馈 And those are the four steps I would personally use to learn to code today.这就是我如今会用来学习编码的四个步骤 If you have extra time and money to spare,如果你的时间和金钱足够充裕 going through a coding bootcamp,参加一个编码训练营 Or even getting a degree from a university might also be a good option或者拿到一个大学学位也挺不错 Okay, if you’re curious about好的 如果你好奇 a more general strategy I used for getting a job at Google我就职谷歌使用的更完整策略 there’s a video about that,我有一个这样的视频 and if you’re wondering which programming languages you should learn,如果你想知道你该学习哪种编程语言 I have a video about that too我也有一个这样的视频 And let me know in the comment section below about在下面的评论区告诉我 what kind of videos I should make in the future.我以后应该做什么样的视频 I’m YK from CS Dojo, 我是《开发大师》的YK and I’ll see you in the next video! 下个视频再见!
  • 2021-08-24#0 什么是数据分析?Okay, so artificial intelligence, machine learning, data mining, data analysis,人工智能 机器学习 数据挖掘 数据分析 clustering classification, data pre-processing,聚类分析 数据预处理 big data.还有大数据 00 – 数据分析入门 电脑狂热 It’s hard to go anywhere now without hearing about AI and machine learning and data,现在到处都在谈论AI 机器学习和数据 data particularly, it’s everywhere.尤其是数据 到处都是 Researchers have suggested that every two years we generate more data than ever existed before研究表明 我们每两年就会制造出比以往更多的数据 So the amount of data is doubling every two years.因此数据量每两年就会翻一番 The fact is actually, you know astronomical amount of data,数据量巨大是事实 but the thing is of course that, these data doesn’t necessarily mean anything.但当然了 这些数据不一定有意义 In fact, you can create tables of data其实你也可以创建大量数据 but unless you understand what’s in them and what they mean,但如果你不了解它们的构成和意义 you haven’t got any knowledge, right?你就没学到什么知识 对吧? So there’s a distinction between having data and having knowledge.所以有数据和有知识 二者是存在区别的 So very well saying, yes, as a species, we’re producing a huge amount of data,没错 我们人类制造出了大量数据 but actually a lot of it doesn’t get used.但大部分数据其实并未得到使用 a lot of it sits there on a hard disk, waiting for someone to look at it.大量存储在硬盘上的数据在等着人们来探索它 And that’s kind of what we’re talking about here.这就是我们要探讨的这类数据 If we want to extract knowledge from data,如果要从数据中提取知识 we’re going to need some tools and processes to do this in a formal way,我们就需要一些工具和过程来正式地实施 and that’s that’s what data science is, right?这就是数据科学 And things like machine learning and AI have a place within it而机器学习和AI之类的技术在这一领域就有着一席之地 So perhaps if you do this for your job,如果这是你的工作 then data analysis is going to be useful for you.那么数据分析将能助你一臂之力 Maybe your company’s generating data and you want to analyze this data?也许你的公司生成数据而你要对其作出分析 But on the other hand, perhaps you’re just a consumer, and companies are using data on you.也有可能你只是一名消费者 而公司在对你使用数据 They’re generating data on you, and actually they’re profiting from data on you.他们拿你去生成数据 而且还拿你的数据去盈利 These are 
sometimes life-changing decisions that are being made on your data.他们有时会用你的数据做一些改变人生的决策 And so it’s empowering to know how this process works.所以你最好能明白这个过程 And I have a very simple example which you might even do yourself.我举一个非常简单的例子 你甚至可以自己动手试试 Suppose you go online to book some flights for a holiday,假设你上网去订一个度假的航班 and then you decide that actually two flights via an intermediate airport你发现 订两个中转航班 is cheaper than a single flight, right?比订一个直达航班要便宜一些 You’re doing data analysis. Say you’re taking lots of different data sources你这就是在做数据分析 你收集了大量不同数据源 and working out the optimal route.然后找出了最佳路线 And this of course happens automatically as well,当然这个过程也可以是自动的 depending on the flight website that you’re using.这要取决于你用的哪个航班网站 All right, so this kind of stuff you’re already doing it.你已经在做数据分析了 It’s just a case of trying to formalize this process.这就是一个将你订航班的过程形式化的情况 So what do any of the things I listed at the beginning mean?那么 我在视频开头列出那些名词是什么意思呢? Well, one problem is that everyone’s definitions differ slightly,每个人对此都有不同的定义 but also I think that a lot of these terms are used completely interchangeably.但我认为 大量这种术语是完全可以通用的 AI is the classic example.AI就是一个典例 So AI is everywhere, right? 
You can’t buy a product without it having been having AI added to it.AI到处都是 购买产品都要用到AI A lot of the time you see AI,但往往你看到AI这个词的时候 we’re actually talking about machine learning我们实际探讨的是机器学习 So machine learning is the idea that we’re training a machine to perform a task机器学习指的是 在没有显式编程的前提下 without explicitly programming it to do so.训练机器执行任务 A good example of AI that isn’t machine learning would be, let’s say a mouse in a maze,迷宫中的老鼠 是说明AI并非机器学习的一个好例子 where all you’re doing is telling it to turn left or right at random.你只需要随机告诉它左转还是右转 Not learning anything, it doesn’t understand what the maze is这只老鼠没有学到任何东西 也也不明白什么是迷宫 but it will eventually get to the end, right?但最后它还是会走出迷宫的 That’s a kind of rudimentary artificial intelligence that doesn’t involve learning anything.这就是一种不涉及学习的基本人工智能 Machine learning is about not giving it conditions,机器学习不是给出条件 not saying “if you’re here, turn left; if you’re here, turn right”.不是“如果你到这里就左转 到这里就右转” It’s just giving it examples and hoping it will learn to perform most tasks itself, right?而只是给出案例 希望机器能学会自己去执行大多数任务 So machine learning is a subset of AI, but they shouldn’t be used interchangeably.所以 机器学习是AI的一部分 二者不应通用 If we’re using machine learning, often what we’ll do如果要用机器学习的话 我们往往要 is we train it based on samples of data.基于数据样本进行训练 So we’ll have some existing data set that we’re trying to train on,所以要有一些用于训练的数据集 and we’re trying to use machine learning to either然后试着用机器学习 tease out information or make predictions on these data.来梳理个中信息或作出预测 The problem is that not all data is sort of made equal.但问题在于 数据质量不一 Some of its noisy and messy, maybe we don’t know what it is有的数据非常混乱 存在很多噪声 我们可能看不出它是什么 and don’t know whether we can apply a certain technique to it, right?也不知道能否用什么技术来处理它 And so we need to clean this data up.所以我们需要进行数据清洗 We need to take this data, understand what it is and extract some knowledge,我们要获取数据 了解数据 并且从中提取出一些知识 so that we can then apply these AI or machine learning techniques to it.然后才能对其应用AI或机器学习技术 So this combination of 
things that can take data and prepare it获取数据 以及为使用和理解它们 in a way that we can then use it or understand it, that’s data science.而做准备的整个过程 就是数据科学 There are quite a few ways we could do this data analysis right throughout this course.本课用到的数据分析的手段有很多 We could use R, we could use Python, we could use MATLAB. They all have their pros and cons包括R Python和MATLAB 这些工具各有利弊 We’re gonna use R because it’s free and it’s really good for statistical analysis我们将选择R 因为它免费 而且非常适合统计分析 It’s got loads of great libraries.还有大量的包 If you’re really familiar with Python, then maybe that’s what you want to start with for this kind of stuff.如果你很熟悉Python的话 也可以用它来入门 But we know we’re going to be working with R但在本课中 我们要用R We have our script area here where we can write scripts and run scripts.这里是脚本区域 可以写脚本和运行脚本 You can save them and then come back to them later.你可以保存脚本 回头再来编辑 Console where we’re going to be putting in, you know, specific commands.这里是控制台 用于运行一些命令 We have our environment, which is where all our variables and our data is held环境是储存和查看变量与数据的地方 and we can look at them there.可以看这里 And then we have plots, any plots, which you can do quite a lot of different plots in R, very versatile.还有图像 你可以用R画很多种类的图像 非常万能 That’s going to appear down here.图像会显示在这里 Okay, so you’ve probably got everything you need to get started with data analysis.有了这些 你就可以开始数据分析了 In my opinion, the best way to get into R is just to kind of have a go.在我看来 学习R的最佳方法就是上手试试 So it’s going to look at a few of the most obvious things that it does.我们将看看它的一些常用功能 It has a little bit of a learning curve only because it’s syntax is slightly unusual.R学起来有点费工夫 只因它的语法有些与众不同 If you can program you’ll be fine, but even if not, you should get there pretty quickly.如果你有编程基础 就没什么问题 但即使没有 你也能很快上手 Most of the time in R we’ll be using either matrices or vectors在R中 我们大部分使用的是矩阵 向量 or which are kind of a special case of matrices or maybe data frames.或矩阵的一种特别形式 或数据框 Data frames a really nice aspect of R,数据框是R中非常好用的一个对象 which you can kind of think of 
like a table that you might have in in Excel,你可以把它想成是Excel中的表格 except you’ve also got headings for your columns.但数据框是有列标题的 So let’s have a look at some of these things, and just a few of the things we can do with them我们先学习一下其中几个对象的使用 before we perhaps go into a little bit more detail in other videos.然后再在其他视频中详细学习 So for example, we might look at our variable X which I’ve created举个例子 看我创建的这个变量X and X is a sequence going from 0 all the way up to a few multiples of Pi,X是一个从0到Pi的几倍数的序列 which I used to create this plot.我用它画了这个图 That was only one line of code that produced that画这个图只用了一行代码 and I’ve used that to create my plot by essentially saying y equals sin(X),说明y=sin(x)之后 and then just simply plotting that.就能画出这个图了 If you wanna get a little bit more complicated, we can start looking at matrix data.如果你想更复杂点 我们可以考虑一下矩阵数据 So I created a CSV file with a Gaussian function in it.我创建了一个包含高斯函数的CSV文件 So essentially a two-dimensional array of values高斯函数就是一个二维数组 that get bigger in the center. Very straightforward.越靠近中心的值越大 很好懂吧 The CSV file is essentially a text file with commas separating those values,而CSV文件是一个用逗号隔开这些值的文本文件 very easy to read and write these out of Excel and other packages它用Excel和其他包来读写都很方便 and so you’ll often find data is passed around in this way,所以数据经常会以这种形式传输 at least moderately sized data, if it isn’t too, you know to it too huge.至少不太大的数据如此 I can load this in using my “read.csv” function.用“read.csv”函数可以导入这份数据 So I can say “namedata”.并将数据存在“namedata”名下 Now the arrow operator is essentially equivalent in R箭头运算符在R中的作用 for the assignment operators or equals.和赋值运算符等号基本相同 Equals will often work, but I tend to try and use this one. 
So “namedata”…等号通常也能行 但我喜欢用箭头 输入“namedata” I’m going to assign “read.csv” and the file is going to be “norm.csv”我要把”read.csv”这个函数赋给它 文件是”norm.csv” And I’ve got no header for this file,由于这个文件没有数据头 so I don’t want it to use the top row for the labels所以不需要把第一行作为列标签 So I’m going to say “header” equals “false”.所以“header”参数等于”false” And that’s loaded in “namedata”. And we can have a look,然后数据就存到“namedata”里了 我们可以看看 so I’m gonna click on “namedata” here.我点一下这里的“namedata” And if we click on it, you can see we’ve got点击它 你就能看到 the rows and the columns of our data in here.数据有多少行 多少列 We can look at individual elements in this array.我们可以看看这个数组中的单个元素 So we can say data at position three four,比如想看坐标为[3, 4]的数据 and that’s going to be the third row down and the fourth value across.这指的是第三行第四列的数据 We can also leave one empty and just have an entire row,我们也可以空出一个参数 or conversely, an entire column, like this.查看一整行或一整列 And so it’s very easy to take ranges of values.所以查看指定范围的值是很容易的 You’ve got a huge table of data selecting certain columns,这个大的数据表可以用来查询指定列 looking at certain columns, plotting certain columns.查看指定列和给指定列画图 This is one of the reasons why R is very popular.这便是R如此常用的原因之一 Quite often when you’re looking at data,往往我们在查看数据的时候 we’ll actually be looking at something called a data frame.查看的是一种叫数据框的东西 Now a data frame – I’ve got a load one up –我已经载入了一个数据框 is simply a… In essence, a table of values, but it won’t have to be the same type.它其实就是一个数据表 但数据类型无需一致 So in an array, normally they’ll all be floats or they’ll all be integers.所以数组中的值一般都是浮点数或整数 In a data frame, there can be different things,而数据框中的值类型可以有所不同 so you could have first and last name next to age, for example.比如你可以在年龄旁边写上姓和名 So I’ve just created a tiny little CSV file我刚创建了一个小的CSV文件 with some random people in it. 
So let’s load this up.里面有一些随机的人员信息 我们来载入看看 So I’m going to say “namedata”输入“namedata” assign “read.csv(names.csv)”赋值以“read.csv(names.csv)” And if I look at “namedata”, you can see that it’s got three columns,查看“namedata”数据集 你可以看到它有三列 it’s got firstname, surname and age,分别是“名” “姓”和“年龄” and five rows, and there’s five people in this dataset.有五行 表示数据集中有五个人 And then you can do just like I did before,然后你可以像我之前那样做 but now we can also index by the names of these columns.但现在我们也可以根据列名来查找 So I could say I want all of the first names for example,举个例子 如果我想知道所有的名 so I can say “namedata$firstname”就可以输入“namedata$firstname” and I can see all the different first names.然后我就能看到所有的名了 So you can start to look at this data set and more in more detail.你可以浏览数据然后逐渐深挖细节 Obviously, this isn’t absolute tiny data set, but you get the idea.显然这不是一个绝对小的数据集 但你应该明白我的意思 You could also look at individual instances, so we could say “namedata”.你也可以查看单个实例 先输入“namedata” And I want just the second row, for example, “namedata[2,]”.如果我只想看第二行 就输入“namedata[2,]” There we go, Bill Jones and he’s 18 years old.结果出来了 这个人叫Bill Jones 18岁 As we move through these videos, it’s going to be very common for us随着这些视频的学习 我们将学会 to load in datasets like this in this format.熟练地载入这种格式的数据集 and then start to process them based on these data frames.并且开始学会处理这些数据框 So perhaps an example, right? 
So let’s imagine you’re an online retailer,我举个例子吧 假设你是一名网络零售商 and someone comes into your shop and buy some thing.有人到你的店里买东西 And maybe they… you’re trying to understand what it is what they do, so that you can,你试着去了解他们的购买行为 这样才能 let’s say, send them emails to try and get them to buy more products,举个例子 才能给他们发邮件 吸引他们购买更多商品 or show them recommended products and things like this.或者给他们推荐商品 等等 So you want to try and build up a pattern of their behavior, right?你想给他们的行为建立模式 And all you’ve got is what they click on, what they add to their basket,而你掌握的信息是 他们点击了什么 添加了什么到购物车 and what they buy, right?以及购买了什么 So you’ve learned that they’re looking at these kinds of items and they look at these ones regularly.你知道他们浏览了这几种商品 以及经常浏览这些商品 And then sometimes they just buy something completely random seemingly,但有时他们会看似非常随机地购买商品 and that goes in their basket and gets bought straight away.把商品加入购物车然后直接购买 Maybe it’s a present right? So maybe it’s not tied to them as a person.可能他是在买礼物?所以这次购买行为可能与他本人并不挂钩 So you’re taking all of this data all of these purchases, all of these… products that they’re looking at,你把这些购买记录 和他们浏览过的所有商品记录了下来 and you’re turning this into a kind of picture of this person,把这些数据绘制成了此人的图像 and you’re clustering that person in with other consumers that bought similar things,把此人和其他购买相似商品的消费者归为一类 and trying to predict what they want to buy next, right?并试着预测他们接下来要买的东西 And that’s when you send them an email say “you should look at this one这时候你就可以给他们发邮件说 “你应该看看这个商品 because this one’s really good and you didn’t buy it last time, but you’ll definitely want to buy it this time”.上次你没有买它 但是它真的很好 这次你一定会想买的” So we’ve got some data we want to extract some knowledge.我们掌握了一些数据 想从中提取一些知识 What’s the first thing we do?第一步该做什么呢? We have to start to look at it我们必须开始浏览数据 and try and tease out some kind of information or analyze this data.尝试从中梳理出一些信息或作出分析 The data analysis is the idea of using statistical measures to try and work out what’s going on.数据分析是用统计手段把事情弄明白的过程 This is kind of a cycle. 
We’re going to analyze the data so we’re going to do a data analysis,这是一种循环 我们要分析数据 所以要进行数据分析 and perhaps sometimes just using statistics to analyze the data isn’t enough.但或许有时只用统计资料分析数据并不够 You can’t really learn everything about it.你无法因此而了解它的全貌 Yes, you can learn, you know, mathematically how it works,的确 你能了解到它的数学原理 but you might not understand about what it all means但可能无法理解它的全部含义 So visualizing the data can be really helpful.而对数据进行可视化可以助我们一臂之力 So what we’ll also do is we’ll visualize the data – visualization.所以我们要对数据进行可视化——数据可视化 So that’s going to be charting it, plotting it,数据可视化指的是对数据做表 画图 trying to work out trends and links between different variables and things like this.找出趋势和不同变量之间的联系 等等 And these are kind of being back and forth, right,这种事情可以反复做 you could do both of these things numerous times and work out what we’ve got, right?可以在重复多次之后找出我们想要的东西 So you’re gonna do something like this.这是你要做的 And then what we’re going to do is we’re going to preprocess the data.然后我们要进行数据预处理 Often you’ll be finding your recording much more data than you actually need. 
Right.有时候你会发现 你记录的数据比你所需要的多得多 This is certainly true of an online shop.网店就经常出现这种情况 I’m going to be looking at a lot of products,我浏览了很多商品 but I don’t end up buying and I was never really going to buy.但最终并未购买 而且其实我本就并无购买的意愿 I know maybe a pipe dream.可能会在梦中买吧 And they’ve got a sort of weed out this information店主把这条信息清除掉了 to work out what it is that they might actually better convince me to buy right?这样才能真正找出那种可以说服我购买的东西 So this is going to you going to preprocess data and remove a nonsense,这就是预处理数据 删除无意义数据 and drill right down to the stuff that’s really useful.然后深入挖掘出真正有用的东西 So this is preprocessing.这就是数据预处理 And this is going to be a kind of cycle of analysis and visualization and preprocessing,数据分析 数据可视化和数据预处理可以构成一个循环 and we can repeat these things and then we can really drill down and whittle down our data然后我们就能深挖 尽可能将数据压缩到 into the most usable sort of core of knowledge that we can.只剩下最有用的核心知识 And get the most out of it.然后充分利用它 Now it may be that just analysing the data is enough, right?可能只分析还不够 You’ve now sort of you’ve obtained some knowledge.你现在已经学会了一些知识 You kind of understand what the trends are.了解了趋势是什么 and maybe that was all you wanted to do. 
That’s sometimes the case.觉得到此为止就行 有时候的确可以这样 Maybe actually what we want to do is take things a little bit further但我们可能其实想进一步 We’re going to use machine learning or modeling打算用机器学习或建模 to try and model this system and predict what’s going to happen next.来模拟该系统 预测下一步 So for example in the case of an online shop,比如在网店案例中 we might want to start predicting what people are going to buy next我们想开始预测客户接下来的购买意向 and if we can do that, that’s when we can send out these emails如果我们能成功 就可以给他们发送邮件 or flag things in their recommended items and get many more sales.或标出给他们推荐的商品 增加销售量 As an example, let’s imagine that someone has spent a lot of time looking at DIY tools.举个例子 假设一个人花了很多时间浏览DIY工具 I’ve, you know, recently moved house I spent a lot of time doing DIY,我最近刚搬了家 也是花了很多时间去DIY and I’m always trying to buy new tools because it just seems like a good idea.我向来喜欢买新工具 因为觉得这样很好 So, you know, maybe I buy a certain kind of saw, and then you know a few months later,可能我会买一款锯子 几个月后 they’re starting to recommend me a slightly different kind of saw that serves a slightly different purpose店家就开始推荐我另一款略有不同的锯子 它的功能也略有不同 that suddenly I definitely need to be doing and I think, uh yeah, maybe I will buy that而且我应该会用得到 我想 我可以买 and then the end I have 10 saws and I don’t know how to use any of the saws.最后我就会有10把锯子了 可是我一把都不会用 But you know, the retailers job is done.但零售商的工作的确是做完了 It’s if we want to extract this data, we’re going to use machine learning or modeling如果要提取这些数据 我们就要用机器学习或建模 to put to model this system and make predictions.来模拟系统 作出预测 Now so for example, we could cluster the data together.比如可以给数据做聚类 We could link my purchase history with similar people.可以把我的购物历史和相似的人的购物历史相联系 What are they buying? Can I be tempted to buy those things as well, right?他们买了什么?我也会被这些商品吸引吗? Maybe I’m very different from someone else,可能我和其他人非常不同 and so it’s not a good idea to recommend me certain products所以给我推荐一些商品的做法并不合适 because I’m unlikely to buy those things.因为我不会去买它们的 Perhaps use a different example. 
In the medical domain,举个别的例子 医学界往往会 it’s quite common to classify people into kind of risk categories,把人们分为不同的风险种类 so that we can maybe use preventative treatments.以便对他们应用不同的预防治疗方案 So every time I go to a doctor, they’re going to collect data on me, on…每次我去看病时 大夫都会收集我的数据 What’s currently on with me? And what was wrong with me before? and…比如我最近经历了什么事?以前得过什么病?等等 Combine that with with you know standard data将以上数据与标准数据结合 like how much exercise someone does, and you know their family history,比如锻炼量 家庭病史 and how what their stress levels are and things like this,压力水平 等等 We can combine all these things to make a prediction as to what they were at risk of in the future,将这些数据结合 就能预测对方将来是否存在健康风险 so you know, heart disease or something else like this.比如心脏病等 这能挽救一个人的生命 It could save someone’s life if you spot如果你及时发现有人存在患某种疾病的风险 that they’re at risk of a certain thing就能挽救对方的生命 and you can really advise that person to, you know, increase their level of exercise or alter their diet.也可以建议对方增加锻炼或改变饮食习惯 There are two other terms that we come across, you know a lot, right?我们还要学习另外两个知识点 你应该很清楚 So there’s data mining and big data.是数据挖掘和大数据 Now, I’m not really sure what data mining is, because I don’t think anyone is.我不是很清楚数据挖掘是什么 因为我觉得没人清楚 it’s a bit… it’s a bit of a buzzword它是…它是一个流行词 Really, what data mining is is a combination of preprocessing your data数据挖掘其实是数据预处理 and maybe using clustering to extract some knowledge from it.和用聚类提取知识的组合 So that’s our sort of… it’s a word that’s come to be used in place of those things.所以其实这个词是用来指代二者的 If someone says they’re doing data mining, that’s what they’re doing.如果有人说自己在做数据挖掘 那么他做的就是上述的事情 They’re preprocessing and extracting some knowledge from their data是数据预处理和从数据中提取知识 It’s a cool sounding word. 
You’re not actually “mining” anything, right?这个词听起来很酷 但你其实不是在真的“挖”东西 You’re just doing what everyone else does on data.你只是在做每个人都在对数据做的工作而已 Big data is the idea that maybe we collect a lot of examples of something, you know, a huge number,大数据指的是我们收集了某事的大量样本 海量样本 or each of our examples is quite complicated and it has a lot of variables.或每个样本都很复杂 包含大量的变量 In that case, the amount of data we’ve got is sort of unwieldy.这么说来 我们获取到的数据量就很难处理了 So I would argue, perhaps that big data is not data that you can run on your laptop.所以我认为 大数据不是你能在笔记本电脑上运行的数据 Like, you might be using cloud compute, infrastructure or certainly parallel processing而是要用云计算 基础设施或是并行处理 in some way to to preprocess and analyze this data.来预处理和分析数据 So exactly where the line, how big is “big”.那么大数据究竟有多“大”呢? I don’t know, but exactly where we draw the line in some ways is not really important,我也不知道 但是究竟有多“大” 这个问题并不重要 the idea is just that the amount of data we as a species are now producing但我们人类如今在产生的 more and more of our data is becoming big data.越来越多的数据 逐渐构成了“大数据” But you know exactly where the cutoff is doesn’t really matter.但你也清楚 这个边界并不重要 What is data? I’m pretty sure that’s data.什么是数据?我很确定那个是数据 Is this data, this picture? Or that data?这张手机上的图是数据吗?这份杂志是数据吗? Is this data? What is data?这张纸是数据吗?什么是数据呢?
  • 2021-08-2411/44 字符串格式化So previously, we saw how we could take a couple of strings在前面的学习中 我们知道了如何将多个字符串 and combine them together用加号运算符 by using that literal plus operator.把它们连起来 Now, we’ve already seen this slide and现在 我们已经看过这张幻灯片了 if you take a look at the code, in particular that fourth line,如果你看一下这个代码 尤其是第四行 what you’re going to notice there is there’s a lot going on,你就会注意到 这里做了很多事 that I’ve got that little plus right there in the middle,我在这中间用了一个小加号 with my string literal,连接这个字符串常量 and then I’ve got another plus here,然后这里有另一个加号 and another string literal,和另一个字符串常量 and another plus here,这里有又一个加号 and another string literal.又一个字符串常量 You know, this could get unwieldy pretty quickly,这样表达式很快就会变得很笨重 and this is not even这甚至还没有考虑到 taking into account that we might want to be calling capitalize,我们可能想调用首字母大写函数 or upper, or lower,全部字母大写或全部字母小写函数 or some of those other helper functions that we might have,或者我们的一些其它辅助函数 and our code is just going to我们的代码只会变得 keep getting longer, and longer, and longer.越来越长 越来越长 越来越长 So let’s try and simplify this a little bit.所以让我们试着把这简化一点点 This is where our format strings come into play.这就是我们的格式化字符串发挥作用的地方 Now, what we can do is that little output that we see up at the top现在 我们来看到最上面一行的output变量 where again we’re calling这里我们还是使用了加号 that plus sign that we’ve already seen,这我们已经讲过了 or we can streamline this by using placeholders.或者 我们可以使用占位符来简化掉它 Now, each one of these outputs that you see up here on the slide现在 你看到的幻灯片上的每一个output变量 is going to give us the exact same string.都会给出完全相同的字符串 We’re just going to do it slightly differently.我们只是用略微不同的方式来做这件事 So the first one,那么第一种方法 what we’re doing here is we’re going in and我们在这里要做的是 we’re putting in placeholders with those curly braces.在里面用这些花括号占位 Now, the way that this works is it’s going to be based on那么 这种占位的工作方式 the order in which we specify the parameters.会基于我们所指定的参数顺序 So that first one there,所以第一个占位在这 that’s going to be first_name in my case,在我的示例里它将是first_name变量 and that second
one,然后第二个占位 that is going to be last_name.将是last_name变量 Now, if we wanted to specify it,现在 如果我们想指定占位符对应的参数 what we can do instead我们可以换一种方式 is we can use the zero, and the one就是可以用0和1 which then allows us to specify来让我们分别指出 the first and the second respectively.第一和第二个占位符指向的字符串变量 Remember, that counting will start with zero,记住 计数是从0开始的 so that zero is going to be the first item,所以0将是第一项 and the one is going to be the second item.1将是第二项 Now, in my example up here,现在 在我这里的示例中 it’s not going to make a difference,第二种方法不会带来什么改变 that both of them are the exact same.两种方法的结果是完全相同的 But if I need to potentially reuse但是如果我可能需要在其它地方 the exact same string somewhere else复用完全相同的字符串 or maybe I just want to document it,或者我只是想记录下 show hey, this is going to be the first,表明 嘿 这将是第一个字符串 this is going to be for the second,这将是第二个字符串 this is going to be for the third,这将是第三个字符串 then I could go ahead and put in that zero, the one,那么我就可以直接在花括号中插入0 1 and then maybe a two later on as well.然后也许再插入一个2 Now, the last example that I want to highlight here,好 在最后一个示例中我想高亮这里 and I want to make sure that I point out the fact,我必须要指出这一点 that was not what I wanted,不是用这个标记 I wanted that, there we go.我要用这个 高亮它 I want to make sure that I point out the fact that我要明确指出的一点是 this is only available in Python 3.这个方法只在Python 3中可用 So if you’re doing anything that needs to run in Python 2,因此 如果你的工作需要在Python 2下运行 this last example is not going to work,最后一个示例将无法运行 but it will work in Python 3.但它可以在Python 3中运行 That is where I put in f right at the very beginning,这里我在开头输入了一个f f being for format,f表示格式化 and now what I’m able to do and I love this functionality,那么现在我就可以用这种方式 我喜欢这一点 is I’m able to now just use my variable names我现在可以直接将我的变量名 right in line with everything else.嵌入到其它部分代码里 This is my preferred method每当我做字符串连接时 whenever I’m doing string concatenation,这是我的首选方法 because it’s nice, it’s clear, it’s self-documenting,因为它友好 清晰 自文档化 you always want your code to be self-documenting,你总是希望你的代码是自文档化的 and when somebody comes 
back当有人回来看代码 assuming that they understand the little f at the beginning,假设他们理解开头的小f的话 it’s very easy for them to go,他们读代码就很容易 “Oh, that’s going to be my first name,“噢 那将是我的名 and that is going to be my last name.”那将是我的姓” Let’s see how all of this works下一个小视频就让我们看看 inside of code in our next little video.所有这些是如何在代码中工作的吧
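The slide itself is not reproduced in this transcript, so the following is only a minimal sketch of the four approaches described above. The function name `greeting_variants` and the exact "Hello, ..." text are assumptions made for illustration; the techniques (`+` concatenation, `str.format` with empty and numbered placeholders, and f-strings) are standard Python. Note that f-strings require Python 3.6 or later.

```python
def greeting_variants(first_name: str, last_name: str) -> list:
    """Build the same greeting four ways (hypothetical helper, not from the video)."""
    # 1. Plain concatenation with the + operator: works, but gets long quickly.
    concatenated = "Hello, " + first_name + " " + last_name

    # 2. Empty {} placeholders are filled in the order the arguments are given.
    positional = "Hello, {} {}".format(first_name, last_name)

    # 3. Numbered placeholders pick arguments by index, counting from zero.
    indexed = "Hello, {0} {1}".format(first_name, last_name)

    # 4. An f-string (Python 3.6+) embeds the variable names directly.
    f_string = f"Hello, {first_name} {last_name}"

    return [concatenated, positional, indexed, f_string]


if __name__ == "__main__":
    for text in greeting_variants("Christopher", "Harrison"):
        print(text)
```

All four entries of the returned list are the same string, which matches the point made in the video that the placeholder styles only differ in how the code reads. Numbered placeholders also let the same argument appear more than once, for example `"{0} {1}, {0}!".format(a, b)`.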
  • 2021-08-24数据分析We’re going to try something a little bit different on Computerphile today.今天我们将在《电脑狂热》节目里做些不同的尝试 We’re going to do a series of a few videos linking together,我们打算做一系列相互关联的视频 all about data analysis, big data, data mining, that kind of stuff.内容涉及数据分析 大数据 数据挖掘等等 So I guess this idea我想 这个话题 maybe is quite a broad topic.涉及的范围可能相当广 We can’t cover it in detail in one video我们无法在一个视频里讲得明明白白 and also some of the videos will link together.而且其中一些视频会相互关联 So maybe we have a little series on this所以我们也许会就此推出一个小系列 we can start to really learn about some of this in more detail这样就能更深入地了解其中的一些内容 the idea being that by the time you’ve watched all the videos目标是等你看完所有视频之后 You’ll have really a good idea of how to analyze data你可以更清楚地知道如何分析数据 what you can do with data because data isn’t going away.能用数据做什么 因为数据不会消失 There’s a lot of it about.数据无处不在 I spent lots of time doing DIY, trying to buy new tools我花了很多时间做DIY 买新工具 because it seems like a good idea因为这听着就很棒 So, you know, maybe I buy a certain kind of saw所以我可能会买某一种锯子 and then a few months later然后过了几个月 they’re starting to recommend me a slightly different kind of saw他们开始给我推荐一款略有不同的锯子 that serves a slightly different purpose它的用途也略有不同 that suddenly I definitely need, and I think…突然间 我觉得我确实需要它 maybe I will buy them and have 10 saws我想 我可能会买下它们 这样就有10把锯子了 and I don’t know how to use them但我却不知道如何使用它们
  • 2021-08-2410/44 字符串示例Strings Code字符串代码 So let’s take a look我们来看一下 at some of those neat little string functions that we saw previously.一些之前看过的简洁的字符串函数 So I’m going to start off by just simply typing in首先呢我从简单的语句开始 first_name equals Christopher,first_name = 'Christopher' we’ll just get that stored,我们先把它存起来 and then last_name equals Harrison.然后是 last_name = 'Harrison' Perfect. Now, like we have already seen,好极了 现在 正如我们所见 we can concatenate things together by using that plus sign.我们可以通过使用加号将字符串连接起来 So if I say “print” and I say “first_name”,因此 如果我输入print以及first_name and by the way,哦对了 a little bit of IntelliSense functionality for you.介绍一个IntelliSense的小功能 If you see the word that you want already highlighted,如果看到你想输入的词已经高亮显示了 all that you need to do is hit “Tab”你需要做的就是按下Tab键 and that will complete it out for you.编辑器将会自动为你补全这个词 It’ll save you a little bit of typing.这将帮你节省一点码字的时间 So if I say first_name plus last_name,因此如果我输入 first_name + last_name save this,保存一下 and then I go ahead and run this.然后继续并运行 Oops, there we go.哦写错了 好了 Make sure we’re in the right spot.确保是正确的路径 You’ll notice that it prints out Christopher Harrison.你会看到它输出的是 Christopher Harrison Now, if I update this,现在 我更新一下 let’s go ahead and put in our print,然后放入print语句中 and then we’ll say ‘Hello’然后我们希望输出 Hello first name, a space, and then our last name.然后是名字 一个空格 以及姓氏 And again I’m using that Tab to help me out with the auto-complete.我再次使用了Tab键来自动补全 Now, what you’re going to notice is that it gives us现在你会发现 它会输出 that “Hello, Christopher Harrison” with the spaces inside of there.Hello, Christopher Harrison 以及里面的空格字符 Now, if instead of printing it out like that,如果我不想这样输出 maybe I wanted to go ahead and bring that in from the user,也许我想让使用者来填入这些信息 and let me just comment out the code我来把这些代码注释掉 so that way it will still be in existence there.将其暂时保存在此 I can go ahead and say first_name equals我可以继续这样写 first_name = input and we’ll say “Please enter your first name.”在里面写 Please enter your first name: And then
actually before you do this,实际上 在这样做之前 I want you to notice right here,希望你能注意下这里 that little last name and those little green squiggles,这个last_name以及它下面绿色的小波浪线 yes that is the technical term,这是个专业术语 underneath there, that’s Visual Studio Code letting me know hey,下面这个东西是Visual Studio Code试图告诉我 there’s something not quite right here.你这里代码敲得有点问题哦 If you ever see that,如果你看到这个 you want to pause,你需要停下来 take a look at your code,检查下你的代码 and see if there’s potentially a mistake.看看是不是哪里写错了 In my case, the mistake is that I haven’t declared last_name yet.我这里的错误是 我还没有定义last_name这个变量 Remember, we had it up here,记住 我们在前面这里写了 but we’ve commented it out.但又注释掉了 Remember what Susan taught us earlier,记住Susan之前教过我们的 that when you comment out a line of code,如果你注释掉了一行代码 that line is not going to run any longer.这行代码就不会再执行了 So let me go ahead and enter in our last name here.所以我再在这里加入last_name变量 It will say,这里写 “Please enter your last name:”Please enter your last name: Now, let’s go ahead and run this and we’ll see what happens.现在我们再来运行一下程序 看看会发生什么 So it asks me “Please enter your first name:”. Christopher.它提示Please enter your first name: 输入Christopher “Please enter your last name:” Harrison.Please enter your last name: 再输入Harrison Hit “Enter” and now it tells me “Hello Christopher Harrison”.按下Enter键 它会输出Hello Christopher Harrison If we want to go ahead and capitalize everything,如果我们想进一步把所有单词首字母都变为大写 we can go ahead and say capitalize and capitalize.我们可以在这里加入capitalize函数 然后这里也是 Again, we’ll save that.再次保存一下 And now, let’s come back and re-run our code.然后回过头来 再次运行程序 What’s my first name? 
I’m going to ”shout“ it.我的名字?我要”大声喊”出来 CHRISTOPHER.输入CHRISTOPHER What’s my last name?我的姓氏?输入HARRISON You’ll notice the results here,你会看到这里的结果 that we wind up getting that我们最后得到的 Christopher Harrison with the却是Christopher Harrison just the capital C and the capital H,只有首字母C和H是大写的 because once again, we’ve used that capitalize from before.因为我们再次用到了前面提到的首字母大写函数capitalize Now, if we want to do things like lower, and upper, and so forth,如果我们还想将字符全都小写或者大写 等等 all of those are available to us inside of Python.我们都可以在Python中找到这些函数 Now, if you want to get in and see some of现在如果你想继续深入并 the other things that you could potentially do with Strings,了解一些可用用得到的字符串有关的知识 you can take a look at the docs inside of Python.你可以查阅Python的内部文档 Those are some of the more common ones你会在其中发现一些 that you’ll find yourself reaching for常见的函数 when you get in and start working with Strings inside of Python.而这一切都从与Python中的字符串认真打交道开始
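The concatenation and capitalize steps from this demo can be sketched roughly as follows. This is a minimal sketch, not the code shown on screen: the `greet` helper is a hypothetical name introduced here, and the names are passed in directly instead of being read with `input()` so that the snippet runs without a prompt.

```python
def greet(first_name, last_name):
    """Build the greeting with + concatenation (greet is a hypothetical helper).

    capitalize() uppercases the first character and lowercases the rest,
    which is why "shouted" input comes back as Christopher Harrison.
    """
    return 'Hello, ' + first_name.capitalize() + ' ' + last_name.capitalize()


first_name = 'Christopher'
last_name = 'Harrison'

# Plain + adds no separator, so the two names run together.
joined = first_name + last_name
print(joined)

# Simulates typing the names in all caps at the prompts.
shouted = greet('CHRISTOPHER', 'HARRISON')
print(shouted)
```

To take the values from the user as in the demo, the two arguments could be replaced with `input('Please enter your first name: ')` and `input('Please enter your last name: ')`.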
  • 2021-08-24 Practical Machine Learning with Python #18: Applying our K Nearest Neighbors algorithm
    What is going on everybody? Welcome to part 18 of the Machine Learning with Python tutorial series. In this tutorial, we’re going to take the K Nearest Neighbors algorithm that we wrote, which appears to be working, and test it on some real-world data. We’re going to use the exact same dataset as before, the breast cancer dataset. Then, when we get our accuracy back, we’re going to compare it to the scikit-learn accuracy to see if we did about the same. What I want you to think about is: should we get identical or almost identical results, or will the scikit-learn classifier do much better than us under the same parameter, let’s say k=5? Think about that as we go.
    So the first thing we’re going to do is clean up some stuff. We’re going to get rid of this information here, and we’re going to get rid of the Matplotlib stuff. We are not going to be graphing; there are way too many dimensions for that. Also, we still have numpy imported. After collections, we’re going to add import pandas as pd, and we’ll also import random: pandas so we can load in the dataset, and random so we can shuffle the dataset, because we’re not using scikit-learn at all here. We’re doing this ourselves from scratch. Okay, except for the pandas part; that’s fine, or it would take way too long. But the algorithm is ours. Okay. Anyway. No one is amused. Anyway, we’ll get rid of that too, so it’s just the function and the imports.
    So here, the first thing we’re going to do is df = pd.read_csv( ). Oops, csv. And don’t forget the file name; let me just copy and paste. It’s “breast-cancer-wisconsin.data”, and don’t forget the “.txt” like I did that one time. Now we’re going to do df.replace; of course, just like before, we get rid of the question marks and replace them with -99999. Now that you understand K Nearest Neighbors, hopefully you understand what I was explaining before about that being a significant outlier: that distance is quite large. So chances are, under these circumstances, the only time something would compare to something like that is if they shared a missing data point. Anyway, we’ll keep it there. Oh, and we need inplace=True. Now we’re going to df.drop, and we’re dropping the ['id'] column, for the same reason as before: it’s a worthless column. If you recall, accuracy went down to like 56 percent or something, or was it 51? I can’t remember. It’s very close to, you know, a coin toss. So, a big deal there.
    full_data, we’re going to say, is df.astype(float).values.tolist( ), and I’m doing this for a reason. This dataframe, if I print(df.head( )), and I’ll just comment this out for now, hopefully we’ll see what I’m trying to show you. I’m not seeing it, but it exists. For some reason, some of these were coming through in quotes. Maybe because I’ve updated, it won’t happen, but I’m pretty sure it will. So we just want to make sure that we’ve converted to float. Everything in this dataframe ought to be an int or a float. As it happens, everything here will be an int, but if you want to reuse this code, it would most likely need to be float. So anyway, we’re going to convert it to float, and then .values.tolist( ).
    So now we’ve got the data. Now we’re going to shuffle the data, and keep in mind, in this case we can shuffle the data because what we’ve done is convert it to a list of lists. So for example, let me just print(full_data); I’ll do the first 10. I think I hit run. Here we go. Right, OK. So as you can see, there are the first elements, and keep in mind the 2 is, if I recall right, benign and a 4 would be malignant, but I don’t see a 4 at the moment. Just let me do this real quick. You don’t have to follow this; I just want to see. Yes. So converting it to the list here, you can see that this one is in quotes. It’s been treated as a string for some reason, so this column, for whatever reason, is treated as a string. Probably because it had a question mark in it? But then again, I don’t know, because it’s been replaced. I really don’t know why it’s doing that. But anyway, that’s why we’re saying astype(float).values.tolist( ). So anyway, there’s our data.
    So, at this point, we can shuffle this data, and we are not losing the relationship of the features to the label. It’s all part of the same list, right? So we can shuffle this and not lose anything. So now we’re going to say random.shuffle(full_data). And just to show it, let’s print(full_data[:5]), and then we’ll print full_data[:5] again after 20 pound signs, just to exemplify something. I just wanted to show that shuffle applies in place, and you do not have to redefine anything. The first one starts with 5, 1, 1, 1, 2 and this one is 5, 2, 3 and so on, so the shuffle works. That was something that always confused me initially. I would always try to redefine the variable, like full_data = random.shuffle(full_data). That’s not how it works.
    So we’ve shuffled the data now, and this is going to be our version of train/test split, in really high-quality code. We’re going to say test_size = 0.2, then train_set = {2: [ ], 4: [ ]}, and then test_set = {2: [ ], 4: [ ]}; we should just copy this 4-colon-empty-list. Anyway, train_set, test_set, and then we’re going to say train_data = full_data[:-int(test_size * len(full_data))]. Oops, not parentheses, brackets. So we’re multiplying the whole length by test_size, 0.2. We’re using that to create an index value, and we’re slicing based on that index value. We’ve converted it to an int so it’s a whole number and all that fun stuff. So we’ve done that. Let’s copy this and paste, and now, rather than colon-minus, it would be minus-int and then the colon, to the end. So the first one would be everything up to the last 20% of the data, and this one, which we need to rename to test_data, would be the last 20% of the data. Okay?
    So now we’ve shuffled the data and sliced the data, and what we need to do is populate the dictionaries, because we built the function to want a dictionary. Populating them is super quick and easy. We’re going to say for i in train_data (we could make a one-line for loop here; we really ought to, but I’m not going to): train_set[i[-1]]. And what are we doing here? We’re saying train_set[i[-1]], where i[-1] is the last element in those lists. Remember, the last column is the class column. That’s why we’re using negative one; that’s the last value. So that is either a 2 or a 4, right? And recall 2 is benign, 4 is malignant. That’s how we’re identifying which key in the dictionary it should be a part of. So train_set[i[-1]].append(i[:-1]). Now we’re appending lists into this list, and each appended list is the elements up to the last element. Again, you wouldn’t want to have one of the attributes being the class, because then you would most likely get it right every time. K Nearest Neighbors actually might not, but yeah, you don’t want to do that.
    So now we’ve done that. Now we need to do basically the exact same thing for the test data. So let’s take this, copy, paste, change train_data to test_data and train_set to test_set, and you’re good. Again, you could make this one line, but I didn’t want to do that simply because of the i[-1]; that whole thing is probably kind of confusing already. So anyway, we’re done with that. Oops, what happened? Come down here. So we’ve populated our dictionaries. What’s left? Really nothing. We just need to pass the information through K Nearest Neighbors.
    So basically, let’s measure. We’ll say correct = 0 and total = 0, and we’re going to create a simple counter here. We’re going to say for group in test_set, so for each of these, 2 and 4, and then for data in test_set[group]. So data is just that list of features, right? That’s what we’re about to feed through as predict; predict is these lists from the test set. And as you might be able to guess, what we’re going to pass through as data, which goes here, where we iterate over every single point and calculate the distance, is going to be the dictionary from train_set, okay? So for data in test_set[group], we’re going to say vote = k_nearest_neighbors(train_set, data, k=5): we pass train_set, then that data, which is the features, and k=5, simply because if you look at the scikit-learn documentation for K Nearest Neighbors, they’re using a default value of 5, so we’re going to copy that.
    Then we’re good. All we have to ask at this point is whether we were right or wrong: if group == vote. The group is the one the data came from in the test_set, and for the test set we know what the answer is. So if that group is equal to the vote that we got from our K Nearest Neighbors classifier, congratulations: correct += 1 for you. Either way, what we also need to do is total += 1.
    Okay. So now we’re basically done. Now we just print, say, 'Accuracy:', and then accuracy is just correct/total. Let’s save and run that and see if we get any errors. Oh, we shouldn’t be printing this out. Oh, this is disgusting. OK, it went pretty quick anyway. Accuracy: 0.978, so 97.8% accuracy. Boom, look at us. OK, so we’ve applied it, and now what we want to do is compare. Let’s run it one more time without the nasty output. So we ran it again: 95.6% accuracy. OK, so now what I want to do is compare this to scikit-learn, so we’re going to do that. And then also we’re going to calculate confidence, and we’re going to do that in the next tutorial. If you have any questions, comments, concerns, whatever, up to this point, feel free to leave them below. Otherwise, in the next tutorial that’s what we’re going to be doing. As always, thanks for watching, thanks for all the support and subscriptions, and until next time.
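The shuffle, slice, populate and scoring steps described in this tutorial can be sketched roughly as below. Several parts are assumptions made only so the snippet runs on its own: the `k_nearest_neighbors` function is a stand-in written for this sketch (the series' own version comes from an earlier episode and is not shown here), `train_test_dicts` and `accuracy` are hypothetical helper names, and the rows are invented toy data with two features instead of the breast cancer file. It uses only the standard library (`math.dist` needs Python 3.8 or later).

```python
import math
import random
from collections import Counter


def k_nearest_neighbors(data, predict, k=5):
    """Stand-in classifier: vote among the k training points closest to predict.

    data maps each class label to a list of feature lists.
    """
    distances = []
    for group in data:
        for features in data[group]:
            distances.append((math.dist(features, predict), group))
    votes = [group for _, group in sorted(distances)[:k]]
    return Counter(votes).most_common(1)[0][0]


def train_test_dicts(full_data, test_size=0.2):
    """Shuffle rows in place, hold out the last test_size share, group by class."""
    random.shuffle(full_data)  # returns None; the list itself is reordered
    cut = int(test_size * len(full_data))  # assumed to be at least 1
    train_data, test_data = full_data[:-cut], full_data[-cut:]
    train_set, test_set = {2: [], 4: []}, {2: [], 4: []}
    for row in train_data:
        train_set[row[-1]].append(row[:-1])  # last column is the class
    for row in test_data:
        test_set[row[-1]].append(row[:-1])
    return train_set, test_set


def accuracy(train_set, test_set, k=5):
    """Fraction of held-out points whose vote matches their true group."""
    correct = total = 0
    for group in test_set:
        for data in test_set[group]:
            vote = k_nearest_neighbors(train_set, data, k=k)
            if group == vote:
                correct += 1
            total += 1  # counted whether or not the vote was right
    return correct / total


# Invented, well-separated toy rows: [feature_1, feature_2, class].
random.seed(0)
full_data = [[random.randint(1, 3), random.randint(1, 3), 2] for _ in range(20)]
full_data += [[random.randint(8, 10), random.randint(8, 10), 4] for _ in range(20)]

train_set, test_set = train_test_dicts(full_data)
acc = accuracy(train_set, test_set)
print('Accuracy:', acc)
```

Because the two toy clusters are far apart, every held-out point is classified correctly here; on the real dataset the figure varies from run to run with the shuffle, as the 97.8% and 95.6% results in the video show.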
  • 2021-08-24 9/44 String concepts: working with strings
    Let’s get in and take a look at probably one of the most common things that you’ll be doing in programming, and that is working with strings. Now, when it comes to strings, and actually variables in general, in Python it’s relatively straightforward to take a string and store it inside of a variable. As a real quick aside, if you’re not already familiar with variables, variables wind up acting as placeholders inside of your code for some values. So in my case, first_name is going to wind up being 'Christopher'. Now, one of the things that makes Python unique is the fact that you don’t have to use any form of keyword or otherwise to declare a variable. You simply give it a name, set it to some value, and voilà, you’ve created a variable; that’s all there is to it.
    You’ll also notice my string literal over here on the right-hand side, and that I’m using single quotes. You can use single quotes or double quotes; it doesn’t matter which, but you want to be consistent. Personally, I really like using single quotes. I just think it reads a little bit better than double quotes. Maybe I don’t even have a good reason as to why. If you want to use double quotes, you’re not going to hurt my feelings at all. It’s your code. Just, again, be consistent.
    Now, if you want to take two strings and combine them together, you can do that by just using the plus operator. You’ll notice in my code that I’ve got my first name and my last name, and then I’m calling print and combining first_name and last_name together just by using that little plus sign that you see right there. Let me grab my pen again so I can circle it. That little plus sign will bring them together. That works with both variables and string literals. So if I want to bring together a string literal of 'Hello' and a couple of spaces along with my strings, I can also do that using the exact same plus operator.
    One last little side note on our variables: you’ll notice that we’re using a word, an underscore, and then a word. When you’re creating variable names, you want to make sure that they’re nice and clear, and if they need multiple words to be clear, the convention in Python is all lowercase, with that underscore that you see there.
    Now, if you want to modify a string, maybe convert everything into uppercase letters or lowercase letters, capitalize just the first word, or count all of the instances of a particular string (in this case the letter a), you can do that by using upper, lower, capitalize, and count, respectively, and down below you’ll see the results of each one of those. Then finally, if you want to bring all of that together, you can still do string concatenation just the way that we’ve seen previously, and even bring in values from the outside world. In this little code example, we’re going to read in the first name from the input window, so whatever the user types in will be brought in. We’ll do the exact same thing with last_name and print that out, and then you’ll notice that we can properly capitalize everything by using that capitalize function that you see there. You’ll notice in the results that if I type in Christopher in all uppercase and Harrison in all uppercase, and then call capitalize, it will capitalize just the first letter and lowercase the rest.
    So that’s how you can get in and start working with some neat little strings inside of Python. Now what I want to do is go off, do a little bit of a demo, and then we’ll come back and take a look at one last advanced configuration that you can do with strings.
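The four string methods named in this lesson (upper, lower, capitalize, count) can be tried on a short sentence like this. The sentence itself is an invented example for this sketch, not the one shown on the slide.

```python
# An invented sample sentence (an assumption, not the slide's text).
sentence = 'The dog is named Sammy'

upper = sentence.upper()             # every letter uppercased
lower = sentence.lower()             # every letter lowercased
capitalized = sentence.capitalize()  # first character upper, the rest lower
count_a = sentence.count('a')        # occurrences of the substring 'a'

print(upper)
print(lower)
print(capitalized)
print(count_a)
```

Each method returns a new value and leaves `sentence` unchanged, since Python strings are immutable. Note that `capitalize` also lowercases the S in Sammy, which matches the all-caps name example in the lesson.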
  • 2021-08-24 OpenAI GPT-2: An Almost Perfect Text Generator
    Dear Fellow Scholars, this is Two Minute Papers with Károly Zsolnai-Fehér. This is an incredible paper from OpenAI, in which the goal is to teach an AI to read a piece of text and perform common natural language processing operations on it, for instance answering questions, completing text, reading comprehension, summarization, and more. And not only that, but additionally, the AI has to be able to perform these tasks with as little supervision as possible. This means that we seek to unleash the algorithm that they call GPT-2 to read the internet and learn the intricacies of our language by itself. To perform this, of course, we need a lot of training data, and here the AI reads 40 gigabytes of internet text, which is 40 gigs of non-binary plaintext data, a stupendously large amount of text. It is always hard to put these big numbers in context, so as an example: to train similar text completion algorithms, AI people typically reach out to a text file containing every significant work of Shakespeare himself, and this file is approximately 5 megabytes. So the 40 gigabytes basically means an amount of text that is 8000 times the size of Shakespeare’s works. That’s a lot of text.
    And now, let’s have a look at how it fares with the text completion part. This part was written by a human, quoting: “In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.” And the AI continued the text the following way, quoting a short snippet of it: “The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science.” Wow! Now note that this is clearly not perfect, if there is even such a thing as a perfect continuation, and it took 10 tries, which means that the algorithm was run 10 times and the best result was cherry-picked and recorded here. And despite all of this, this is a truly incredible result, especially given that the algorithm learns on its own. After giving it a piece of text, it can also answer questions in a quite competent manner. Worry not, later in this video I will show you more of these examples and likely talk over them, so if you are curious, feel free to pause the video while you read the prompts and their completions.
    The validation part of the paper reveals that this method is able to achieve state-of-the-art results on several language modeling tasks, and you can see here that we still shouldn’t expect it to match a human in terms of reading comprehension, which is the question answering test. More on that in a moment. So, there are plenty of natural language processing algorithms out there that can perform some of these tasks; in fact, some articles already stated that there is not much new here, it’s just the same problem, but stated in a more general manner, and with more compute. A-ha! It is not the first time that this happens. Remember our video by the name “The Bitter Lesson”? I’ve put a link to it in the video description, but in case you missed it, let me quote how Richard Sutton addressed this situation: “The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.”
    So what is the big lesson here? Why is GPT-2 so interesting? Well, big lesson number one is that this is one of the clearer cases of what the quote was talking about, where we can do a whole lot given a lot of data and compute power, and we don’t need to insert too much additional knowledge into our algorithms. And lesson number two: as a result, this algorithm becomes quite general, so it can perform more tasks than most other techniques. This is an amazing value proposition. I will also add that not every learning technique scales well when we add more compute; in fact, you can see here yourself that even GPT-2 plateaus on the summarization task. Making sure that these learning algorithms scale well is a great contribution in and of itself and should not be taken for granted.
    There has been a fair bit of discussion on whether OpenAI should publish the entirety of this model. They opted to release a smaller part of the source code and noted that they are aware that the full model could be used for nefarious purposes. Why did they do this? What is the matter with everyone having an AI with subhuman-level reading comprehension? Well, so far we have only talked about quality. But another key part is quantity. And boy, are these learning methods superhuman in terms of quantity: just imagine that they can write articles with a chosen topic and sentiment all day long, and much quicker than human beings. Also note that the blueprint of the algorithm is described in the paper, and a top-tier research group is expected to be able to reproduce it. So does one release the full source code and models or not? This is a quite difficult question: we need to keep publishing both papers and source code to advance science, but we also have to find new ways to do it in an ethical manner. This needs more discussion and would definitely be worthy of a conference-style meeting, or more. There is so much to talk about, and so far we have really only scratched the surface, so make sure to have a look in the video description; I left a link to the paper and some more super interesting reading materials for you. Make sure to check them out.
    Also, just a quick comment on why this video came so late after the paper appeared. Since there were a lot of feelings and intense discussion on whether the algorithm should be published or not, I was looking to wait until the dust settles and there is enough information out there to create a sufficiently informed video for you. This of course means that we are late to the party and missed out on a whole lot of views and revenue. But that’s okay. In fact, that’s what we’ll keep doing going forward to make sure you get the highest quality information that I can provide. If you have enjoyed this episode and would like to help us, please consider supporting us on Patreon. Remember our motto: a dollar a month is almost nothing, but it keeps the papers coming. And there are hundreds of papers on my reading list. As always, we are available through Patreon.com/TwoMinutePapers, and the link is also available in the video description. Thanks for watching and for your generous support, and see you next time!

All videos and images on 译学馆 come from the internet; copyright belongs to the original creators.