Bubble sort in x86 (masm32), the sort I wrote doesn't work

I'm trying to write a bubble sort in x86 (masm32). The sort doesn't work. I've tested some of the code and it appears to be messed up in the compare and swap sections. For some reason, the compare function always assigns 2 to EAX. If I can figure out why I can get the program working. Thanks for your help in advance. .data aa DWORD 10 DUP(5, 7, 6, 1, 4, 3, 9, 2, 10, 8) count DWORD 0 ; DB 8-bits, DW 16-bit, DWORD 32, WORD 16 BYTE 8 .code ; Tell
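For reference, here is the compare-and-swap logic the MASM code needs to implement, as a plain C sketch using the question's ten values. One thing worth checking in the original data declaration: in MASM, 10 DUP(5, 7, 6, ...) repeats the whole ten-value list ten times, giving 100 elements; a plain aa DWORD 5, 7, 6, ... is probably what was intended.

#include <stdio.h>

int main(void)
{
    int aa[] = {5, 7, 6, 1, 4, 3, 9, 2, 10, 8};
    int n = sizeof aa / sizeof aa[0];
    for (int i = 0; i < n - 1; i++)
        for (int j = 0; j < n - 1 - i; j++)
            if (aa[j] > aa[j + 1]) {   /* compare */
                int t = aa[j];         /* swap    */
                aa[j] = aa[j + 1];
                aa[j + 1] = t;
            }
    for (int i = 0; i < n; i++)
        printf("%d ", aa[i]);
    putchar('\n');
    return 0;
}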

X86 Stack Parameter Offset Issue MASM

I'm new to MASM and I'm having a bit of trouble with using indirect offsets and passing arguments on the stack. I have a sorted array and its size that I am passing to a procedure via the stack. I want to print the first and last element of the array. I push the two arguments onto the stack: the offset of the first element of the array and the number of elements in the array. The number of elements in the array is correct, but when I try to access the next position on the stack [the arraysize
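A sketch of the layout in question, assuming 32-bit calling conventions with a standard EBP frame (push ebp / mov ebp, esp in the prologue). With the size pushed first and the array offset pushed last, the callee sees the pointer at [ebp+8] and the size at [ebp+12]:

#include <stdio.h>

/* Stack as the callee sees it after "push ebp / mov ebp, esp":
 *   [ebp]     saved EBP
 *   [ebp+4]   return address
 *   [ebp+8]   first argument  (offset of the array, pushed last)
 *   [ebp+12]  second argument (number of elements, pushed first)
 */
void print_ends(int *array, int arraysize)
{
    printf("first=%d last=%d\n", array[0], array[arraysize - 1]);
}

int main(void)
{
    int a[] = {1, 2, 3, 4, 5};
    print_ends(a, 5);
    return 0;
}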

Re-package an x86 installer for 64-bit Windows without access to an older Windows OS

I have an older 32-bit Windows program that needs to be repackaged for Windows 8. All I have is the installer, which does not run in compatibility mode, although the program itself does run in compatibility mode once installed (this was tested by users, not me), so the app itself does not need to be remade, just the installer. I currently don't have access to a 32-bit Windows machine. Is there a way to unpack and repackage this installer from Windows 8? I am not sure which installer this is, eve

X86 Understanding performance and behaviour of clwb instruction

I am trying to understand the read/write performance of the clwb instruction: how it varies when writing to a cache line versus only reading it. I expect that for the write case the time taken should be higher than for the read case. To test this, here is a small code snippet that I am running on an Intel Xeon CPU (Skylake), using non-volatile memory (NVM) for the reads, writes, and stores: /* nvm_alloc allocates memory on NVM */ uint64_t *array = (uint64_t *) nvm_alloc(pool, 512);
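A minimal sketch of the measurement, assuming GCC/Clang with -mclwb on a CPU that supports the instruction; nvm_alloc is the question's allocator and is not defined here. The fences keep the timestamp reads from drifting across the flush:

#include <immintrin.h>
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc */

uint64_t time_clwb(volatile uint64_t *line, int do_write)
{
    if (do_write)
        *line = 42;      /* dirty the cache line */
    else
        (void)*line;     /* read only: the line stays clean */
    _mm_mfence();
    uint64_t t0 = __rdtsc();
    _mm_clwb((void *)line);  /* write back (without evicting) the line */
    _mm_mfence();
    return __rdtsc() - t0;
}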

X86 In a full virtualization context, what happens on guest OS system calls?

I'm getting confused about protection rings, especially in the context of virtualization; can someone help demystify the following. Is my understanding correct: protection rings (or similar concepts for other, non-x86 chips) are enforced at the hardware (circuit?) level, in that if an operation of an instruction requires higher privilege than the current CPU mode, it triggers an interrupt. If the system call sets the privilege level to ring 0 for the code

X86 How does MSI-X trigger interrupt handlers? Is there a need to poll the chosen memory address?

I have a small kernel which is booted with UEFI. I'm using QEMU for virtualization. I want to write an xHCI driver to support USB keyboards in my kernel. I'm having trouble finding concise and clear information. I "found" the xHCI in my kernel. I have a pointer to its PCI configuration space. It is MSI-X capable. I want to use MSI-X, but I'm having trouble understanding how that works with the xHCI and USB. My problem is that normally osdev.org is quite informative and has the basis I

X86 How has CPU architecture evolution affected virtual function call performance?

Years ago I was learning about x86 assembler, CPU pipelining, cache misses, branch prediction, and all that jazz. It was a tale of two halves. I read about all the wonderful advantages of the lengthy pipelines in the processor, viz. instruction reordering, cache preloading, dependency interleaving, etc. The downside was that any deviation from the norm was enormously costly. For example, IIRC a certain AMD processor in the early-gigahertz era had a 40-cycle penalty every time you called a function

Can I load a resource library (DLL) from an x64 build in an x86 app?

I want to load unires.dll, which is used as the default Windows printer driver resource file. I can load unires.dll on Windows Vista x86; it's located in C:\Windows\System32\spool\drivers\w32x86\3. But now I use Windows 7 Pro x64, so the same-named unires.dll, which is located in C:\Windows\System32\spool\drivers\x64\3, cannot be loaded: with the following code, GetLastError() returns 193 (ERROR_BAD_EXE_FORMAT). Is it possible or impossible? I use Visual Studio 2005 Pro. I tried building both x64 and x86, but each of them failed. TCHAR l
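If only the resources are needed (strings, bitmaps, forms), a 32-bit process can map a 64-bit DLL as data rather than executing it, which sidesteps error 193. A hedged sketch; the path is the one from the question, and LOAD_LIBRARY_AS_IMAGE_RESOURCE requires Vista or later:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Map the 64-bit DLL purely for its resources; no code runs, so the
       bitness mismatch does not apply. */
    HMODULE h = LoadLibraryEx(
        TEXT("C:\\Windows\\System32\\spool\\drivers\\x64\\3\\unires.dll"),
        NULL,
        LOAD_LIBRARY_AS_DATAFILE | LOAD_LIBRARY_AS_IMAGE_RESOURCE);
    if (!h) {
        printf("LoadLibraryEx failed: %lu\n", GetLastError());
        return 1;
    }
    /* FindResource / LoadResource can now be used on h as usual. */
    FreeLibrary(h);
    return 0;
}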

X86 A faster integer SSE unaligned load that's rarely used

I would like to know more about the _mm_lddqu_si128 intrinsic (the lddqu instruction, since SSE3), particularly compared with the _mm_loadu_si128 intrinsic (the movdqu instruction, since SSE2). I only discovered _mm_lddqu_si128 today. The Intel intrinsics guide says this intrinsic may perform better than _mm_loadu_si128 when the data crosses a cache line boundary, and a comment says it will perform better under certain circumstances, but never perform worse. So why is it not used more (SSE3 is
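Both intrinsics do the same architectural job, a 16-byte load with no alignment requirement; any difference is purely micro-architectural. A small sketch to make the equivalence concrete (needs SSE3, e.g. -msse3):

#include <emmintrin.h>  /* _mm_loadu_si128 (movdqu), SSE2 */
#include <pmmintrin.h>  /* _mm_lddqu_si128 (lddqu), SSE3  */
#include <stdint.h>

__m128i load_both(const uint8_t *p)
{
    __m128i a = _mm_loadu_si128((const __m128i *)p);  /* movdqu */
    __m128i b = _mm_lddqu_si128((const __m128i *)p);  /* lddqu  */
    return _mm_xor_si128(a, b);  /* always all-zero: identical results */
}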

X86 kernel in C inline assembly

Hi, I once again have a problem. I'm trying to write a kernel using GNU inline assembly, but I'm having some trouble. My kernel file versuch.c looks like this: void kprintf( char hello[]) { /*char* video=(char*)0xb8000; for(int i=0;hello[i]!='\0';i++) { video[i*2]=hello[i]; video[i*2+1]=0x06; }*/ asm("mov %0,%%si"::""(hello)); //asm("mov 'a',%al;"); asm("call Schreibe;" "Schreibe:;" "lodsb;" "cmp $0x00,%al;" "je Schreibeende;" "m
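For what it's worth, the commented-out C version in the question is essentially the working approach; written out plainly (0xb8000 is the VGA text buffer in a standard BIOS text mode, 0x06 the attribute byte from the question), it avoids the inline-asm pitfalls entirely:

/* A minimal sketch of what the inline-asm loop is trying to do. */
void kprintf(const char *hello)
{
    volatile char *video = (volatile char *)0xb8000;
    for (int i = 0; hello[i] != '\0'; i++) {
        video[i * 2]     = hello[i];  /* character */
        video[i * 2 + 1] = 0x06;      /* attribute */
    }
}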

X86 Why would you OR a value with itself before a call to jnz?

I'm reading some code that does the following: OR al, al JNZ loc_123456 If I am reading this correctly, the OR just sets the flags, which are then tested to see whether the register holds a nonzero value. Why would you write it that way, and not CMP al, 0 JNE loc_123456 which is much more readable? I assume the hardware is doing something similar in each case...

X86 Does a processor that supports SSE4 support SSSE3 instructions?

I am developing a hardware platform that requires the SSSE3 instruction set. When looking at a processor such as the Intel Atom® x5-Z8350, the datasheet says it has support for SSE4.1 and SSE4.2. Would this allow software written for SSSE3 instructions to function? I believe this question is slightly different from that question, as it never explicitly says SSE4 is a superset of SSSE3. It only says AVX is a superset.
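Rather than inferring SSSE3 from SSE4 support, the feature bit can be checked directly at run time. A small GCC/Clang sketch using the compiler's CPUID helper:

#include <stdio.h>

int main(void)
{
    /* __builtin_cpu_supports queries CPUID at run time (GCC/Clang). */
    if (__builtin_cpu_supports("ssse3"))
        puts("SSSE3 available");
    else
        puts("SSSE3 not available");
    return 0;
}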

X86 How does an OS "deal in" virtual addresses and physical addresses

From what I understand from the Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3 (3A, 3B, 3C & 3D): System Programming Guide, paging is enabled on a per-core basis: either off or on. As such, how does an OS "deal" in physical addresses? Most immediately, how would it manage the page table structures and associated page frames? Of the two things I can think of: the OS would actually disable paging to perform these types of tasks, then re-enable it; or the OS would, during

X86 how to tell whether a CPU supports ECC?

I have an old PC powered by an Intel Core 2 Quad CPU @ 2.4 GHz (neither the BIOS nor Linux dmidecode tells me more than that); I can add that the CPU belongs to Family 6, Model 15, Stepping 7 (LGA775 socket). The motherboard does support ECC, but I am wondering if the CPU does too. The command dmidecode -t cache gives information about L1, L2, and L3: L1 and L2 show "Error Correction Type: Single-bit ECC", while L3 shows "Error Correction Type: Unknown". Given what I

X86 Are there multiple LDTs?

The following Wikibooks page states: "The GDT contains pointers to each LDT." I'm currently learning about segmentation, and this implies that there are multiple LDTs. As far as I can tell there is only one: multiple references I've read refer to "the LDT", implying a single table. Is the referenced page correct in its implication? Did it mean "LDT entry"?

X86 What causes the DTLB_LOAD_MISSES.WALK_* performance events to occur?

Consider the following loop: .loop: add rsi, STRIDE mov eax, dword [rsi] dec ebp jg .loop where STRIDE is some non-negative integer and rsi contains a pointer to a buffer defined in the bss section. This loop is the only loop in the code; that is, the buffer is not initialized or touched before the loop. On Linux, all of the 4K virtual pages of the buffer will be mapped on demand to the same physical page. I've run this code for all possible strides in the range

X86 Are there any scenarios where the VMM will fail to inject an interrupt into a guest on an interrupt-window exit?

I am working on a custom Type 2 hypervisor. My question is related to interrupt injection for emulated devices in the guest. Scenario: the guest did some vmexit; before the next vmresume, the VMM found that there is a pending interrupt present in the emulated interrupt controller. The VMM requests an Interrupt Window Exit (IWE) on the subsequent vmresume. Once we get an IWE, the VMM writes the interrupt info into the VM-entry interruption-information field (4016H) and resumes guest execution. Question: is it g

Do x86/x64 chips still use microprogramming?

If I understand these two articles, the Intel architecture, at its lowest level, has transitioned to using RISC instructions instead of the traditional CISC instruction set that Intel is known for: http://www.hardwaresecrets.com/article/235/4 http://www.tomshardware.com/reviews/intel,264-6.html If that's the case, are x86/x64 chips still microprogrammed, or do they use hardwired control like traditional RISC chips? I'm going to guess they're still microprogrammed but wanted to verify.

X86 SSE: convert short integer to float

I want to convert an array of unsigned short numbers to float using SSE. Let's say __m128i xVal; // Has 8 16-bit unsigned integers __m128 y1, y2; // 2 xmm registers for 8 float values I want the first 4 uint16 values in y1 and the next 4 in y2. I need to know which SSE intrinsics to use.
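One way to do it with SSE2 only, sketched below: zero-extend the eight uint16 values to two vectors of four int32 each by interleaving with zero, then convert to float. Zero-extension is the correct widening for unsigned inputs, so the signed int32-to-float conversion is safe:

#include <emmintrin.h>  /* SSE2 is enough */

void u16_to_float(__m128i xVal, __m128 *y1, __m128 *y2)
{
    __m128i zero = _mm_setzero_si128();
    __m128i lo = _mm_unpacklo_epi16(xVal, zero);  /* first 4 values */
    __m128i hi = _mm_unpackhi_epi16(xVal, zero);  /* last 4 values  */
    *y1 = _mm_cvtepi32_ps(lo);
    *y2 = _mm_cvtepi32_ps(hi);
}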

X86 Can someone annotate this machine code?

I'm attempting to start learning x86(-64) machine code because sometime in the future I want to write a compiler or JIT compiler (probably the latter first). I've written assembly for a while, so I'm not going into this blind; I'm just trying to learn the x86 instruction encoding/format, since it seems to be quite complex. I've seen tables and read articles and stuff (as well as some of the Intel manuals (the most inhuman documents I've ever read)). So I'm kind of starting to understand it, so I de

Explanation of x86 legacy instructions

I was reading a book on computer architecture to improve my understanding of microprocessors when I reached a stumbling block that the author didn't bother to explain. The book is concerned with Intel processors from the original Pentium upwards. The author never explained what x86 actually meant from processor to processor. I'm finding it difficult to understand because, during a discussion of the original Pentium, the author said that one of the drawbacks of the Pentium is that it assigned

Is there an advantage to putting x86 driver code in rings 1 and 2 instead of 0?

Drivers for monolithic kernels can be in rings 0, 1, or 2 (with microkernels they will be in ring 3, the user ring). Are there any advantages/disadvantages to putting driver code in ring 0 with the kernel, versus in the slightly less privileged rings 1 and 2? Rings 1 and 2 can still access supervisor pages, but they cannot run some special privileged instructions (if they do, they will raise a General Protection Fault, as with ring 3).

On x86, does enabling paging cause an "unconditional jump" (since EIP is now a virtual address)?

When paging is enabled by setting the paging bit in CR0 to 1, all pointers (including EIP) are from then on interpreted as virtual rather than physical addresses. Unless the region of memory the CPU is currently executing from is "identity mapped" (virtual addresses map to identical physical addresses), it seems this would cause the CPU to do what amounts to an "unconditional jump": it would start executing code from a different (physical) address. Does this actually happen? It seems

X86 Address translation in big real mode

I have some questions regarding how address translation happens in big real mode. As http://wiki.osdev.org/Unreal_Mode says: "Unreal mode consists of breaking the '64Kb' limit of real mode segments, but still keeping 16-bit instructions and segment*16+offset address formation, by tweaking the descriptor caches." But my question is how the GDT is used in the process, or whether it is even used at all, during translation to a linear address. If anyone can point to some specification or some other reference for

X86 Interaction between endianness and bitwise comparison with a register

I believe this piece of code works, but it really seems like it shouldn't, unless ax is being compared with 0000 1010 instead of 1010 0000. Isn't it supposed to not matter whether data is stored in little- or big-endian format? Here's the relevant bit of code: mov al, es:100h mov ah, 0 and ax, 0A0h cmp ax, 20h ... Isn't the value in ax something like this: mov al, es:100h ; ax = ???? ???? mov ah, 0 ; ax = 0000 ???? and ax, 0A0h ; ax = 0000 0000 or ax = 0000 ?
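A small C demonstration of the point at issue: arithmetic and bitwise instructions operate on values, so endianness is invisible to them; byte order only appears when a value is reinterpreted as individual bytes in memory. The constants mirror the question's and ax, 0A0h / cmp ax, 20h:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    uint16_t ax = 0x0025;  /* example value loaded into the register */

    /* AND and the compare see the numeric value; byte order plays no part. */
    printf("matches: %d\n", (ax & 0xA0) == 0x20);  /* prints 1 */

    /* Endianness appears only when the value is viewed as bytes in memory. */
    uint8_t bytes[2];
    memcpy(bytes, &ax, sizeof ax);
    printf("first byte in memory: %02x\n", bytes[0]);  /* 25 on little-endian */
    return 0;
}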

X86 If the reset vector is hardwired, why is a Start Segment Address record present in the binary?

I am working on an 80186XL-based embedded system. This system doesn't have any BIOS or OS. If the reset vector is hardwired at physical address FFFF0h, then why does a "Start Segment Address" record type appear in the binary (in Intel HEX format)? The following are the last 4 lines of the binary: :02000002FFFFFE :05000000EA0000D0FF42 :04000003FFD000002A :00000001FF Regards, Ajith

Printing multiple triangles on MASM x86 (16 bit)

So I am trying to write a program in MASM x86 (8086) that will print out a series of right triangles built of asterisks "*". I am using loops to print out the triangles. I am trying to make each of the triangles 3 to 9 asterisks high and the same number across, but in different configurations. I got it to only print one triangle though. After my 1st triangle is printed, it just keeps looping asterisks "*" indefinitely. Here is some of my code: mov ah, 09h ;prints string mov dx, offset input
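For reference, the loop structure in C: heights 3 through 9, one triangle per height. A common cause of the endless-asterisk symptom in 16-bit MASM is that nested LOOP instructions share CX, so the inner loop must save and restore the outer counter (push cx / pop cx):

#include <stdio.h>

int main(void)
{
    for (int height = 3; height <= 9; height++) {   /* one triangle per height */
        for (int row = 1; row <= height; row++) {   /* needs its own counter   */
            for (int col = 0; col < row; col++)     /* ...and so does this one */
                putchar('*');
            putchar('\n');
        }
        putchar('\n');
    }
    return 0;
}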

X86 How do the Conflict Detection instructions make it easier to vectorize loops?

The AVX512CD instruction families are VPCONFLICT, VPLZCNT and VPBROADCASTM. The Wikipedia section about these instructions says: "The instructions in AVX-512 conflict detection (AVX-512CD) are designed to help efficiently calculate conflict-free subsets of elements in loops that normally could not be safely vectorized." What are some examples that show these instructions being useful in vectorizing loops? It would be helpful if answers included scalar loops and their vectorized counterparts.
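The textbook case is a histogram update, shown below as a scalar C loop. It cannot be vectorized naively because two lanes of a gathered index vector may be equal, and their increments would collide; VPCONFLICTD reports which lanes repeat an earlier lane's index, so the conflicting lanes can be handled separately:

/* Scalar loop that plain SIMD cannot safely vectorize: when idx[i] and
   idx[j] are equal within one vector, a gather/add/scatter would lose
   one of the two increments. */
void histogram(int *hist, const int *idx, int n)
{
    for (int i = 0; i < n; i++)
        hist[idx[i]]++;   /* conflict whenever an index repeats in a vector */
}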

X86 How to correctly measure IPC (Instructions per cycle) with perf

I wonder how to measure instructions per cycle correctly using perf. As a reference, http://www2.engr.arizona.edu/~tosiron/papers/SPEC2017_ISPASS18.pdf used inst_retired.any and cpu_clk_unhalted.ref_tsc for its calculations, and I'm now wondering if this is the correct approach. In comparison, PAPI uses the hardware counters PAPI_TOT_INS and PAPI_TOT_CYC to calculate the IPC. After some measurements I concluded: inst_retired.any:u seems to be the same as PAPI_TOT_INS; cpu-cycles seems to be th

X86 VirtualBox - No bootable medium found

There are a lot of questions on Stack Overflow with a similar title. I read all of them, but none of them answers my problem, which is why I opened this question. I am creating an operating system in assembler and C. I found that I must compile the C code to binary format, extract the text section and save it as a file, then convert it to an ISO, then mount it in a virtual optical drive or diskette, and then load my OS in VirtualBox. So, that is a lot of work I want to avoid. I don't want to convert my binary f

X86 How to find the index of an element in the AVX vector?

I am trying to write a hardware-accelerated hash table using AVX, where each bucket has a fixed size (the AVX vector size). The question arose of how to implement a quick search within a vector. Incomplete possible solution: example target hash: 2 <1 7 8 9 2 6 3 5> // vector of hashes <2 2 2 2 2 2 2 2> // mask vector of target hash ------------------------ // equality comparison <0 0 0 0 -1 0 0 0> // result of comparison <0 1 2 3 4 5 6 7> //
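One way to finish the sketch, assuming AVX2 and buckets of 8 x 32-bit hashes: the comparison result is compressed to an 8-bit lane mask, and counting trailing zeros gives the index of the first match:

#include <immintrin.h>

int find_hash(__m256i hashes, int target)
{
    __m256i t  = _mm256_set1_epi32(target);           /* broadcast target   */
    __m256i eq = _mm256_cmpeq_epi32(hashes, t);       /* -1 in equal lanes  */
    int mask = _mm256_movemask_ps(_mm256_castsi256_ps(eq)); /* 1 bit/lane   */
    return mask ? __builtin_ctz(mask) : -1;           /* index of 1st match */
}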

X86 Who enables the A20 line when booting in pure UEFI?

Is this handled by the UEFI firmware or by the GRUB grubx64.efi bootloader? I looked at https://wiki.osdev.org/UEFI which claims: "UEFI firmware ... also prepares a protected mode environment with flat segmentation and, for x86-64 CPUs, a long mode environment with identity-mapped paging. The A20 gate is enabled as well." But I could not find any official sources to back up this information; the UEFI specification does not mention this. The Linux kernel provides an EFI stub that can act

X86 How Do I Put My Bootloader And Kernel On A USB

I've written a bootloader and basic kernel as a fun side project while I'm learning about 2-stage bootloaders. I want to load my bootloader at sector 1 (i.e., the MBR) of the USB stick and the kernel at sector 2. I've compiled both into Bootloader.bin & Kernel.bin using NASM. I just need a little help on actually writing them onto the USB. I have access to both Windows and Linux, so any answers are appreciated. Bootloader.asm [BITS 16] [ORG 0x7C00] ResetDisk: XOR AH, AH INT 0x13 JC ResetDisk ReadDisk: MOV
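On Linux, the usual tool for this is dd (assuming the stick appears as /dev/sdX and is unmounted; the device name here is a placeholder): dd if=Bootloader.bin of=/dev/sdX bs=512 count=1 conv=notrunc writes the boot sector, and dd if=Kernel.bin of=/dev/sdX bs=512 seek=1 conv=notrunc places the kernel at the second sector (seek=1 skips one 512-byte block on the output). Double-check the device name first, since dd will happily overwrite a hard disk.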

X86 Switching to User-mode using iret

I am writing a small OS that will execute some code in user mode (privilege level 3). From that user-level code, I want to call an interrupt back to the OS that prints a message. Right now I don't really care how my interrupt handler takes arguments or anything like that; I really just want an interrupt handler to inform me (the user) that the code has executed. My question is: how do I run code in user mode? I have a function that sets up a Local Descriptor Table with a code segment and data
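A common pattern is to build the frame that iret expects and "return" into ring 3 by hand. A hypothetical 32-bit GCC inline-asm sketch, assuming a flat GDT where 0x1B is the user code selector and 0x23 the user data selector (both RPL = 3); on an inter-privilege return, iret pops EIP, CS, EFLAGS, ESP and SS:

/* Never returns: execution continues at entry() in ring 3. */
void enter_user_mode(void (*entry)(void), void *user_stack)
{
    asm volatile(
        "mov  $0x23, %%ax\n\t"
        "mov  %%ax, %%ds\n\t"   /* data segments must be user selectors too */
        "mov  %%ax, %%es\n\t"
        "push $0x23\n\t"        /* SS     */
        "push %0\n\t"           /* ESP    */
        "pushf\n\t"             /* EFLAGS */
        "push $0x1B\n\t"        /* CS     */
        "push %1\n\t"           /* EIP    */
        "iret\n\t"
        : : "r"(user_stack), "r"(entry) : "ax", "memory");
}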

Repeat prefixes and mandatory prefixes in x86

In my quest to write a small disassembler for Linux specific to the x86 arch, I'm faced with a small issue with regard to mandatory prefixes and repeat prefixes. Looking at the Intel docs [1], repeat prefixes are 0xf2 or 0xf3, and mandatory prefixes are 0x66, 0xf2 or 0xf3. There are two instructions which have the following base opcodes: crc32 -- f2 0f 38 f0 (here, 0xf2 is a mandatory prefix) movbe -- 0f 38 f0 So, the opcodes of a 'movbe' instruction which has to rep

X86 NASM - Integer to String

I have a question regarding integer-to-string conversion in NASM. My question is how do I concatenate the digits such that the output contains all the digits when the loop terminates, rather than the last calculated digit? E.g. 1234 -> 1234 div 10 = remainder 4 => Buffer = "4"; 123 -> 123 div 10 = remainder 3 => Buffer = "3"; etc. My program just stores the last digit calculated ('1') and prints that. Here are my code files: #include <stdio.h> #include <stdlib.h> int ma
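For reference, the usual fix in C: since the digits come out least-significant first, write them into the buffer back to front and print from the last position written, instead of overwriting a single slot each iteration. The NASM version needs the same walk-the-pointer-backwards structure:

#include <stdio.h>

int main(void)
{
    unsigned n = 1234;
    char buf[12];
    char *p = buf + sizeof buf - 1;
    *p = '\0';
    do {
        *--p = '0' + n % 10;  /* store digits right to left */
        n /= 10;
    } while (n != 0);
    puts(p);  /* prints 1234 */
    return 0;
}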

Is x86 RISC or CISC?

According to Wikipedia, x86 is a CISC design, but I have also heard/read that it is RISC. Which is correct? I'd also like to know why it is CISC or RISC. What determines whether a design is RISC or CISC? Is it just the number of machine-language instructions a microprocessor has, or are there other characteristics that determine the architecture?

X86 Find and display the greatest common divisor of 2 input numbers in NASM

Good day. I have been trying to write a program that does the following: accept two ASCII numbers from a user (I haven't bothered trying to check the values yet); convert those ASCII numbers to decimal values; find the greatest common divisor of those two numbers; display the result. I feel as though I've successfully done steps 1-3, although I'm not entirely sure. Here is the code: bits 16 org 0x100; jmp main ; number1_str: db 3,0,0,0,0,'$' number2_str: db 3,0,0,0,0,'$' num1_hex: dw 2 nu
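For step 3, the reference algorithm is Euclid's: repeatedly replace the pair (a, b) with (b, a mod b) until b is zero, which maps directly onto NASM's div and a loop. A C sketch:

#include <stdio.h>

unsigned gcd(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned t = a % b;  /* the dx remainder after div in NASM */
        a = b;
        b = t;
    }
    return a;
}

int main(void)
{
    printf("%u\n", gcd(48, 36));  /* prints 12 */
    return 0;
}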

X86 Combination of VMX and non-VMX in multiple CPUs

In a dual- or multi-CPU configuration, is it possible for one CPU to be operating in VMX mode (root or non-root) and another CPU (or other CPUs) to be executing in legacy (non-VMX) mode?

X86 How can I get Rebol2 View running on Arch Linux?

There is a 64-bit version of Rebol2/Core available, but not of /View. If I try to execute the binary, it just says the file does not exist. What 32-bit libs do I need to install to get things running on Arch?

X86 Is NOT missing from SSE, AVX?

Is it my imagination, or is a PNOT instruction missing from SSE and AVX? That is, an instruction which flips every bit in the vector. If so, is there a better way of emulating it than PXOR with a vector of all 1s? It's quite annoying, since I need to set up a vector of all 1s to use that approach.
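It isn't your imagination; there is no PNOT, and XOR with all-ones is the standard idiom. The all-ones vector doesn't need to be loaded from memory, though: comparing a register with itself sets every bit, so the setup costs one instruction and no constant. A sketch:

#include <emmintrin.h>

__m128i not_si128(__m128i x)
{
    __m128i ones = _mm_cmpeq_epi32(x, x);  /* all bits set, no memory load */
    return _mm_xor_si128(x, ones);         /* pxor flips every bit         */
}

PANDN (_mm_andnot_si128) also has a built-in NOT of its first operand, which sometimes lets the inversion be folded into an existing AND.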

Why can't the load part of an atomic RMW instruction pass an earlier store to an unrelated location in the TSO (x86) memory consistency model?

It's known that the x86 architecture doesn't implement a sequentially consistent memory model because of its use of write buffers, so store->load reordering can take place (later loads can be committed while earlier stores still reside in the write buffer waiting to commit to the L1 cache). In A Primer on Memory Consistency and Coherence we can read about read-modify-write (RMW) operations in the Total Store Order (TSO) memory consistency model (which is supposed to be very similar to x86): ... we

X86 Hyperledger Sawtooth Lake -- Intel only or not?

I understand that Hyperledger Sawtooth Lake uses new secure CPU instructions to achieve Proof of Elapsed Time (PoET). Does this mean that Hyperledger Sawtooth Lake can only be used with Intel hardware? Can other chips be used?

X86 Why shouldn't I catch the Undefined Instruction exception instead of using CPUID?

Assume I want to use an instruction that may not be available, and this instruction is not one with a transparent fallback: it is an undefined instruction when it is not available. Say it is popcnt, for example. Can I, instead of using cpuid, just try to execute it? If it fails, I'll catch the exception, save this information in a bool variable, and use a different branch further on. Sure, there would be a performance penalty, but just once. Are there any additional disadvantages to this approach?
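The trap-and-catch approach can work in user space; a hedged POSIX sketch is below (GCC inline asm, SIGILL caught via sigsetjmp/siglongjmp). It also illustrates two of the extra drawbacks beyond the one-time cost: the handler and jump buffer are process-global state, and the probe is not thread-safe, unlike a cpuid check:

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>

static sigjmp_buf jb;

static void on_sigill(int sig)
{
    (void)sig;
    siglongjmp(jb, 1);  /* unwind out of the faulting instruction */
}

int have_popcnt(void)
{
    void (*old)(int) = signal(SIGILL, on_sigill);
    int ok = 0;
    if (sigsetjmp(jb, 1) == 0) {
        int out;
        /* Raises #UD -> SIGILL if popcnt is unsupported. */
        asm volatile("popcnt %1, %0" : "=r"(out) : "r"(1));
        ok = 1;
    }
    signal(SIGILL, old);  /* restore the previous handler */
    return ok;
}

int main(void)
{
    printf("popcnt: %s\n", have_popcnt() ? "yes" : "no");
    return 0;
}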
